visit
A Similar Image Retrieval (aka Content-Based Image Retrieval or CBIR) is any search that involves images.
The picture above from the article
Nowadays, the "Search by photo" approach is being used actively, in particular, in e-commerce services (AliExpress, Wildberries, etc.). "Keyword search" (with an understanding of the content of images) has long settled in the search engines Google, Yandex, etc., but has not yet reached marketplaces and other private search engines. I think that since the appearance of the sensational
Step 1. Train the model. The model can be made on the classic CV or on the basis of a neural network. Model input - image, output - D-dimensional descriptor/embedding. In the case of the classics, it can be a combination of SIFT-descriptor + Bag of Visual Words. In the case of a neural network, a standard backbone like ResNet, EfficientNet, etc. + intricate pooling layers + clever learning techniques, which we will talk about later. I can say that if you have enough data or good practice, neural networks will almost always benefit greatly (we checked), so we will focus on them.
Step 2. Indexing the base of images. Indexing is a running of the trained model on all images and writing embeddings to a special index for quick search.
Step 3. Search. Using the image uploaded by the user, the model is run, the embedding is obtained and this embedding is compared with the rest in the database. The search result is the search results sorted by relevance.
The neural network in the task of finding similarities is used as a feature extractor (backbone). The choice of backbone depends on the volume and complexity of the data - you can consider everything from
The first feature of the models in Image Retrieval is the magic in the head of the neural network. On the____, they are fighting for building the best descriptors - there are Combined Global Descriptors with parallel pools and Batch Drop Block for a more even distribution of activation over the output feature map.
The second main feature is the loss functions. There are a lot of them. In
This is a double loss, i.e. objects are compared by the distance between each other.
The neural network is penalized for the distance from each other of the embeddings of images p and q if these images are actually similar. Similarly, there is a penalty for the proximity of embeddings, the images of which are actually different from each other. In this case, in the latter case, we set the boundary m (for example, 0.5), overcoming which, we believe that the neural network has coped with the task of "separating" dissimilar images. Here we aim to minimize the distance from the anchor to the positive and maximize the distance from the anchor to the negative. Triplet Loss was first introduced in Google's
N-tupled Loss- a development of Triplet Loss, which also takes anchor and positive, but uses multiple negatives instead of one negative.
The problem of paired losses is in choosing combinations of positives, negatives, and anchors - if they are just taken uniformly random from the dataset, then the problem of "light pairs" arises. These are such simple pairs of images for which the loss will be 0. It turns out that the network converges quickly enough to a state in which most of the elements in the batch will be "easy" for it, and the loss for them will turn out to be zero - the network will stop learning. To avoid this problem, they began to come up with sophisticated mining techniques for pairs - hard negative and hard positive mining. You can read more about the problem in . There is also a , which implements many mining methods, and in general, the library contains a lot of useful information on the Metric Learning task on PyTorch.
Another solution to the problem is classification losses. Consider one popular bug feature that led to state-of-the-art facial recognition three years ago - .
Here I visually show you which loss functions are best used when you have single-class and multi-class markup (you can derive paired markup from the latter by counting the percentage of intersection between multilabel vectors of examples).
is a pooling layer that accepts the output map of the neural network (before the global pooling or classification layers) and returns a descriptor vector calculated as the sum of activations in different windows of the output map. Here, the activation of the window is to take the maximum for this window for each channel independently.
Generalized Mean (GeM) is a simple pooling that can improve the quality of the output descriptor. The bottom line is that the classical average pooling can be generalized to the lambda norm. By increasing the lambda, we make the network focus on significant parts of the image, which can be important in certain tasks.
These problems can be solved at the expense of quality - to store embeddings, not in their original form, but compressed (quantized). And also to change the search strategy - not to search for brute-force, but to try for the minimum number of comparisons to find the required number of those closest to the given query. There are a large number of efficient frameworks for approximate search for the nearest ones. A special
Most Popular:
One such method is Query Expansion. The idea is to use the top-k of the closest elements to generate a new embedding. In the simplest case, you can take the averaged vector, as shown in the picture above. You can also weight embeddings, for example, by distance in the issue or cosine distance from the request. Such improvements are described in a single framework in the article
k-reciprocal - a set of elements from top-k, including the k nearest of which is the request itself. On the basis of this set, the process of re-ranking the results is built, one of which is described in the article
Pros
Cons
Same as precision@k, where k is set equal to the number of relevant queries. Pros
Shows what proportion of relevant items were found in top-k Pros
You can read more about metrics in Information Retrieval, including the mAP output,
This metric shows how correctly the elements in top-k are ordered among themselves. We will not consider the pros and cons of this metric, since this is the only metric in our list that takes into account the order of elements. However, there are studies showing that this metric is fairly stable when it is necessary to take into account the order and can be suitable in most cases.
macro: a metric is calculated for each request, averaging over all requests
pros: there are no significant fluctuations with a different number of relevant to this query cons: all queries are considered equal, even if some are more relevant than others
micro: the number of tagged relevant and separately successfully found relevant is summed over all queries, then participates in the fraction of the corresponding metric
pros: queries are evaluated taking into account the number of tagged relevant for each of them cons: the metric can become very low / very high if there were a lot of tagged relevant ones for some request and the system unsuccessfully / successfully brought them to the topInput: image requests and images relevant to them. There is markup in the form of a list relevant to this query. To calculate metrics, you can calculate the relevance matrix each with each and, based on the binary information about the relevance of the elements, calculate the metrics.
Input: image requests, and images relevant to them. There should also be a validation database of images, in which, ideally, all relevant queries are marked. Also, it should not contain query images, otherwise, they will have to be cleaned at the search stage so that they do not clog the top-1. The validation base is involved as a base of negatives - our task is to draw out those that are relevant to it. To calculate metrics, you can go through all the requests, calculate the distances to all elements, including relevant ones, and send them to the metrics calculation function.
Some companies argue with other companies that the latter should not use the pictorial elements of the former's brand. In such cases, a weaker manufacturer tries to parasitize on the success of a large brand by presenting its products and services under similar symbols. Users also suffer from this - you can mistakenly buy cheese from the wrong manufacturer, which you already trust, but take a fake product without carefully reading the label. An example from a recent one:
There are special companies, governmental and private, to fight illegal immigration. They have a register of registered trademarks by which they can compare new incoming trademarks and approve / reject an application for the registration of a trademark. Above is an example of the interface of the