Supplementary material for: A Strong Baseline for Fashion Retrieval with Person Re-Identification models

This article is Supplementary material for: A Strong Baseline for Fashion Retrieval with Person Re-Identification models

In the following sections of the supplementary material we provide definitions of the loss functions used in the fashion retrieval domain (Section 1) along with the metrics used (Section 2). Then we describe the two open datasets used in our tests, both of which are commonly used in the domain (Section 3). We also demonstrate a few problems with the datasets and describe our adjustments, which let the community use them more conveniently in the popular COCO format.

Finally, we describe our code improvements (Section 4) and show a few example outputs of our models, with and without the re-ranking technique (Section 5).

1 Loss functions

Two loss functions are commonly used in the image retrieval task: a classification loss and a triplet loss. Most works in the image retrieval domain train deep learning models with a combination of the two. The classification loss is used to identify the exact id of a person/garment (i.e. images of the same person/garment have the same id). It is a standard loss in classification tasks; in our case it is the cross-entropy loss. All of the models we considered use at least one of these two loss functions.

Deep metric learning treats the image retrieval problem somewhat as a clustering or ranking problem and optimizes the mutual arrangement of the embeddings in space. One of the most widely used approaches, which pulls embeddings of the same class close together and pushes embeddings of other classes away, is to train the neural network with a triplet loss [26].

Triplet loss is formulated as follows:
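In the A, P, N notation explained below, the standard triplet loss of [26] can be written as follows. This is a sketch of the usual form: the margin symbol $\alpha$ and the squared Euclidean distance are assumptions taken from the common formulation, not from this text.

```latex
\begin{equation}
L_{triplet} = \left[ \lVert f(A) - f(P) \rVert_2^2 - \lVert f(A) - f(N) \rVert_2^2 + \alpha \right]_+
\tag{1}
\end{equation}
```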

where [z]+ = max(z, 0) and f denotes learnt embedding function applied to all data points.

A triplet consists of an anchor image (a query) A, a positive example P, which is another image of the same object (in this paper, a clothing item) present in image A, and a negative sample N, which is an image of a different object from the one shown in image A.

Training a neural network with the triplet loss minimizes the intra-class distance (between anchor and positive samples) and maximizes the inter-class distance (between anchor and negative samples). The triplet loss proved to achieve state-of-the-art performance and became a standard in similarity learning tasks [33], [26], [9].

In the triplet loss strategy, the method of creating triplets is an important part of the training and influences the model performance immensely [36]. The authors of [26] used semi-hard triplets, in which the negative sample is further away from the anchor than the positive sample, but the loss value is still positive and therefore still allows learning.

Most of the works we examined use online hard negative sampling to form a training triplet. This method selects data points such that the negative sample is closer to the anchor than the positive sample. As a result, the neural network is given only the triplets that maximize the value of the loss function, which is why the method is called 'hard'. This method of creating triplets proved to perform better than other sampling methods and is used in numerous works [27], [40], [22], [31].
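Online hard sampling can be illustrated with the "batch-hard" variant, in which each anchor is paired with its farthest positive and its closest negative inside the mini-batch. The sketch below is a minimal pure-Python illustration, not the implementation used in the cited works; the function names, the margin value of 0.3 and the choice of squared Euclidean distance are assumptions.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def batch_hard_triplet_loss(embeddings, labels, margin=0.3):
    """Mean triplet loss using the hardest positive and negative per anchor."""
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        # Distances to all other samples with the same id (positives).
        pos = [sq_dist(anchor, e) for j, e in enumerate(embeddings)
               if j != i and labels[j] == label]
        # Distances to all samples with a different id (negatives).
        neg = [sq_dist(anchor, e) for j, e in enumerate(embeddings)
               if labels[j] != label]
        if not pos or not neg:
            continue  # this anchor cannot form a triplet in the batch
        # Hardest positive is the farthest one, hardest negative the closest.
        losses.append(max(max(pos) - min(neg) + margin, 0.0))
    return sum(losses) / len(losses) if losses else 0.0
```

When every id forms a tight, well-separated cluster, each hinge term is clipped to zero and the batch contributes no gradient, which is why the choice of triplets matters so much for training.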

To further improve the triplet loss, some authors extend the number of tuples in the loss [1], [28], [34] and/or propose novel sampling methods [8], [36]. However, the reported improvements are small, so we did not use them in our experiments.

The triplet loss, and the n-tuple loss in general, aims to properly arrange embeddings in an n-dimensional space. While the triplet loss is a common choice in retrieval/ranking tasks, we also examined the quadruplet loss and its influence on the performance. Our implementation of the quadruplet loss follows the one found in [1]:
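A form consistent with the description that follows, based on the quadruplet loss of [1] and written in the notation of Equation 1, would be as shown here. It is a sketch: the use of squared Euclidean distances between embeddings is an assumption.

```latex
\begin{equation}
\begin{split}
L_{quad} ={} & \left[ \lVert f(A) - f(P) \rVert_2^2 - \lVert f(A) - f(N_1) \rVert_2^2 + \alpha_1 \right]_+ \\
           & + \left[ \lVert f(A) - f(P) \rVert_2^2 - \lVert f(N_1) - f(N_2) \rVert_2^2 + \alpha_2 \right]_+
\end{split}
\tag{2}
\end{equation}
```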

where the first term is the same as in Equation 1 and thus takes care of the pull-push relation between the anchor, positive and negative samples. The second term demands that the intra-class distance be smaller than the maximum inter-class distance with respect to a different probe, N2 (N2 ≠ N1, i.e. N2 and N1 represent different garments/ids). α1 and α2 are the margin values, which are set dynamically as in [1].

In [22] the center loss [35] is additionally used as one of the training tricks. It aims to pull embeddings of the same class together, because the n-tuple loss considers only the relative distances between embeddings and neglects their absolute values. The center loss alleviates this problem by penalizing the distance between embeddings and their id/class center. The formula for the center loss is as follows:
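Using the symbols defined in the next sentence, the center loss is commonly written in the following form (a sketch of the usual formulation; the factor of one half is taken from that common form, not from this text):

```latex
\begin{equation}
L_{C} = \frac{1}{2} \sum_{j=1}^{B} \left\lVert f_{t_j} - c_{y_j} \right\rVert_2^2
\tag{3}
\end{equation}
```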

where y_j denotes the label of the j-th image in the mini-batch, B is the batch size, f_{t_j} is the embedding of the j-th image and c_{y_j} is the feature center of the y_j-th class.

2 Metrics in fashion retrieval

To evaluate the performance of our approach we used the metrics that we found most often in the related papers. The most widely used metric in retrieval tasks is Accuracy@k (Acc@k), formulated as:

\[ \mathrm{Acc@}k = \frac{1}{N} \sum_{q=1}^{N} \mathbb{1}\left[ S_q^{+} \cap S_q^{k} \neq \emptyset \right] \]

where N is the number of queries, S_q^+ is the set of ground-truth matches for query q, S_q^k is the set of its top-k retrieved results, and 1[·] is an indicator function that evaluates to 1 if a ground-truth image is within the top-k retrieved results. k is usually set between 1 and 20. The metric measures whether the correct item was among the top-k proposals.

The second metric that we encountered in the papers is mean average precision (mAP), which shows how well the retrieval is done on average. Although mAP values were rarely reported in clothes retrieval papers, we decided to use this metric in our experiments alongside Acc@k.
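Both metrics can be computed from ranked retrieval lists. The sketch below is a minimal pure-Python illustration with hypothetical function names; it assumes each query has a list of gallery ids sorted by decreasing similarity and a non-empty set of ground-truth ids, and it uses the common definition of average precision (precision averaged over the ranks of the correct items), which may differ in detail from the evaluation code used in the experiments.

```python
def accuracy_at_k(ranked_ids, true_ids, k):
    """Fraction of queries with at least one correct item in the top-k."""
    hits = sum(1 for ranked, truth in zip(ranked_ids, true_ids)
               if set(ranked[:k]) & set(truth))
    return hits / len(ranked_ids)


def average_precision(ranked, truth):
    """Average of the precision values at the rank of each correct item."""
    truth = set(truth)
    hits, precisions = 0, []
    for rank, item in enumerate(ranked, start=1):
        if item in truth:
            hits += 1
            precisions.append(hits / rank)
    # Normalizing by len(truth) assumes the ranking covers the full gallery.
    return sum(precisions) / len(truth) if truth else 0.0


def mean_average_precision(ranked_ids, true_ids):
    """Mean of the per-query average precision."""
    aps = [average_precision(r, t) for r, t in zip(ranked_ids, true_ids)]
    return sum(aps) / len(aps)
```

Acc@k only checks whether any correct item appears in the top k, while mAP also rewards ranking all correct items early, so the two can move in different directions for queries with several matching shop images.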

3 Datasets

In this section we describe datasets used for evaluation. Apart from describing their statistics, we also explain the process of reformatting them and how they were processed during our experiments.

3.1 Street2Shop

The dataset was introduced by [14] and became one of the most widely used datasets for evaluating clothes retrieval solutions. Many works therefore report their results on it, which provides a strong benchmark for our methods. It contains 404,683 shop photos and 20,357 street photos depicting 204,795 distinct clothing items [14]. To allow compatibility across the datasets and models we tested, we transformed the Street2Shop dataset to the COCO format, while keeping the original train/test split and categorization. Annotations in the COCO format are available on our GitHub.

In contrast to some authors [15], we decided not to perform any data cleaning or hand-curation of images/annotations, even though we encountered some erroneous annotations, such as multiple annotations for a single-item image or bounding boxes placed in 'random' places (see examples in Fig. ).

We made this decision to allow a fair comparison with [16], which we found to have the best performance on the Street2Shop dataset and which does not mention any data cleaning.

3.2 DeepFashion

The dataset contributed by [21] contains over 800,000 images, but for our task we only used the Consumer-to-shop Clothes Retrieval subset, which consists of 33,881 distinct clothing items and a total of 239,557 images, creating 195,540 pairs. We used the results found in [3] as our benchmark, since they were the best we found.

Like the Street2Shop dataset, DeepFashion is not free from defects, which we show in Figures 1, 2, 3 and 4.

In the Consumer-to-shop subset of DeepFashion, we found that the same products, and even the same images, were assigned different product identifiers. As a result, the measured retrieval performance understates the real performance. Two examples of such errors are presented in Figures 1-4.

Fig. 1. A collection of photos that depict, we believe, the same clothing item: the dress visible in the leftmost photo. Above each photo there are four pieces of information; from the top: item id, file name, category, subset name. Although item ids should be unique for distinct garments, it seems that the same item has various ids assigned, which results in the erroneous retrieval results presented in Figure 2.

Fig. 2. Examples of retrieval for the query images with the product ids presented in Figure 1, produced by our best model on 320x320 images. The images in the first column are query images, while the images to their right are the retrieval results, with similarity decreasing towards the right. Retrieved images with a green border are true matches to the query. The top 10 most similar retrieved images are shown. Some of the retrieved images are just mirrored copies of the same image, yet only one of them is deemed a true match. We believe this is an error in data annotation, which understates the real retrieval performance.

Fig. 3. A collection of photos that depict, we believe, the same clothing item: the top visible in the leftmost photo. Above each photo there are four pieces of information; from the top: item id, file name, category, subset name. Although item ids should be unique for distinct garments, it seems that the same item has various ids assigned, which results in the erroneous retrieval results presented in Figure 4. Interestingly, the rightmost photo depicts a top that is plain white, which also seems to be incorrect compared to the rest.

Fig. 4. Examples of retrieval for the query images with the product ids presented in Figure 3, produced by our best model on 320x320 images. The images in the first column are query images, while the images to their right are the retrieval results, with similarity decreasing towards the right. Retrieved images with a green border are true matches to the query. The top 10 most similar retrieved images are shown. In all results the first and second retrieved images are the same, yet in every case only the latter is marked as correct, even though the former is also from the 'top' category and thus seems to depict exactly the same garment.

4 Code improvements

As we mentioned in the main text, we encountered problems with vRAM and RAM overflow caused by the size of the datasets we tested. The RST model contains fully connected layers used for classification of identities/clothes. While the ReID datasets the RST model was tested on, Market-1501 and DukeMTMC-reid, contain 1501 and 1812 unique identities, the fashion datasets have an order of magnitude more identities (clothes), roughly 10000-15000. As a result, the FC layer needs thousands of neurons instead of hundreds and thus requires much more video RAM (vRAM). To address this problem we introduced two independent solutions:

  1. Gradient accumulation - it allows training with a large effective mini-batch in constrained vRAM settings, when the desired mini-batch does not fit into GPU memory and a mini-batch that does fit is too small, which makes gradient descent volatile and can prevent the model from converging. Gradient accumulation splits the original mini-batch into sub-mini-batches, feeds them through the network and computes their gradients, but the model weights are updated only after all sub-mini-batches have been processed.
  2. ModelParallel mode - a distributed training technique that splits a single model across different devices. The data and gradients are moved between devices during the forward and backward passes. It was implemented to allow us to test larger backbones that would not fit into a single GPU and to use larger mini-batches, and thus speed up the training.
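The idea behind gradient accumulation can be checked on a toy problem. The sketch below is a framework-free illustration, not the training code used in the experiments: it assumes a one-parameter linear model with a mean squared error loss, and all function names are hypothetical. Each sub-mini-batch gradient is weighted by its share of the full mini-batch, so the single update at the end matches the update that the full mini-batch would have produced.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of y ~ w * x over (x, y) pairs."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_step(w, mini_batch, sub_batch_size, lr):
    """One weight update with the gradient accumulated over sub-mini-batches."""
    grad = 0.0
    for start in range(0, len(mini_batch), sub_batch_size):
        sub = mini_batch[start:start + sub_batch_size]
        # Weight each sub-mini-batch by its share of the full mini-batch.
        grad += mse_gradient(w, sub) * len(sub) / len(mini_batch)
    # The weights are updated once, after all sub-mini-batches were processed.
    return w - lr * grad
```

Only one sub-mini-batch has to be held in memory at a time, which is what makes the technique useful when vRAM is the limiting factor. Layers that depend on batch statistics, such as batch normalization, still see only the sub-mini-batch, so the equivalence is exact only for models without such layers.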

In the code version we used, ResNet-50-IBN backbones were not yet implemented, so we implemented both the A and B variants ourselves, based on the original implementation. Additionally, we extended the available backbones with the whole EfficientNet family, based on the implementations available here.

During the evaluation step, especially when performing re-ranking, we encountered a problem with RAM consumption. Again, the problem arises from the size of the clothes retrieval datasets we tested. To tackle the problem we introduced three solutions:

  1. We introduced batch processing both when creating image embeddings, to avoid vRAM overflow, and when computing the distance matrix for tens of thousands of images. Originally, both operations were performed in one go.
  2. Conducting evaluation with re-ranking for single categories of Street2Shop was still problematic due to large RAM requirements, so we used batch processing again, but this time we appended the intermediate results from the batch computations of the distance matrix to a file on a hard drive. During re-ranking itself we used the NumPy function memmap to avoid reading the whole matrix into RAM, while still allowing RAM-like processing.
  3. Unfortunately, evaluation with re-ranking and without the category constraint was still impossible for Street2Shop, as the whole distance matrix, over 400,000x400,000 floats, was too big even when using the memmap function. We decided to conservatively estimate the results by calculating a weighted average over categories and deducting a penalty term. The penalty term was computed for each metric separately, using results from the model variant without re-ranking, as the maximum difference between the weighted average over categories and the Unconstrained Retrieval values.

Finally, we added Accuracy@k computation for a specified k and a script that creates a visualization of the results at the end of the evaluation. It plots the query image and the top-k retrieved images.

5 Our result examples

Fig. 5. Examples of retrieval on the DeepFashion dataset produced by our best model on 320x320 images. The images in the first column are query images, while the images to their right are the retrieval results, with similarity decreasing towards the right. Retrieved images with a green border are true matches to the query. The top 10 most similar retrieved images are shown. Two retrieval results are shown for each query image, one without and one with re-ranking. The top result in each pair is the one without re-ranking.
Fig. 6. More retrieval results on DeepFashion dataset without and with re-ranking.
Fig. 7. More retrieval results on DeepFashion dataset without and with re-ranking.
Fig. 8. Examples of retrieval on the Street2Shop dataset produced by our best model on 320x320 images. The images in the first column are query images, while the images to their right are the retrieval results, with similarity decreasing towards the right. Retrieved images with a green border are true matches to the query. The top 10 most similar retrieved images are shown. Two retrieval results are shown for each query image, one without and one with re-ranking. The top result in each pair is the one without re-ranking.
Fig. 9. More retrieval results on Street2Shop dataset without and with re-ranking.
Fig. 10. More retrieval results on Street2Shop dataset without and with re-ranking.

Authors:

Mikolaj Wieczorek, Synerise
Andrzej Michalowski, Synerise
Jacek Dabrowski, Synerise [0000-0002-1581-2365]
Anna Wroblewska, Synerise and Faculty of Mathematics and Information Science, Warsaw University of Technology [0000-0002-3407-7570]