
A Strong Baseline for Fashion Retrieval with Person Re-Identification Models


Fashion retrieval is the challenging task of finding an exact match for fashion items contained within an image. Difficulties arise from the fine-grained nature of clothing items and from very large intra-class and inter-class variance. Additionally, query and source images for the task usually come from different domains - street photos and catalogue photos respectively. Due to these differences, a significant gap in quality, lighting, contrast, background clutter and item presentation exists between domains. As a result, fashion retrieval is an active field of research both in academia and in industry. Inspired by recent advancements in Person Re-Identification research, we adapt leading ReID models to be used in fashion retrieval tasks. We introduce a simple baseline model for fashion retrieval, significantly outperforming previous state-of-the-art results despite a much simpler architecture. We conduct in-depth experiments on Street2Shop and DeepFashion datasets and validate our results. Finally, we propose a cross-domain (cross-dataset) evaluation method to test the robustness of fashion retrieval models.

We recently published a paper on fashion retrieval. In our research we achieved 54.8 mAP and 72.9 Acc@20 on Street2Shop, and 47.3 Acc@1 and 79.0 Acc@20 on DeepFashion, establishing new state-of-the-art results on both datasets. We would like to share some ideas with you directly from our AI Lab.
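For readers unfamiliar with the metrics quoted above, here is a minimal sketch of how Acc@k and mAP are commonly computed for retrieval. It is not the evaluation code from our paper: the function names and the toy data are hypothetical, and the sketch assumes each query's ranked list covers the whole gallery, so every true match appears somewhere in the list.

```python
def acc_at_k(ranked_relevance, k):
    """Fraction of queries with at least one correct match in the top k results.

    ranked_relevance: one list per query; entry i is True if the i-th
    ranked gallery item is a correct match for that query.
    """
    hits = sum(1 for rel in ranked_relevance if any(rel[:k]))
    return hits / len(ranked_relevance)


def average_precision(rel):
    """Average precision for a single query's ranked list."""
    hits = 0
    precision_sum = 0.0
    for rank, is_match in enumerate(rel, start=1):
        if is_match:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    # Assumption: all true matches are present in the ranked list.
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_relevance):
    """Mean of per-query average precision (mAP)."""
    total = sum(average_precision(rel) for rel in ranked_relevance)
    return total / len(ranked_relevance)


# Hypothetical toy data: three queries, three ranked gallery items each.
toy_rankings = [
    [True, False, False],
    [False, False, True],
    [False, True, True],
]
print(acc_at_k(toy_rankings, 1))
print(acc_at_k(toy_rankings, 2))
print(mean_average_precision(toy_rankings))
```

Acc@k only asks whether a correct item appears among the first k results, while mAP also rewards ranking every correct item as high as possible.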

In the last 3 years, there's been a flood of research papers in AI and Machine Learning published by Chinese universities and companies. Strong incentives from government and industry, combined with the challenges of scale (almost 1.4Bn citizens) push the frontier of research in China forward at an impressive pace.

Although the topics are quite diverse, one category of papers is overrepresented compared to the output of Western research institutions. These papers cover various topics, but they share a possible application: mass surveillance. [10],[9],[2],[6],[1],[5],[8]

One of the most popular applications of AI is Person Re-Identification, which means linking photos and videos from CCTV cameras to the identities of citizens. As you can imagine, the task is quite challenging due to various factors:

  • cameras have different quality
  • angles, lighting & visibility conditions vary
  • people can be partially occluded
  • people change clothes, put on hats, can wear sunglasses etc.
  • people travel between different areas, so location information is of limited usability
  • the human body is capable of many different poses and movements
  • much more variety is present in the real world than in synthetic datasets

Tracking the progress made by China is scary, yet fascinating. But can this state-sponsored research be re-purposed to directly benefit the end-user?

In our paper [7] we propose an approach adapted from mass surveillance which, with some modifications, outperforms all prior research in fashion visual search.

The problem of fashion retrieval / visual search sounds simple - given a user-made photo of a clothing item, automatically pick the most similar clothes from a store's assortment. The user may take a photo of a friend, take a photo of an item in a store, or upload a photo found on the Internet.


Intuitively, representation learning should "distill the essence of visual identity and similarity" of objects, and disregard all modifications & transformations of input, which do not change similarity or identity.

Some examples of transformations which do not change visual identity/similarity:

  • facial expressions - while they can deform the face, they do not change a person's identity
  • clothes deformability - a crumpled sweater is still the same sweater
  • lighting, brightness, contrast, etc. - objects remain the same, while they look different
  • view angles, rotations, focal length, image resolution - they change the photo, but have no effect on the objects themselves
Source: BU-3DFE dataset facial expressions

There are a lot more real-world transformations which can confuse ML models, but are naturally disregarded by people when evaluating "identity" or "similarity", e.g. weather conditions, mechanical transformations etc. Visual representation learning aims to be resistant to these transformations.

When it comes to fashion, we're interested in vector representations of clothes, where the same "fashion item" gets the same or similar vectors, regardless of where, when and how the photo was taken. In contrast, when it comes to mass surveillance, we'd be interested in vector representations of people, where the same person gets the same or similar vectors, regardless of where, when and how the photo was taken, what the person was wearing and what pose they were photographed in.
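To make the idea of "same or similar vectors" concrete, here is a minimal sketch of retrieval over such representations. It is an illustration, not our production system: the three-dimensional vectors are made up by hand (a real model would produce much longer embeddings), the item names are hypothetical, and cosine similarity is just one common choice of similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec, gallery, top_k=3):
    """Return the top_k (item_id, similarity) pairs, most similar first."""
    scored = [
        (item_id, cosine_similarity(query_vec, vec))
        for item_id, vec in gallery.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical catalogue embeddings (hand-made for illustration only).
gallery = {
    "red_sweater": [0.9, 0.1, 0.0],
    "blue_jeans": [0.0, 0.2, 0.9],
    "red_cardigan": [0.8, 0.3, 0.1],
}

# Hypothetical embedding of a street photo showing the red sweater.
query = [0.85, 0.15, 0.05]

for item_id, score in retrieve(query, gallery, top_k=2):
    print(item_id, round(score, 3))
```

If the representation is good, a crumpled or badly lit photo of the sweater still lands close to the catalogue photo of the same sweater, so a simple nearest-neighbour lookup is enough to find it.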

The two use-cases sound so much alike that it's quite surprising how little intellectual cross-pollination has happened between these areas until now.

In our paper [7], we examine in depth the similarities and differences between fashion and mass surveillance. Then we successfully transfer the latest research from Person Re-Identification to fashion retrieval for visual search.

While the Person ReID models require some adjustments to work well on fashion datasets, the final results are quite extraordinary. Our best approach outperforms all prior published research in fashion retrieval and establishes new state-of-the-art results on two commonly used datasets - DeepFashion and Street2Shop. The best model described in the paper is the foundation of our Visual Search product at Synerise, trained on our massive proprietary datasets.

What's especially worth noting is that our strong baseline model is much simpler than some of the recent fashion-specific approaches. The simplicity is apparent in the architecture, the training procedure and the computational resources required. This should serve as a reminder that good foundations, proper abstractions and picking the right problem to solve are often key to unlocking significant progress in research. As unlikely as it sounds, fashion and surveillance have a lot in common when viewed through the framework of representation learning.

Here are some example results of our best model:

For more details & nice pictures check out our paper with the appendix: [7].

Jacek Dąbrowski / Jarek Krolewski

[1] Dong, C. et al. 2019. DeepMEF: A Deep Model Ensemble Framework for Video Based Multi-modal Person Identification. Proceedings of the 27th ACM International Conference on Multimedia (Nice, France, Oct. 2019), 2531–2534.

[2] Guo, Y. et al. 2019. Multi-Scale Convolutional Recurrent Neural Network with Ensemble Method for Weakly Labeled Sound Event Detection. 2019 8th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) (Sep. 2019), 1–5.

[3] Kuang, Z. et al. 2019. Fashion Retrieval via Graph Reasoning Networks on a Similarity Pyramid. arXiv:1908.11754 [cs]. (Aug. 2019).

[4] Kucer, M. and Murray, N. 2019. A Detect-Then-Retrieve Model for Multi-Domain Fashion Item Retrieval. CVPR Workshops. 10.

[5] Nie, J. et al. 2019. Understanding personality of portrait by social embedding visual features. Multimedia Tools and Applications. 78, 1 (Jan. 2019), 727–746.

[6] Song, W. et al. 2019. Partial Attribute-Driven Video Person Re-Identification. 2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI) (Nov. 2019), 539–546.

[7] Wieczorek, M. et al. 2020. A Strong Baseline for Fashion Retrieval with Person Re-Identification Models. arXiv:2003.04094 [cs]. (Mar. 2020).

[8] Wu, L. et al. 2019. A Neural Influence Diffusion Model for Social Recommendation. Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (Paris, France, Jul. 2019), 235–244.

[9] Zhang, X. et al. 2019. TVV: Real-Time Visual Identity and Tracking with Edge Computing. Proceedings of the 2019 International Conference on Embedded Wireless Systems and Networks (Beijing, China, Mar. 2019), 419–424.

[10] Zhang, Z. et al. 2018. Billion-Scale Network Embedding with Iterative Random Projection. 2018 IEEE International Conference on Data Mining (ICDM) (Nov. 2018), 787–796.