Learning Embeddings for Fashion Images

Thesis project owned by RISE
Author: Simon Hermansson

University: Linköping University


Evaluating the price of second-hand clothes is vital for deciding whether an item of clothing is worth reselling, exporting to another country, or recycling. Human sorters use many different criteria when evaluating the price of clothing, such as brand, condition, type, and knowledge of current trends. This is a time-consuming and likely error-prone process. This thesis investigates whether deep learning can be used to automate and improve the evaluation process.

This thesis is conducted at RISE Research Institutes of Sweden as part of a project exploring the use of AI for resource-efficient circular fashion. Aside from price prediction, an additional goal is to ease the manual evaluation process by developing an image search engine for second-hand clothes. Given a query image or query text, this search engine will retrieve images of similar clothes together with their previously estimated prices. Human sorters will use this tool to help with manual sorting.
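As an illustration of how such a search engine could rank results, the sketch below retrieves the nearest garments by cosine similarity between embeddings and returns their stored prices. It is a minimal sketch, not the thesis implementation: the function name `top_k_similar`, the two-dimensional toy embeddings, and the prices are all hypothetical, and in practice the embeddings would come from a trained image or text encoder.

```python
import numpy as np

def top_k_similar(query, gallery, prices, k=3):
    """Return (index, cosine similarity, price) for the k gallery items closest to the query."""
    # Normalize so that the dot product equals cosine similarity.
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = g @ q
    # Sort by descending similarity and keep the first k indices.
    order = np.argsort(-sims)[:k]
    return [(int(i), float(sims[i]), float(prices[i])) for i in order]

# Hypothetical toy data: four garment embeddings and their previously estimated prices.
gallery = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
prices = np.array([120.0, 100.0, 40.0, 15.0])

# The query embedding could come from an image or, with a multi-modal model, from text.
query = np.array([2.0, 0.0])
results = top_k_similar(query, gallery, prices, k=2)
for index, similarity, price in results:
    print(index, round(similarity, 4), price)
```

Because both sides are normalized, the length of the query vector does not affect the ranking; only its direction in the embedding space matters.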

Training a machine learning model to learn fine-grained details such as brand and condition requires as much data as possible. Several datasets containing images of second-hand clothes are available, but their sizes still pale in comparison to commonly used datasets such as ImageNet, which has millions of images. This is why transfer learning, self-supervised learning, and natural language supervised learning are explored in this thesis. In the self-supervised learning paradigm, a model can be trained on unlabeled data to create a latent space in which embeddings of similar images lie close to each other and embeddings of dissimilar images lie far apart. With natural language supervised learning, models instead learn from image-text pairs in which the text describes the contents of the image. The weights obtained from either training process can then be fine-tuned for a specific task using supervised learning on a labeled dataset.
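One common way to learn from image-text pairs is a symmetric contrastive objective of the kind described for CLIP, which pulls matching image and text embeddings together and pushes mismatched ones apart. The sketch below shows that idea in NumPy. It is an assumption-laden illustration rather than the training code used in the thesis: the function names, the temperature value of 0.07, and the toy embeddings are hypothetical, and the summary above does not state which loss was actually used.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric cross-entropy over image-text similarities; row i of each input is a pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    # Entry (i, j) is the scaled cosine similarity between image i and text j.
    logits = img @ txt.T / temperature
    diag = np.arange(len(logits))
    # Image-to-text: each image should assign the highest probability to its own text.
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    # Text-to-image: each text should assign the highest probability to its own image.
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return float((loss_i2t + loss_t2i) / 2)

# Hypothetical toy embeddings: three perfectly aligned pairs versus shuffled pairs.
aligned = np.eye(3)
loss_matched = contrastive_loss(aligned, aligned)
loss_shuffled = contrastive_loss(aligned, aligned[[1, 2, 0]])
print(loss_matched, loss_shuffled)
```

The loss is close to zero when every image is most similar to its own text and grows when the pairs are mismatched, which is the signal that shapes the latent space during training.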


Two models were examined: CLIP, a multi-modal model, and MAE, a self-supervised model. Quantitatively, the results favored CLIP, which outperformed MAE in both image retrieval and prediction. However, MAE may still be useful for some image retrieval applications, as it returns items that look similar even if they do not necessarily share the same attributes. In contrast, CLIP is better at retrieving garments with as many matching attributes as possible.

Image retrieval results were generally very good: similar garments could be retrieved using either images or, with CLIP, text. Price and intended usage prediction were more difficult, but both improved over a random model, and performance increased further when additional attributes, such as the brand of the garment, were added as input. Since the main dataset is still being collected, evaluation was done on a subset of it and will be repeated once the entire dataset is complete.

Link to thesis on DiVA