GLAMI-1M: A Multilingual Image-Text Fashion Dataset

Vaclav Kosar (GLAMI), Antonín Hoskovec (GLAMI), Milan Šulc (Rossum.ai), Radek Bartyzal (GLAMI)

The 33rd British Machine Vision Conference

Abstract

We introduce GLAMI-1M: the largest multilingual image-text classification dataset and benchmark. The dataset contains images of fashion products paired with item descriptions, each description written in one of 13 languages. The categorization into 191 classes has high-quality annotations: all 100k images in the test set and 75% of the 1M training set were human-labeled. The paper presents baselines for image-text classification, showing that the dataset poses a challenging fine-grained classification problem: the best-scoring EmbraceNet model, which uses both visual and textual features, achieves 69.7% accuracy. Experiments with a modified Imagen model show that the dataset is also suitable for text-conditioned image generation.
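The abstract names EmbraceNet as the strongest baseline but does not say how it combines the two modalities. Below is a minimal NumPy sketch of the general EmbraceNet fusion idea as we understand it from the original EmbraceNet design: each modality passes through a "docking" layer that maps it to a shared size c, and an "embracement" step then builds the fused vector by taking, for each feature index, the value from one randomly chosen modality. This is not the GLAMI-1M authors' implementation. The helper names `dock` and `embrace`, the feature sizes, and the random untrained weights are hypothetical; only the 191-class output size comes from the abstract.

```python
import numpy as np

NUM_CLASSES = 191  # GLAMI-1M category count, per the abstract


def dock(x, weight, bias):
    """Hypothetical docking layer: project one modality to the shared size c (linear + ReLU)."""
    return np.maximum(x @ weight + bias, 0.0)


def embrace(features, probs=None, rng=None):
    """Hypothetical embracement step: for every sample and feature index k,
    draw one modality from a multinomial over `probs` and copy its k-th value.

    features: list of m arrays, each of shape (batch, c), already docked.
    probs: per-modality selection probabilities; uniform if None.
    """
    rng = np.random.default_rng() if rng is None else rng
    stacked = np.stack(features, axis=1)  # (batch, m, c)
    batch, m, c = stacked.shape
    p = np.full(m, 1.0 / m) if probs is None else np.asarray(probs, dtype=float)
    choice = rng.choice(m, size=(batch, c), p=p)  # chosen modality index per element
    # out[b, k] = stacked[b, choice[b, k], k]
    return np.take_along_axis(stacked, choice[:, None, :], axis=1)[:, 0, :]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch, img_dim, txt_dim, c = 4, 512, 300, 256  # hypothetical sizes
    img_feat = rng.normal(size=(batch, img_dim))  # stand-in for image-encoder output
    txt_feat = rng.normal(size=(batch, txt_dim))  # stand-in for text-encoder output

    # Random, untrained weights: this only demonstrates shapes and data flow.
    w_img, b_img = rng.normal(size=(img_dim, c)) * 0.05, np.zeros(c)
    w_txt, b_txt = rng.normal(size=(txt_dim, c)) * 0.05, np.zeros(c)
    w_cls, b_cls = rng.normal(size=(c, NUM_CLASSES)) * 0.05, np.zeros(NUM_CLASSES)

    fused = embrace([dock(img_feat, w_img, b_img), dock(txt_feat, w_txt, b_txt)], rng=rng)
    logits = fused @ w_cls + b_cls  # one score per category
    print(fused.shape, logits.shape, logits.argmax(axis=1))
```

Setting a modality's entry in `probs` to zero (for example, for a product with a missing description) makes the fused vector draw only from the remaining modality; to our understanding, this is how EmbraceNet tolerates missing inputs.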


Citation

@inproceedings{Kosar_2022_BMVC,
author    = {Vaclav Kosar and Antonín Hoskovec and Milan Šulc and Radek Bartyzal},
title     = {GLAMI-1M: A Multilingual Image-Text Fashion Dataset},
booktitle = {33rd British Machine Vision Conference 2022, {BMVC} 2022, London, UK, November 21-24, 2022},
publisher = {{BMVA} Press},
year      = {2022},
url       = {https://bmvc2022.mpi-inf.mpg.de/0607.pdf}
}

Copyright © 2022 The British Machine Vision Association and Society for Pattern Recognition
