GLAMI-1M: A Multilingual Image-Text Fashion Dataset
- URL: http://arxiv.org/abs/2211.14451v1
- Date: Thu, 17 Nov 2022 13:19:07 GMT
- Title: GLAMI-1M: A Multilingual Image-Text Fashion Dataset
- Authors: Vaclav Kosar, Antonín Hoskovec, Milan Šulc, Radek Bartyzal
- Abstract summary: GLAMI-1M is the largest multilingual image-text classification dataset and benchmark.
The dataset contains images of fashion products with item descriptions, each in 1 of 13 languages.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce GLAMI-1M: the largest multilingual image-text classification
dataset and benchmark. The dataset contains images of fashion products with
item descriptions, each in 1 of 13 languages. Categorization into 191 classes
has high-quality annotations: all 100k images in the test set and 75% of the 1M
training set were human-labeled. The paper presents baselines for image-text
classification, showing that the dataset poses a challenging fine-grained
classification problem: the best-scoring EmbraceNet model, using both visual and
textual features, achieves 69.7% accuracy. Experiments with a modified Imagen
model show the dataset is also suitable for image generation conditioned on
text. The dataset, source code and model checkpoints are published at
https://github.com/glami/glami-1m
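As a concrete illustration of the image-text classification task, the minimal sketch below shows a late-fusion classifier over the 191 categories. This is an assumption-laden illustration only: the feature dimensions, the choice of encoders, and fusion by simple concatenation are not the paper's EmbraceNet baseline.

```python
# Minimal image+text late-fusion classifier sketch for a GLAMI-1M-style task.
# NOTE: illustrative baseline only, not the paper's EmbraceNet model;
# encoder choices and dimensions are assumptions.
import torch
import torch.nn as nn

NUM_CLASSES = 191  # GLAMI-1M category count


class LateFusionClassifier(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=768, hidden=512, num_classes=NUM_CLASSES):
        super().__init__()
        # Project both modalities to a shared size, then classify the concatenation.
        self.img_proj = nn.Linear(img_dim, hidden)
        self.txt_proj = nn.Linear(txt_dim, hidden)
        self.head = nn.Sequential(
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(2 * hidden, num_classes),
        )

    def forward(self, img_feat, txt_feat):
        fused = torch.cat([self.img_proj(img_feat), self.txt_proj(txt_feat)], dim=-1)
        return self.head(fused)


if __name__ == "__main__":
    # Dummy batch: e.g. CNN image features (2048-d) and multilingual
    # sentence-embedding text features (768-d) would be plugged in here.
    model = LateFusionClassifier()
    logits = model(torch.randn(4, 2048), torch.randn(4, 768))
    print(logits.shape)  # torch.Size([4, 191])
```

Since item descriptions come in 13 languages, `txt_feat` would be produced by a multilingual text encoder; the paper's best-scoring baseline uses EmbraceNet fusion rather than this simple concatenation.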
Related papers
- Toffee: Efficient Million-Scale Dataset Construction for Subject-Driven Text-to-Image Generation [58.09421301921607]
We construct the first large-scale dataset for subject-driven image editing and generation.
Our dataset is 5 times the size of the previous largest dataset, yet our construction cost is tens of thousands of GPU hours lower.
arXiv Detail & Related papers (2024-06-13T16:40:39Z)
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger in scale than its counterparts while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- TagAlign: Improving Vision-Language Alignment with Multi-Tag Classification [59.779532652634295]
We propose an embarrassingly simple approach to better align image and text features, with no need for data formats beyond image-text pairs.
We parse objects and attributes from the description, which are highly likely to exist in the image, and use them as tags for multi-tag classification (a rough sketch follows this entry).
Experiments show an average 5.2% improvement of our framework over existing alternatives.
arXiv Detail & Related papers (2023-12-21T18:59:06Z)
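As a loose illustration of the tag-parsing idea above (not the TagAlign pipeline itself; the use of spaCy part-of-speech tags here is an assumption), candidate object and attribute tags could be extracted from a caption like this:

```python
# Hypothetical sketch: derive object/attribute tags from a caption for
# multi-tag supervision. spaCy POS tags are a stand-in for whatever parser
# TagAlign actually uses.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed


def caption_to_tags(caption: str) -> dict:
    doc = nlp(caption)
    objects = {tok.lemma_.lower() for tok in doc if tok.pos_ in ("NOUN", "PROPN")}
    attributes = {tok.lemma_.lower() for tok in doc if tok.pos_ == "ADJ"}
    return {"objects": sorted(objects), "attributes": sorted(attributes)}


print(caption_to_tags("A red leather handbag with golden buckles on a wooden table"))
# -> objects such as 'handbag', 'buckle', 'table'; attributes such as 'red', 'wooden'
```

Tag sets like these could then serve as multi-label targets in a classification head trained alongside the usual image-text alignment objective.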
- A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions [9.87625120950535]
We collect the Densely Captioned Images dataset, containing 7805 natural images human-annotated with mask-aligned descriptions.
With precise and reliable captions associated with specific parts of an image, we can evaluate vision-language models' understanding of image content.
We show that modern techniques that make progress on standard benchmarks do not correspond to significant improvements on our sDCI-based benchmark.
arXiv Detail & Related papers (2023-12-14T00:42:23Z)
- Filter & Align: Leveraging Human Knowledge to Curate Image-Text Data [31.507451966555383]
We present a novel algorithm that incorporates human knowledge of image-text alignment to guide the filtering of vast web-crawled image-text datasets.
We collect a diverse image-text dataset where each image is associated with multiple captions from various sources.
We train a reward model on these human-preference annotations to internalize the nuanced human understanding of image-text alignment.
arXiv Detail & Related papers (2023-12-11T05:57:09Z)
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models [11.362835828985494]
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that pruning with CLIPScore from a pretrained model suffers from multiple limitations, including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs (see the sketch after this entry).
arXiv Detail & Related papers (2023-10-03T14:53:53Z)
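To give a flavor of caption-based pruning, here is a loose sketch of scoring one image-text pair: generate a synthetic caption and compare it to the web alt-text in a text-embedding space. The specific models (a BLIP captioner, a MiniLM sentence encoder) are assumptions, not necessarily Sieve's choices.

```python
# Loose sketch of a caption-based alignment score for pruning an image-text pair.
# Model choices here are placeholders, not necessarily what Sieve uses.
from PIL import Image
from sentence_transformers import SentenceTransformer, util
from transformers import BlipForConditionalGeneration, BlipProcessor

captioner_name = "Salesforce/blip-image-captioning-base"
processor = BlipProcessor.from_pretrained(captioner_name)
captioner = BlipForConditionalGeneration.from_pretrained(captioner_name)
encoder = SentenceTransformer("all-MiniLM-L6-v2")


def alignment_score(image_path: str, alt_text: str) -> float:
    """Score how well web alt-text matches a synthetic caption of the image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    caption_ids = captioner.generate(**inputs, max_new_tokens=30)
    caption = processor.decode(caption_ids[0], skip_special_tokens=True)
    emb = encoder.encode([caption, alt_text], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()


# Pairs scoring below some threshold would be pruned from the training set.
```

In practice the threshold (or the fraction of data kept) would be tuned against downstream performance; the models above are placeholders.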
- GIST: Generating Image-Specific Text for Fine-grained Object Classification [8.118079247462425]
GIST is a method for generating image-specific fine-grained text descriptions from image-only datasets.
Our method achieves an average improvement of 4.1% in accuracy over CLIP linear probes.
arXiv Detail & Related papers (2023-07-21T02:47:18Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents [122.55393759474181]
We introduce OBELICS, an open web-scale filtered dataset of interleaved image-text documents.
We describe the dataset creation process, present comprehensive filtering rules, and provide an analysis of the dataset's content.
We train vision-and-language models of 9 and 80 billion parameters, named IDEFICS, and obtain competitive performance on various multimodal benchmarks.
arXiv Detail & Related papers (2023-06-21T14:01:01Z)
- Referring Image Matting [85.77905619102802]
We introduce a new task named Referring Image Matting (RIM) in this paper.
RIM aims to extract the meticulous alpha matte of the specific object that best matches the given natural language description.
RefMatte consists of 230 object categories, 47,500 images, 118,749 expression-region entities, and 474,996 expressions.
arXiv Detail & Related papers (2022-06-10T14:44:43Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.