Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual
Concept Understanding
- URL: http://arxiv.org/abs/2401.04575v2
- Date: Tue, 5 Mar 2024 21:02:33 GMT
- Title: Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual
Concept Understanding
- Authors: Yatong Bai, Utsav Garg, Apaar Shanker, Haoming Zhang, Samyak Parajuli,
Erhan Bas, Isidora Filipovic, Amelia N. Chu, Eugenia D Fomitcheva, Elliot
Branson, Aerin Kim, Somayeh Sojoudi, Kyunghyun Cho
- Abstract summary: The Let's Go Shopping (LGS) dataset is a large-scale public dataset with 15 million image-caption pairs from publicly available e-commerce websites.
Our experiments show that the classifiers trained on existing benchmark datasets do not readily generalize to e-commerce data.
LGS enables image-captioning models to generate richer captions and helps text-to-image generation models achieve e-commerce style transfer.
- Score: 36.01657852250117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision and vision-language applications of neural networks, such as image
classification and captioning, rely on large-scale annotated datasets that
require non-trivial data-collecting processes. This time-consuming endeavor
hinders the emergence of large-scale datasets, limiting researchers and
practitioners to a small number of choices. Therefore, we seek more efficient
ways to collect and annotate images. Previous initiatives have gathered
captions from HTML alt-texts and crawled social media postings, but these data
sources suffer from noise, sparsity, or subjectivity. For this reason, we turn
to commercial shopping websites whose data meet three criteria: cleanliness,
informativeness, and fluency. We introduce the Let's Go Shopping (LGS) dataset,
a large-scale public dataset with 15 million image-caption pairs from publicly
available e-commerce websites. When compared with existing general-domain
datasets, the LGS images focus on the foreground object and have less complex
backgrounds. Our experiments on LGS show that the classifiers trained on
existing benchmark datasets do not readily generalize to e-commerce data, while
specific self-supervised visual feature extractors can better generalize.
Furthermore, LGS's high-quality e-commerce-focused images and bimodal nature
make it advantageous for vision-language bi-modal tasks: LGS enables
image-captioning models to generate richer captions and helps text-to-image
generation models achieve e-commerce style transfer.
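The abstract does not specify how LGS is distributed, so the following is a minimal sketch, assuming a hypothetical JSONL manifest in which each record stores an image path and its caption (the field names "image_path" and "caption" are illustrative, not from the paper). It shows how the dataset's bimodal image-caption structure would typically be consumed for captioning or contrastive pretraining.

```python
# Minimal sketch: iterating over LGS-style image-caption pairs.
# Assumes a hypothetical JSONL manifest with "image_path" and "caption"
# fields; the actual LGS release may use a different layout.
import json
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class ImageCaptionDataset(Dataset):
    def __init__(self, manifest_path: str, transform=None):
        # One JSON record per line, e.g. {"image_path": "...", "caption": "..."}
        self.records = [
            json.loads(line)
            for line in Path(manifest_path).read_text().splitlines()
            if line.strip()
        ]
        self.transform = transform

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        record = self.records[idx]
        image = Image.open(record["image_path"]).convert("RGB")
        if self.transform is not None:
            image = self.transform(image)
        return image, record["caption"]


# Usage (hypothetical file name):
# dataset = ImageCaptionDataset("lgs_train.jsonl")
# image, caption = dataset[0]
```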
Related papers
- Web-Scale Visual Entity Recognition: An LLM-Driven Data Approach [56.55633052479446]
Web-scale visual entity recognition presents significant challenges due to the lack of clean, large-scale training data.
We propose a novel methodology to curate such a dataset, leveraging a multimodal large language model (LLM) for label verification, metadata generation, and rationale explanation.
Experiments demonstrate that models trained on this automatically curated data achieve state-of-the-art performance on web-scale visual entity recognition tasks.
arXiv Detail & Related papers (2024-10-31T06:55:24Z)
- AEye: A Visualization Tool for Image Datasets [18.95453617434051]
AEye is a visualization tool tailored to image datasets.
AEye embeds images into semantically meaningful high-dimensional representations, facilitating data clustering and organization.
AEye also provides semantic search for both text and image queries, enabling users to locate relevant content.
arXiv Detail & Related papers (2024-08-07T20:19:20Z)
- Enhancing Vision Models for Text-Heavy Content Understanding and Interaction [0.0]
We build a visual chat application integrating CLIP for image encoding and a model from the Massive Text Embedding Benchmark.
The aim of the project is to advance vision models' capabilities in understanding complex, interconnected visual and textual data.
arXiv Detail & Related papers (2024-05-31T15:17:47Z)
- xT: Nested Tokenization for Larger Context in Large Images [79.37673340393475]
xT is a framework for vision transformers which aggregates global context with local details.
We are able to increase accuracy by up to 8.6% on challenging classification tasks.
arXiv Detail & Related papers (2024-03-04T10:29:58Z)
- Weakly Supervised Annotations for Multi-modal Greeting Cards Dataset [8.397847537464534]
We propose to aggregate features from pretrained image and text embeddings to learn abstract visual concepts from the Greeting Cards dataset.
The proposed dataset is also useful for generating greeting card images with a pre-trained text-to-image generation model.
arXiv Detail & Related papers (2022-12-01T20:07:52Z)
- Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
- DatasetGAN: Efficient Labeled Data Factory with Minimal Human Effort [117.41383937100751]
Current deep networks are extremely data-hungry, benefiting from training on large-scale datasets.
We show how the GAN latent code can be decoded to produce a semantic segmentation of the image.
These generated datasets can then be used for training any computer vision architecture just as real datasets are.
arXiv Detail & Related papers (2021-04-13T20:08:29Z)
- Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision [57.031588264841]
We leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps.
A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss; a minimal sketch of such a setup appears after this list.
We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme.
arXiv Detail & Related papers (2021-02-11T10:08:12Z)
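As a rough illustration of the dual-encoder contrastive setup described in the last entry above (and commonly used for image-text pretraining on datasets such as LGS), the following is a minimal sketch of a symmetric, InfoNCE-style contrastive loss over a batch of image and text embeddings. The temperature value and embedding dimensions are illustrative assumptions, not details taken from that paper.

```python
# Minimal sketch of a symmetric contrastive loss for a dual-encoder
# image-text model (CLIP/ALIGN-style). Temperature and batch size are
# illustrative assumptions.
import torch
import torch.nn.functional as F


def contrastive_loss(image_emb: torch.Tensor,
                     text_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    # Normalize embeddings so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity matrix: logits[i, j] = sim(image_i, text_j).
    logits = image_emb @ text_emb.t() / temperature

    # Matching image-text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Usage with random stand-in embeddings (batch of 8, 512-dim):
# loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```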
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.