Compress & Align: Curating Image-Text Data with Human Knowledge
- URL: http://arxiv.org/abs/2312.06726v2
- Date: Wed, 13 Dec 2023 04:31:59 GMT
- Title: Compress & Align: Curating Image-Text Data with Human Knowledge
- Authors: Lei Zhang, Fangxun Shu, Sucheng Ren, Bingchen Zhao, Hao Jiang, Cihang Xie
- Abstract summary: This paper introduces a novel algorithm, rooted in human knowledge, to compress web-crawled image-text datasets to a compact and high-quality form.
A reward model trained on the annotated dataset internalizes the nuanced human understanding of image-text alignment.
Experiments demonstrate that we are able to secure (or even improve) model performance while compressing the image-text datasets by up to 90%.
- Score: 36.34714164235438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The massive growth of image-text data through web crawling inherently
presents the challenge of variability in data quality. This paper introduces a
novel algorithm, rooted in human knowledge, to compress this vast corpus of
web-crawled image-text datasets to a compact and high-quality form. Our method
unfolds in three major steps. First, we collect an image-text dataset, wherein
each image is associated with multiple captions sourced from diverse origins.
Then, to systematically capture human preferences regarding the best caption
paired with each image, we establish a comprehensive set of both subjective and
objective criteria for critically guiding the alignment assessment from
labelers. Lastly, we train a reward model on the annotated dataset to
internalize the nuanced human understanding of image-text alignment. The
resulting reward model thus can act as a human-like referee to filter
misaligned/low-quality image-text pairs. Extensive experiments demonstrate that
we are able to secure (or even improve) model performance while compressing the
image-text datasets by up to ~90%. An impressive example is that, by aggressively
reducing the total training samples from 130M to 15.5M (i.e., ~9x smaller), our
BLIP-B/16 models still consistently show superior performance compared with the
full-size-dataset counterpart on image-text retrieval (Flickr30K, COCO) by
~2.5% in Recall@1, and on image-captioning (Nocaps, COCO) by ~10.0% in CIDEr
and ~2.7% in SPICE.
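
The filtering mechanism lends itself to a short sketch. Below is a minimal PyTorch-style example, assuming pretrained image/text encoders (e.g., CLIP towers) as the backbone and a Bradley-Terry pairwise preference loss of the kind commonly used to train reward models; the class names, reward head, and threshold are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignmentRewardModel(nn.Module):
    """Illustrative reward model that scores image-caption alignment.

    `image_encoder` and `text_encoder` are assumed to be pretrained
    feature extractors (e.g., CLIP towers) returning (B, dim) tensors;
    only the reward head is sketched here.
    """

    def __init__(self, image_encoder, text_encoder, dim=512):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        self.head = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.ReLU(),
            nn.Linear(dim, 1),
        )

    def forward(self, images, captions):
        img = self.image_encoder(images)   # (B, dim)
        txt = self.text_encoder(captions)  # (B, dim)
        return self.head(torch.cat([img, txt], dim=-1)).squeeze(-1)  # (B,)


def preference_loss(model, images, preferred_caps, rejected_caps):
    """Bradley-Terry pairwise loss: the human-preferred caption should
    receive a higher reward than the rejected one for the same image."""
    r_pref = model(images, preferred_caps)
    r_rej = model(images, rejected_caps)
    return -F.logsigmoid(r_pref - r_rej).mean()


@torch.no_grad()
def filter_pairs(model, loader, threshold=0.0):
    """Keep only image-text pairs whose reward clears `threshold`,
    compressing the corpus to its well-aligned subset."""
    kept_ids = []
    for images, captions, pair_ids in loader:
        scores = model(images, captions)
        kept_ids.extend(i for i, s in zip(pair_ids, scores) if s.item() > threshold)
    return kept_ids
```

In practice the threshold would be tuned on held-out annotations, or replaced by a keep-ratio (e.g., retaining roughly the top 12% of pairs to match the 130M-to-15.5M compression reported above).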
Related papers
- OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text [112.60163342249682]
We introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset.
Our dataset is 15 times larger than comparable corpora while maintaining good data quality.
We hope this could provide a solid data foundation for future multimodal model research.
arXiv Detail & Related papers (2024-06-12T17:01:04Z)
- Text Data-Centric Image Captioning with Interactive Prompts [20.48013600818985]
Supervised image captioning approaches have made great progress, but it is challenging to collect high-quality human-annotated image-text data.
This paper proposes a new Text data-centric approach with Interactive Prompts for image Captioning, named TIPCap.
arXiv Detail & Related papers (2024-03-28T07:43:49Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT: an innovative training paradigm grounded in the concept of cycle consistency, which allows vision-language training on unpaired image and text data.
ITIT comprises a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework (see the cycle-consistency sketch after this list).
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding [53.170767750244366]
Imagen is a text-to-image diffusion model with an unprecedented degree of photorealism and a deep level of language understanding.
To assess text-to-image models in greater depth, we introduce DrawBench, a comprehensive and challenging benchmark for text-to-image models.
arXiv Detail & Related papers (2022-05-23T17:42:53Z)
- PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining [68.84339672878066]
We introduce PyramidCLIP, which constructs an input pyramid with different semantic levels, and aligns visual elements and linguistic elements in the form of hierarchy.
Experiments on three downstream tasks, including zero-shot image classification, zero-shot image-text retrieval and image object detection, verify the effectiveness of the proposed PyramidCLIP.
arXiv Detail & Related papers (2022-04-29T13:38:42Z)
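
The ITIT entry above trains on unpaired data via cycle consistency; a minimal sketch of that idea follows. The generator functions `i2t` and `t2i` are hypothetical stand-ins for the paper's image-to-text and text-to-image decoders, and plain reconstruction error stands in for whatever objective the authors actually use.

```python
import torch.nn.functional as F

def cycle_consistency_loss(i2t, t2i, unpaired_image, unpaired_text):
    """Hypothetical ITIT-style cycle consistency on unpaired data.

    i2t: image -> text-embedding generator (image-to-text decoder)
    t2i: text-embedding -> image generator (text-to-image decoder)
    `unpaired_image` has no ground-truth caption, and `unpaired_text`
    has no ground-truth image.
    """
    # Image -> text -> image: reconstruction should match the real image.
    generated_text = i2t(unpaired_image)
    reconstructed_image = t2i(generated_text)
    image_cycle = F.mse_loss(reconstructed_image, unpaired_image)

    # Text -> image -> text: reconstruction should match the real text.
    generated_image = t2i(unpaired_text)
    reconstructed_text = i2t(generated_image)
    text_cycle = F.mse_loss(reconstructed_text, unpaired_text)

    return image_cycle + text_cycle
```

Because each cycle needs ground truth in only one modality, the loss can be computed on images without captions and text without images, which is what makes unpaired training possible.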