Compress & Align: Curating Image-Text Data with Human Knowledge
- URL: http://arxiv.org/abs/2312.06726v2
- Date: Wed, 13 Dec 2023 04:31:59 GMT
- Title: Compress & Align: Curating Image-Text Data with Human Knowledge
- Authors: Lei Zhang, Fangxun Shu, Sucheng Ren, Bingchen Zhao, Hao Jiang, Cihang Xie
- Abstract summary: This paper introduces a novel algorithm, rooted in human knowledge, to compress web-crawled image-text datasets to a compact and high-quality form.
A reward model trained on the annotated dataset internalizes the nuanced human understanding of image-text alignment.
Experiments demonstrate that we are able to secure (or even improve) model performance while compressing the image-text datasets by up to ~90%.
- Score: 36.34714164235438
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The massive growth of image-text data through web crawling inherently
presents the challenge of variability in data quality. This paper introduces a
novel algorithm, rooted in human knowledge, to compress this vast corpus of
web-crawled image-text datasets to a compact and high-quality form. Our method
unfolds in three major steps. First, we collect an image-text dataset, wherein
each image is associated with multiple captions sourced from diverse origins.
Then, to systematically capture human preferences regarding the best caption
paired with each image, we establish a comprehensive set of both subjective and
objective criteria to critically guide labelers' alignment assessments.
Lastly, we train a reward model on the annotated dataset to
internalize the nuanced human understanding of image-text alignment. The
resulting reward model thus can act as a human-like referee to filter
misaligned/low-quality image-text pairs. Extensive experiments demonstrate that
we are able to secure (or even improve) model performance by compressing the
image-text datasets by up to ~90%. An impressive example is that, by aggressively
reducing the total training samples from 130M to 15.5M (i.e., ~9x smaller), our
BLIP-B/16 models still consistently outperform their
full-size-dataset counterpart on image-text retrieval (Flickr30K, COCO) by
~2.5% in Recall@1, and on image-captioning (Nocaps, COCO) by ~10.0% in CIDEr
and ~2.7% in SPICE.
Related papers
- Beyond Filtering: Adaptive Image-Text Quality Enhancement for MLLM Pretraining [31.176432567292093]
We propose the Adaptive Image-Text Quality Enhancer (AITQE), a model that dynamically assesses and enhances the quality of image-text pairs.
AITQE employs a text rewriting mechanism for low-quality pairs and incorporates a negative sample learning strategy to improve evaluative capabilities.
arXiv Detail & Related papers (2024-10-21T16:32:41Z)
- Debiasing Vision-Language Models with Text-Only Training [15.069736314663352]
We propose a Text-Only Debiasing framework called TOD, leveraging a text-as-image training paradigm to mitigate visual biases.
arXiv Detail & Related papers (2024-10-12T04:34:46Z)
- HYPE: Hyperbolic Entailment Filtering for Underspecified Images and Texts [49.21764163995419]
We introduce HYPerbolic Entailment filtering (HYPE) to extract meaningful and well-aligned data from noisy image-text pair datasets.
HYPE not only demonstrates a significant improvement in filtering efficiency but also sets a new state-of-the-art in the DataComp benchmark.
This breakthrough showcases the potential of HYPE to refine the data selection process, thereby contributing to the development of more accurate and efficient self-supervised learning models.
arXiv Detail & Related papers (2024-04-26T16:19:55Z)
- Leveraging Unpaired Data for Vision-Language Generative Models via Cycle Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT, an innovative training paradigm grounded in the concept of cycle consistency, which allows vision-language training on unpaired image and text data (a minimal sketch of the cycle-consistency idea appears after this list).
ITIT is comprised of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework.
arXiv Detail & Related papers (2023-10-05T17:55:19Z)
- ASPIRE: Language-Guided Data Augmentation for Improving Robustness Against Spurious Correlations [43.323791505213634]
ASPIRE (Language-guided Data Augmentation for SPurIous correlation REmoval) is a solution for supplementing the training dataset with images without spurious features.
It can generate non-spurious images without requiring any group labeling or existing non-spurious images in the training set.
It improves the worst-group classification accuracy of prior methods by 1% - 38%.
arXiv Detail & Related papers (2023-08-19T20:18:15Z)
- Improving Multimodal Datasets with Image Captioning [65.74736570293622]
We study how generated captions can increase the utility of web-scraped datapoints with nondescript text.
Our experiments with using generated captions at DataComp's large scale (1.28B image-text pairs) offer insights into the limitations of synthetic text.
arXiv Detail & Related papers (2023-07-19T17:47:12Z)
- Semi-Supervised Image Captioning by Adversarially Propagating Labeled Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis on the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z)
- Generative Negative Text Replay for Continual Vision-Language Pretraining [95.2784858069843]
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose a multi-modal knowledge distillation between images and texts to align the instance-wise prediction between old and new models.
arXiv Detail & Related papers (2022-10-31T13:42:21Z)
- Generating More Pertinent Captions by Leveraging Semantics and Style on Multi-Source Datasets [56.018551958004814]
This paper addresses the task of generating fluent descriptions by training on a non-uniform combination of data sources.
Large-scale datasets with noisy image-text pairs provide a sub-optimal source of supervision.
We propose to leverage and separate semantics and descriptive style through the incorporation of a style token and keywords extracted through a retrieval component.
arXiv Detail & Related papers (2021-11-24T19:00:05Z)
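As referenced in the ITIT entry above, the following is a minimal sketch of the general cycle-consistency idea for training on unpaired image and text data. The toy modules, feature shapes, and MSE reconstruction losses are assumptions for illustration only, not ITIT's actual joint encoder and disjoint decoders.
```python
# Minimal sketch of cycle-consistency training on unpaired image/text data.
# All module names and shapes are illustrative placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

D = 64  # toy embedding size

# Stand-ins for image-to-text and text-to-image generation.
image_to_text = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))
text_to_image = nn.Sequential(nn.Linear(D, D), nn.ReLU(), nn.Linear(D, D))

def cycle_losses(unpaired_images, unpaired_texts):
    # Image cycle: image -> generated text -> reconstructed image.
    fake_text = image_to_text(unpaired_images)
    rec_image = text_to_image(fake_text)
    loss_img_cycle = F.mse_loss(rec_image, unpaired_images)

    # Text cycle: text -> generated image -> reconstructed text.
    fake_image = text_to_image(unpaired_texts)
    rec_text = image_to_text(fake_image)
    loss_txt_cycle = F.mse_loss(rec_text, unpaired_texts)

    return loss_img_cycle + loss_txt_cycle

# Toy usage: unpaired batches of "image" and "text" features.
images = torch.randn(8, D)
texts = torch.randn(8, D)
cycle_losses(images, texts).backward()
```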