Too Large; Data Reduction for Vision-Language Pre-Training
- URL: http://arxiv.org/abs/2305.20087v3
- Date: Fri, 18 Aug 2023 08:20:06 GMT
- Title: Too Large; Data Reduction for Vision-Language Pre-Training
- Authors: Alex Jinpeng Wang, Kevin Qinghong Lin, David Junhao Zhang, Stan
Weixian Lei and Mike Zheng Shou
- Abstract summary: This paper examines the problems of severe image-text misalignment and high redundancy in widely used Vision-Language Pre-Training (VLP) datasets.
To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR.
Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples.
Second, a new caption is generated to complement the original captions for selected samples, mitigating the text-image misalignment problem.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper examines the problems of severe image-text misalignment and high
redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP)
datasets. To address these issues, we propose an efficient and straightforward
Vision-Language learning algorithm called TL;DR, which aims to compress the
existing large VLP data into a small, high-quality set. Our approach consists
of two major steps. First, a codebook-based encoder-decoder captioner is
developed to select representative samples. Second, a new caption is generated
to complement the original captions for selected samples, mitigating the
text-image misalignment problem while maintaining uniqueness. As a result,
TL;DR enables us to reduce the large dataset into a small set of high-quality
data, which can serve as an alternative pre-training dataset. This algorithm
significantly speeds up the time-consuming pretraining process. Specifically,
TL;DR can compress mainstream VLP datasets at a high ratio, e.g., reducing the
well-cleaned CC3M dataset from 2.82M to 0.67M ($\sim$24\%) and the noisy YFCC15M
from 15M to 2.5M ($\sim$16.7\%). Extensive experiments with three popular VLP
models over seven downstream tasks show that a VLP model trained on the
compressed dataset provided by TL;DR achieves similar or even better results
than one trained on the full-scale dataset. The code will be made
available at \url{https://github.com/showlab/datacentric.vlp}.
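For intuition, here is a minimal Python sketch of the two-step pipeline described above. It is not the authors' implementation: k-means cluster centers stand in for the learned codebook, random vectors stand in for encoder embeddings of image-text pairs, and `generate_caption` is a hypothetical captioner hook.

```python
# Hedged sketch of TL;DR's two steps, not the authors' code: k-means
# centers stand in for the learned codebook, random vectors stand in for
# encoder embeddings, and generate_caption is a hypothetical hook.
import numpy as np
from sklearn.cluster import KMeans

def select_representatives(embeddings, n_codes, per_code):
    """Step 1: quantize pair embeddings against a codebook and keep the
    samples closest to each code, yielding a small but diverse subset."""
    km = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(embeddings)
    keep = []
    for c in range(n_codes):
        idx = np.flatnonzero(km.labels_ == c)
        if idx.size == 0:
            continue
        dist = np.linalg.norm(embeddings[idx] - km.cluster_centers_[c], axis=1)
        keep.extend(idx[np.argsort(dist)[:per_code]].tolist())
    return sorted(keep)

def complement_captions(captions, generate_caption):
    """Step 2: append a generated caption to each original one, easing
    image-text misalignment while keeping the original text's uniqueness."""
    return [f"{orig} ; {generate_caption(i)}" for i, orig in enumerate(captions)]

# Toy run: 1000 fake pairs reduced to 100 (the real ratios are ~24% on
# CC3M and ~16.7% on YFCC15M, per the abstract).
emb = np.random.default_rng(0).normal(size=(1000, 64))
caps = [f"alt-text {i}" for i in range(1000)]
kept = select_representatives(emb, n_codes=20, per_code=5)
new_caps = complement_captions([caps[i] for i in kept],
                               lambda i: f"generated caption {i}")
print(len(kept), new_caps[0])
```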
Related papers
- Multi-Sentence Grounding for Long-term Instructional Video (2023-12-21)
We aim to establish an automatic, scalable pipeline for denoising a large-scale instructional dataset.
We construct HowToStep, a high-quality video-text dataset supervised with multiple descriptive steps.
- VeCLIP: Improving CLIP Training via Visual-enriched Captions (2023-10-11)
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize the incorporation of visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
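A toy sketch of the enrichment idea, assuming an off-the-shelf tagger or captioner supplies the visual concepts; the `enrich_caption` helper is hypothetical, not the paper's pipeline.

```python
# Hedged sketch of VeCap-style caption enrichment, not the paper's code.
# Assumption: a tagger/captioner supplies visual_concepts; they are
# passed in directly here.
def enrich_caption(alt_text: str, visual_concepts: list[str]) -> str:
    """Rewrite a noisy web caption so it stays grounded in what the image
    shows, while keeping the original alt-text's specifics."""
    return f"{alt_text.strip()}, showing {', '.join(visual_concepts)}"

print(enrich_caption("IMG_0042 summer trip ", ["a sandy beach", "two dogs"]))
```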
- Sieve: Multimodal Dataset Pruning Using Image Captioning Models (2023-10-03)
Vision-Language Models (VLMs) are pretrained on large, diverse, and noisy web-crawled datasets.
We argue that this approach suffers from multiple limitations including false positives and negatives due to CLIP's pretraining on noisy labels.
We propose a pruning signal, Sieve, that employs synthetic captions generated by image-captioning models pretrained on small, diverse, and well-aligned image-text pairs.
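A minimal sketch of such a pruning signal, with a token-overlap score standing in for the paper's caption-similarity model and a dictionary standing in for the image captioner:

```python
# Hedged sketch of a Sieve-style pruning signal, not the paper's code.
# Stand-ins: Jaccard token overlap instead of a sentence encoder, and a
# dict lookup instead of a pretrained captioner.
def alignment_score(web_caption: str, synthetic: str) -> float:
    a, b = set(web_caption.lower().split()), set(synthetic.lower().split())
    return len(a & b) / max(len(a | b), 1)  # overlap in [0, 1]

def sieve(pairs, captioner, keep_ratio=0.5):
    """Rank image-text pairs by agreement between the web caption and a
    synthetic caption of the image, then keep the best-aligned fraction."""
    scored = sorted(pairs, key=lambda p: alignment_score(p[1], captioner(p[0])),
                    reverse=True)
    return scored[: int(len(scored) * keep_ratio)]

# Toy run with a fake captioner that "sees" the image content.
fake = {"img1": "a dog on grass", "img2": "a red car"}
pairs = [("img1", "dog playing on the grass"), ("img2", "buy now best price")]
print(sieve(pairs, captioner=fake.get, keep_ratio=0.5))
```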
- Weakly Supervised Vision-and-Language Pre-training with Relative Representations (2023-05-24)
Weakly supervised vision-and-language pre-training has been shown to effectively reduce the data cost of pre-training.
Current methods use only local descriptions of images, i.e., object tags, as cross-modal anchors to construct weakly-aligned image-text pairs for pre-training.
- CiT: Curation in Training for Effective Vision-Language Data (2023-01-05)
This paper presents Curation in Training (CiT), a vision-text learning algorithm that couples a data objective into training.
CiT automatically yields quality data to speed-up contrastive image-text training.
We observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
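A rough sketch of coupling a data objective into the training loop; the keyword-overlap relevance score and the `curated_batches` generator are illustrative stand-ins, not CiT's learned curation:

```python
# Hedged sketch of curation-in-training, not the paper's implementation.
# Stand-in: keyword overlap with task metadata instead of a learned
# text-embedding relevance model.
METADATA = {"dog", "cat", "car", "flower"}  # e.g., downstream class names

def relevance(caption: str) -> float:
    return len(METADATA & set(caption.lower().split())) / len(METADATA)

def curated_batches(stream, batch_size=2, threshold=0.0):
    """Interleave curation with training: keep only samples whose captions
    score above a threshold against the task metadata, batch the rest."""
    batch = []
    for caption in stream:
        if relevance(caption) > threshold:  # data objective inside training
            batch.append(caption)
            if len(batch) == batch_size:
                yield batch                 # would feed a contrastive step
                batch = []

stream = ["a dog in a park", "lorem ipsum", "a red car", "click here", "a cat"]
for b in curated_batches(stream):
    print("train step on:", b)
```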
- Generative Negative Text Replay for Continual Vision-Language Pretraining (2022-10-31)
Vision-language pre-training has attracted increasing attention recently.
Massive data are usually collected in a streaming fashion.
We propose multi-modal knowledge distillation between images and texts to align the instance-wise predictions of the old and new models.
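A hedged sketch of such instance-wise distillation, assuming `old_logits` and `new_logits` are image-to-text similarity rows from the frozen old model and the current model (the loss form here is a standard KD KL term, not necessarily the paper's exact objective):

```python
# Hedged sketch of instance-wise multi-modal distillation, not the
# paper's code. Assumption: logits are image-to-text similarity rows.
import torch
import torch.nn.functional as F

def distill_loss(new_logits: torch.Tensor, old_logits: torch.Tensor,
                 tau: float = 2.0) -> torch.Tensor:
    """KL between old and new per-instance image-text similarity
    distributions, so continual pretraining on streaming data does not
    drift from previously learned alignment."""
    p_old = F.softmax(old_logits / tau, dim=-1)
    log_p_new = F.log_softmax(new_logits / tau, dim=-1)
    return F.kl_div(log_p_new, p_old, reduction="batchmean") * tau ** 2

new = torch.randn(4, 8, requires_grad=True)  # 4 images vs 8 texts
old = torch.randn(4, 8)                      # frozen old model's view
loss = distill_loss(new, old)
loss.backward()
print(float(loss))
```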
- Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision (2022-10-24)
Weakly-supervised vision-language pre-training aims at learning cross-modal alignment with little or no paired data.
Recent methods, which pair visual features with object tags, help achieve performances comparable with some models trained with aligned pairs in various V-L downstream tasks.
We address the lack of paired V-L data for model supervision with a novel Visual Vocabulary based Feature Hallucinator (WFH).
WFH generates visual hallucinations from texts, which are then paired with the originally unpaired texts, allowing more diverse interactions across modalities.
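A minimal sketch of the hallucination idea, with a small MLP standing in for the paper's hallucinator and random vectors standing in for text embeddings:

```python
# Hedged sketch in the spirit of WFH, not the paper's architecture.
# Assumption: a small MLP maps text embeddings to pseudo-visual features,
# which are then paired with the originally unpaired texts.
import torch
import torch.nn as nn

class Hallucinator(nn.Module):
    def __init__(self, text_dim=64, vis_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(text_dim, 256), nn.GELU(),
                                 nn.Linear(256, vis_dim))

    def forward(self, text_emb):              # text -> pseudo-visual feature
        return self.net(text_emb)

text_emb = torch.randn(32, 64)                # embeddings of unpaired texts
pseudo_visual = Hallucinator()(text_emb)      # "hallucinated" image features
pairs = list(zip(pseudo_visual, text_emb))    # weakly aligned V-L pairs
print(pseudo_visual.shape, len(pairs))
```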
- Unsupervised Vision-and-Language Pre-training via Retrieval-based Multi-Granular Alignment (2022-03-01)
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
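A toy sketch of the retrieval-based weak alignment step, assuming unimodal encoders already place images and texts in a comparable space (random vectors stand in for both here; the multi-granular pre-training tasks are out of scope):

```python
# Hedged sketch of retrieval-based weak alignment, not the paper's
# curriculum. Stand-in: random vectors instead of encoder embeddings.
import numpy as np

def build_weak_corpus(img_emb: np.ndarray, txt_emb: np.ndarray, k: int = 2):
    """Pair each image with its k nearest texts by cosine similarity,
    producing a weakly aligned corpus from non-parallel data."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    sims = img @ txt.T
    return [(i, j) for i in range(len(img)) for j in np.argsort(-sims[i])[:k]]

rng = np.random.default_rng(1)
corpus = build_weak_corpus(rng.normal(size=(5, 32)), rng.normal(size=(20, 32)))
print(corpus[:4])  # (image_index, text_index) weak pairs
```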