CiT: Curation in Training for Effective Vision-Language Data
- URL: http://arxiv.org/abs/2301.02241v1
- Date: Thu, 5 Jan 2023 18:59:57 GMT
- Title: CiT: Curation in Training for Effective Vision-Language Data
- Authors: Hu Xu, Saining Xie, Po-Yao Huang, Licheng Yu, Russell Howes, Gargi
Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
- Abstract summary: This paper presents Curation in Training (CiT), a vision-text learning algorithm that couples a data objective into training.
CiT automatically yields quality data to speed up contrastive image-text training.
We observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large.
- Score: 84.77867625605053
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large vision-language models are generally applicable to many downstream
tasks, but come at an exorbitant training cost that only large institutions can
afford. This paper trades generality for efficiency and presents Curation in
Training (CiT), a simple and efficient vision-text learning algorithm that
couples a data objective into training. CiT automatically yields quality data
to speed up contrastive image-text training and alleviates the need for an
offline data filtering pipeline, allowing broad data sources (including raw
image-text pairs from the web). CiT contains two loops: an outer loop curating
the training data and an inner loop consuming the curated training data. The
text encoder connects the two loops. Given metadata for tasks of interest,
e.g., class names, and a large pool of image-text pairs, CiT alternately
selects relevant training data from the pool by measuring the similarity
between their text embeddings and the embeddings of the metadata. In our
experiments, we
observe that CiT can speed up training by over an order of magnitude,
especially if the raw data size is large.
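
As a rough illustration of the curation step described in the abstract, the numpy sketch below selects pool examples whose caption embeddings are close to the metadata (class-name) embeddings under cosine similarity. The threshold, embedding dimension, and pool size are illustrative assumptions, not values from the paper, and the inner contrastive training loop is only indicated in a comment.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def curate(caption_emb, metadata_emb, threshold=0.3):
    """Outer loop: keep pool examples whose caption embedding is close to any
    metadata embedding (e.g. a class name) under cosine similarity."""
    sim = l2_normalize(caption_emb) @ l2_normalize(metadata_emb).T  # (pool, meta)
    best = sim.max(axis=1)                 # best-matching metadata entry per caption
    return np.where(best >= threshold)[0]  # indices of curated pairs

# Toy usage with random vectors standing in for text-encoder outputs.
rng = np.random.default_rng(0)
caption_emb = rng.normal(size=(10_000, 512))   # embeddings of raw web captions
metadata_emb = rng.normal(size=(1_000, 512))   # embeddings of task metadata (class names)
curated = curate(caption_emb, metadata_emb, threshold=0.1)
print(f"curated {curated.size} of {caption_emb.shape[0]} pairs")
# Inner loop (not shown): contrastive image-text training on the curated subset,
# which also updates the text encoder used for the next round of curation.
```

Because the text encoder is updated by the inner loop, the curated subset changes from round to round, which is how the two loops are coupled.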
Related papers
- Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning [78.19528555505961]
We propose a novel vision model pre-training method called Latent Compression Learning (LCL) for interleaved image-text data.
The training objective can be decomposed into two basic tasks: 1) contrastive learning between visual representation and preceding context, and 2) generating subsequent text based on visual representation.
Our experiments demonstrate that our method not only matches the performance of CLIP on paired pre-training datasets, but can also leverage interleaved pre-training data.
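
A minimal numpy sketch of such a two-part objective is given below, assuming pooled visual representations, embeddings of the preceding text, and decoder logits for the next text token are already available. The batch size, dimensions, vocabulary size, and equal weighting of the two terms are illustrative assumptions, not details from the paper.

```python
import numpy as np

def contrastive_loss(visual, context, temperature=0.07):
    """InfoNCE-style loss matching each visual representation to its preceding context."""
    visual = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    context = context / np.linalg.norm(context, axis=1, keepdims=True)
    logits = visual @ context.T / temperature
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(visual))
    return -log_probs[idx, idx].mean()          # matching pairs lie on the diagonal

def generation_loss(logits, targets):
    """Cross-entropy for generating the text that follows the image."""
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(1)
visual = rng.normal(size=(8, 256))          # pooled visual representations
preceding = rng.normal(size=(8, 256))       # embeddings of the text before each image
next_logits = rng.normal(size=(8, 32_000))  # decoder logits for the next text token
next_tokens = rng.integers(0, 32_000, size=8)

total = contrastive_loss(visual, preceding) + generation_loss(next_logits, next_tokens)
print(float(total))
```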
arXiv Detail & Related papers (2024-06-11T17:59:35Z) - CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data [40.88256210436378]
We present a novel weakly supervised pre-training method for vision models on web-scale image-text data.
The proposed method reframes pre-training on image-text data as a classification task.
It achieves a remarkable 2.7x acceleration in training speed compared to contrastive learning on web-scale data.
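
The classification reframing can be sketched roughly as follows: captions are mapped to multi-hot targets over a caption-derived label vocabulary, and the image encoder is trained with a binary cross-entropy loss instead of a contrastive one. The tiny vocabulary, whitespace tokenization, and random logits standing in for encoder outputs are illustrative assumptions, not details from the paper.

```python
import numpy as np

# Toy label vocabulary derived from caption words (the paper derives its label
# space from caption content; the exact procedure is not reproduced here).
vocab = ["dog", "cat", "beach", "car", "tree", "person"]
word_to_idx = {w: i for i, w in enumerate(vocab)}

def caption_to_multihot(caption):
    """Turn a caption into a multi-hot classification target."""
    target = np.zeros(len(vocab), dtype=np.float32)
    for token in caption.lower().split():
        if token in word_to_idx:
            target[word_to_idx[token]] = 1.0
    return target

def bce_loss(logits, targets, eps=1e-8):
    """Binary cross-entropy: the classification objective replacing contrastive learning."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    return -(targets * np.log(probs + eps) + (1 - targets) * np.log(1 - probs + eps)).mean()

captions = ["a dog runs on the beach", "a person next to a car"]
targets = np.stack([caption_to_multihot(c) for c in captions])
logits = np.random.default_rng(2).normal(size=targets.shape)  # stand-in for image-encoder outputs
print(bce_loss(logits, targets))
```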
arXiv Detail & Related papers (2024-04-24T05:13:28Z) - Leveraging Unpaired Data for Vision-Language Generative Models via Cycle
Consistency [47.3163261953469]
Current vision-language generative models rely on expansive corpora of paired image-text data to attain optimal performance and generalization capabilities.
We introduce ITIT: an innovative training paradigm grounded in the concept of cycle consistency which allows vision-language training on unpaired image and text data.
ITIT consists of a joint image-text encoder with disjoint image and text decoders that enable bidirectional image-to-text and text-to-image generation in a single framework.
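
A toy numpy sketch of the cycle-consistency idea follows, with small linear maps standing in for the two generation directions (the actual model uses a joint encoder with disjoint image and text decoders); all shapes and the squared-error reconstruction penalty are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(6)
# Small linear maps stand in for the two generation directions; the real model
# uses a joint image-text encoder with disjoint image and text decoders.
W_t2i = rng.normal(size=(128, 256)) * 0.05   # text embedding -> generated image embedding
W_i2t = rng.normal(size=(256, 128)) * 0.05   # image embedding -> generated text embedding

def text_cycle_loss(text_emb):
    """Text -> generated image -> reconstructed text; only the text side needs
    ground truth, so unpaired text can supervise this direction."""
    recon = (text_emb @ W_t2i) @ W_i2t
    return np.mean((recon - text_emb) ** 2)

def image_cycle_loss(image_emb):
    """Image -> generated text -> reconstructed image, analogously for unpaired images."""
    recon = (image_emb @ W_i2t) @ W_t2i
    return np.mean((recon - image_emb) ** 2)

unpaired_text = rng.normal(size=(16, 128))
unpaired_image = rng.normal(size=(16, 256))
print(text_cycle_loss(unpaired_text) + image_cycle_loss(unpaired_image))
```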
arXiv Detail & Related papers (2023-10-05T17:55:19Z) - Data Filtering Networks [67.827994353269]
We study the problem of learning a data filtering network (DFN) for the second step of dataset construction: filtering a large uncurated dataset.
Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks.
Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets.
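
As a rough sketch of how such a filtering network would be applied, the snippet below scores each candidate pair with the cosine similarity of its image and text embeddings (standing in for a trained CLIP-style filtering network) and keeps the top fraction; the keep fraction, pool size, and embedding dimension are illustrative assumptions.

```python
import numpy as np

def filter_pool(image_emb, text_emb, keep_fraction=0.2):
    """Score each candidate pair with the filtering network's embeddings
    (here: plain cosine similarity) and keep the top fraction."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    scores = (image_emb * text_emb).sum(axis=1)          # per-pair alignment score
    k = int(len(scores) * keep_fraction)
    return np.argsort(-scores)[:k]                       # indices of retained pairs

rng = np.random.default_rng(3)
img = rng.normal(size=(50_000, 512))   # embeddings from the filtering network's image tower
txt = rng.normal(size=(50_000, 512))   # embeddings from its text tower
kept = filter_pool(img, txt, keep_fraction=0.2)
print(f"kept {kept.size} of {img.shape[0]} pairs")
```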
arXiv Detail & Related papers (2023-09-29T17:37:29Z) - Semi-Supervised Image Captioning by Adversarially Propagating Labeled
Data [95.0476489266988]
We present a novel data-efficient semi-supervised framework to improve the generalization of image captioning models.
Our proposed method trains a captioner to learn from paired data and to progressively associate unpaired data.
We report extensive empirical results on both (1) image-based and (2) dense region-based captioning datasets, followed by a comprehensive analysis of the scarcely-paired dataset.
arXiv Detail & Related papers (2023-01-26T15:25:43Z) - ASIF: Coupled Data Turns Unimodal Models to Multimodal Without Training [29.240131406803794]
We show that a common space can be created without any training at all, using single-domain encoders and a much smaller number of image-text pairs.
Our model has unique properties, most notably, deploying a new version with updated training samples can be done in a matter of seconds.
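
A simplified numpy sketch of the underlying idea: given a small set of ground-truth image-text anchor pairs, an image and a caption are each represented by their similarities to the anchors of their own modality, and matching happens in that shared "relative" space with no training. The anchor count, dimensions, and the omission of any further processing of the similarity vectors (e.g. sparsification) are illustrative simplifications.

```python
import numpy as np

def relative_rep(x, anchors):
    """Represent x by its cosine similarities to a fixed set of anchor embeddings."""
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    anchors = anchors / np.linalg.norm(anchors, axis=-1, keepdims=True)
    return x @ anchors.T                    # one coordinate per anchor pair

rng = np.random.default_rng(4)
n_anchors = 1_000
anchor_img = rng.normal(size=(n_anchors, 768))   # image-encoder embeddings of the anchor pairs
anchor_txt = rng.normal(size=(n_anchors, 384))   # text-encoder embeddings of the same pairs

query_img = rng.normal(size=(1, 768))            # a new image to match against candidate captions
candidate_txt = rng.normal(size=(5, 384))

# Both modalities land in the same space indexed by anchor pairs, so they can be compared directly.
img_rel = relative_rep(query_img, anchor_img)
txt_rel = relative_rep(candidate_txt, anchor_txt)
scores = relative_rep(img_rel, txt_rel)          # cosine similarity in the relative space
print("best caption index:", int(scores.argmax()))
```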
arXiv Detail & Related papers (2022-10-04T16:56:22Z) - Curriculum Learning for Data-Efficient Vision-Language Alignment [29.95935291982015]
Aligning image and text encoders from scratch using contrastive learning requires large amounts of paired image-text data.
We alleviate this need by aligning individually pre-trained language and vision representation models using a much smaller amount of paired data.
TOnICS outperforms CLIP on downstream zero-shot image retrieval while using less than 1% as much training data.
arXiv Detail & Related papers (2022-07-29T07:45:56Z) - Unsupervised Vision-and-Language Pre-training via Retrieval-based
Multi-Granular Alignment [66.77841319057299]
We propose a novel unsupervised Vision-and-Language pre-training curriculum for non-parallel texts and images.
We first construct a weakly aligned image-text corpus via a retrieval-based approach, then apply a set of multi-granular alignment pre-training tasks.
A comprehensive ablation study shows each granularity is helpful to learn a stronger pre-trained model.
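
The corpus-construction step can be sketched as a simple nearest-neighbour retrieval: for each unpaired image, retrieve the k most similar texts to form weak pairs for the subsequent alignment tasks. The use of cosine similarity over dense embeddings, and all sizes below, are illustrative assumptions; the paper's actual retrieval scheme may differ.

```python
import numpy as np

def retrieve_weak_pairs(image_emb, text_emb, k=3):
    """For each unpaired image, retrieve its k nearest texts to build a weakly
    aligned corpus for the multi-granular alignment pre-training tasks."""
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    sim = image_emb @ text_emb.T
    return np.argsort(-sim, axis=1)[:, :k]     # (num_images, k) indices of retrieved texts

rng = np.random.default_rng(5)
image_emb = rng.normal(size=(100, 256))    # embeddings of images with no paired captions
text_emb = rng.normal(size=(5_000, 256))   # embeddings of an independent text corpus
weak_pairs = retrieve_weak_pairs(image_emb, text_emb, k=3)
print(weak_pairs.shape)                    # each image gets 3 weakly aligned texts
```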
arXiv Detail & Related papers (2022-03-01T05:34:01Z) - Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval [80.7397409377659]
We propose an end-to-end trainable model that is designed to take advantage of both large-scale image and video captioning datasets.
Our model is flexible and can be trained on both image and video text datasets, either independently or in conjunction.
We show that this approach yields state-of-the-art results on standard downstream video-retrieval benchmarks.
arXiv Detail & Related papers (2021-04-01T17:48:27Z)