Improving CLIP Training with Language Rewrites
- URL: http://arxiv.org/abs/2305.20088v2
- Date: Sat, 28 Oct 2023 08:46:13 GMT
- Title: Improving CLIP Training with Language Rewrites
- Authors: Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, Yonglong Tian
- Abstract summary: We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
- Score: 57.935517901210225
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) stands as one of the most
effective and scalable methods for training transferable vision models using
paired image and text data. CLIP models are trained using contrastive loss,
which typically relies on data augmentations to prevent overfitting and
shortcuts. However, in the CLIP training paradigm, data augmentations are
exclusively applied to image inputs, while language inputs remain unchanged
throughout the entire training process, limiting the exposure of diverse texts
to the same image. In this paper, we introduce Language augmented CLIP
(LaCLIP), a simple yet highly effective approach to enhance CLIP training
through language rewrites. Leveraging the in-context learning capability of
large language models, we rewrite the text descriptions associated with each
image. These rewritten texts exhibit diversity in sentence structure and
vocabulary while preserving the original key concepts and meanings. During
training, LaCLIP randomly selects either the original texts or the rewritten
versions as text augmentations for each image. Extensive experiments on CC3M,
CC12M, RedCaps and LAION-400M datasets show that CLIP pre-training with
language rewrites significantly improves the transfer performance without
computation or memory overhead during training. Specifically for ImageNet
zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on
LAION-400M. Code is available at https://github.com/LijieFan/LaCLIP.
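To make the training recipe concrete, below is a minimal sketch (not the released LaCLIP code) of the two pieces the abstract describes: sampling one caption per image uniformly from the original text and its LLM rewrites, and scoring the batch with the standard symmetric CLIP contrastive (InfoNCE) loss. The helper names, placeholder captions, and random embeddings standing in for the image and text encoders are illustrative assumptions.

```python
# Sketch of LaCLIP-style text augmentation: each image's caption is drawn
# uniformly from {original caption} + {LLM rewrites}, then the batch is
# scored with the symmetric CLIP (InfoNCE) loss. Encoders are stand-ins.
import random
import torch
import torch.nn.functional as F


def sample_captions(caption_sets):
    """caption_sets: list where each entry holds the original caption
    followed by its rewritten variants. Returns one caption per image."""
    return [random.choice(variants) for variants in caption_sets]


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric image-to-text / text-to-image InfoNCE loss for one batch."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature            # (B, B) similarities
    targets = torch.arange(logits.size(0), device=logits.device)  # matched pairs on diagonal
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2


# Toy usage with random features in place of real image/text encoders.
caption_sets = [
    ["a dog runs on the beach",
     "a dog sprinting along the sandy shore",        # hypothetical LLM rewrite
     "a puppy dashing across the beach at sunset"],  # hypothetical LLM rewrite
    ["a bowl of fresh fruit on a table",
     "fresh fruit arranged in a bowl on a wooden table"],
]
batch_texts = sample_captions(caption_sets)          # text augmentation step
image_emb = torch.randn(len(batch_texts), 512)       # placeholder image features
text_emb = torch.randn(len(batch_texts), 512)        # placeholder text features
print(batch_texts, clip_contrastive_loss(image_emb, text_emb).item())
```

Because the rewrite sampling only changes which caption string is fed to the text encoder, it adds no extra forward passes, which matches the abstract's claim of no training-time compute or memory overhead.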
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- VeCLIP: Improving CLIP Training via Visual-enriched Captions [63.547204530720705]
This study introduces a scalable pipeline for noisy caption rewriting.
We emphasize incorporating visual concepts into captions, termed Visual-enriched Captions (VeCap).
We showcase the adaptation of this method for training CLIP on large-scale web-crawled datasets, termed VeCLIP.
arXiv Detail & Related papers (2023-10-11T17:49:13Z)
- CgT-GAN: CLIP-guided Text GAN for Image Captioning [48.276753091051035]
We propose CLIP-guided text GAN (CgT-GAN) to enable the model to "see" real visual modality.
We use adversarial training to teach CgT-GAN to mimic the phrases of an external text corpus.
CgT-GAN outperforms state-of-the-art methods significantly across all metrics.
arXiv Detail & Related papers (2023-08-23T10:25:37Z)
- S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions [69.01985134519244]
Vision-language models, such as contrastive language-image pre-training (CLIP), have demonstrated impressive results in natural image domains.
We propose S-CLIP, a semi-supervised learning method for training CLIP that utilizes additional unpaired images.
S-CLIP improves CLIP by 10% for zero-shot classification and 4% for image-text retrieval on the remote sensing benchmark.
arXiv Detail & Related papers (2023-05-23T14:18:11Z)
- Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness [19.77762574325687]
The CLIP (Contrastive Language-Image Pre-training) model and its variants are becoming the de facto backbone in many applications.
We discuss two effective approaches to improve the efficiency and robustness of CLIP training.
Our filter-based CLIP model achieves a top-1 accuracy of 68.78%, outperforming previous models, all of which score below 50%.
arXiv Detail & Related papers (2023-05-08T23:47:07Z)
- CLIPPO: Image-and-Language Understanding from Pixels Only [36.433133689137875]
We propose a pure pixel-based model to perform image, text, and multimodal tasks.
Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO).
When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks.
arXiv Detail & Related papers (2022-12-15T18:52:08Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method, built on CLIP and equipped with a Video Proxy mechanism, named CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- ProtoCLIP: Prototypical Contrastive Language Image Pretraining [12.067061175987075]
Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance the grouping of image and text representations.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to scale to unlimited amounts of data.
arXiv Detail & Related papers (2022-06-22T11:55:53Z)