Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
- URL: http://arxiv.org/abs/2310.00927v2
- Date: Thu, 11 Jul 2024 00:38:08 GMT
- Title: Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP
- Authors: Zixiang Chen, Yihe Deng, Yuanzhi Li, Quanquan Gu
- Abstract summary: We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
- Score: 84.90129481336659
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal learning has become increasingly popular due to its ability to leverage information from different data sources (e.g., text and images) to improve model performance. Recently, CLIP has emerged as an effective approach that employs vision-language contrastive pretraining to learn joint image and text representations and exhibits remarkable performance in zero-shot learning and text-guided natural image generation. Despite the huge practical success of CLIP, its theoretical understanding remains elusive. In this paper, we formally study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned. We also analyze its zero-shot transfer performance on downstream tasks. Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
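The abstract refers to two mechanisms: the symmetric image-text contrastive objective used in pretraining and zero-shot transfer by matching an image against embedded class prompts. Below is a minimal NumPy sketch of both, with random vectors standing in for trained encoder outputs; the batch size, embedding dimension, and temperature are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (assumed details, not the paper's implementation) of
# (1) the symmetric image-text contrastive loss CLIP is trained with and
# (2) zero-shot classification by matching an image against class prompts.
import numpy as np

rng = np.random.default_rng(0)
DIM = 64  # joint embedding dimension (illustrative)

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE: matched image-text pairs sit on the diagonal."""
    img_emb, txt_emb = l2_normalize(img_emb), l2_normalize(txt_emb)
    logits = img_emb @ txt_emb.T / temperature          # (B, B) cosine similarities
    idx = np.arange(len(logits))
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx]   # image -> text direction
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx]   # text -> image direction
    return float(np.mean((loss_i2t + loss_t2i) / 2))

def zero_shot_predict(img_emb, class_prompt_emb):
    """Zero-shot transfer: pick the class prompt most similar to each image."""
    sims = l2_normalize(img_emb) @ l2_normalize(class_prompt_emb).T
    return sims.argmax(axis=1)

# Random stand-ins for encoder outputs on a batch of 8 image-text pairs.
img_batch = rng.normal(size=(8, DIM))
txt_batch = rng.normal(size=(8, DIM))
print("contrastive loss:", clip_contrastive_loss(img_batch, txt_batch))
# Ten class prompts, e.g. "a photo of a <class>", embedded by the text encoder.
class_prompts = rng.normal(size=(10, DIM))
print("zero-shot predictions:", zero_shot_predict(img_batch, class_prompts))
```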
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
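As a rough illustration of the hard-negative idea in the TripletCLIP entry above, the sketch below appends one synthetic "hard" negative caption per image to the usual in-batch negatives of a CLIP-style loss. The vectors and temperature are placeholders; the paper's actual negative-generation pipeline and objective are described in the paper itself.

```python
# Toy sketch: fold one hard negative caption per image into a CLIP-style
# image-to-text contrastive loss. Hard negatives here are random vectors;
# in TripletCLIP they come from in-context caption rewriting and
# text-to-image generation, and the exact objective may differ.
import numpy as np

rng = np.random.default_rng(1)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def log_softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def loss_with_hard_negatives(img, txt, hard_txt, temperature=0.07):
    """Image-to-text InfoNCE where each image also sees its own hard negative caption."""
    img, txt, hard_txt = l2n(img), l2n(txt), l2n(hard_txt)
    in_batch = img @ txt.T                                    # (B, B) usual negatives
    hard_col = np.sum(img * hard_txt, axis=1, keepdims=True)  # (B, 1) hard negatives
    logits = np.concatenate([in_batch, hard_col], axis=1) / temperature
    idx = np.arange(len(img))
    return float(np.mean(-log_softmax_rows(logits)[idx, idx]))  # positives on the diagonal

imgs = rng.normal(size=(8, 64))
caps = rng.normal(size=(8, 64))
hard_caps = rng.normal(size=(8, 64))  # stand-ins for generated negative captions
print("loss with hard negatives:", loss_with_hard_negatives(imgs, caps, hard_caps))
```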
- CLIP-CID: Efficient CLIP Distillation via Cluster-Instance Discrimination [28.061239778773423]
Contrastive Language-Image Pre-training (CLIP) has achieved excellent performance over a wide range of tasks.
CLIP heavily relies on a substantial corpus of pre-training data, resulting in notable consumption of computational resources.
We introduce CLIP-CID, a novel distillation mechanism that effectively transfers knowledge from a large vision-language foundation model to a smaller model.
arXiv Detail & Related papers (2024-08-18T11:23:21Z)
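The CLIP-CID entry above centers on distilling a large vision-language model into a smaller one. The snippet below is only a generic embedding-distillation sketch that matches the student's in-batch similarity distribution to the teacher's; the paper's cluster-instance discrimination mechanism is more specific and is not reproduced here.

```python
# Generic distillation sketch (not CLIP-CID's actual objective): make the
# student's in-batch similarity distribution mimic the teacher's via a KL term.
import numpy as np

rng = np.random.default_rng(2)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def similarity_distillation_loss(student_emb, teacher_emb, temperature=0.1):
    """KL(teacher || student) over row-wise softmax of in-batch similarity matrices."""
    p_s = softmax_rows(l2n(student_emb) @ l2n(student_emb).T / temperature)
    p_t = softmax_rows(l2n(teacher_emb) @ l2n(teacher_emb).T / temperature)
    return float(np.mean(np.sum(p_t * (np.log(p_t) - np.log(p_s)), axis=1)))

student = rng.normal(size=(8, 32))   # smaller student embedding (illustrative)
teacher = rng.normal(size=(8, 128))  # larger frozen teacher embedding
print("distillation loss:", similarity_distillation_loss(student, teacher))
```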
- CLAP: Isolating Content from Style through Contrastive Learning with Augmented Prompts [11.752632557524969]
We propose contrastive learning with data augmentation to disentangle content features from the original representations.
Our experiments across diverse datasets demonstrate significant improvements in zero-shot and few-shot classification tasks.
arXiv Detail & Related papers (2023-11-28T03:00:59Z)
- Distilling Knowledge from Text-to-Image Generative Models Improves Visio-Linguistic Reasoning in CLIP [57.53087077735303]
We introduce SDS-CLIP, a lightweight and sample-efficient distillation method to enhance CLIP's compositional visio-linguistic reasoning.
Our approach fine-tunes CLIP using a distillation objective borrowed from large text-to-image generative models like Stable Diffusion.
On the challenging Winoground benchmark, SDS-CLIP improves the visio-linguistic performance of various CLIP models by up to 7%, while on the ARO dataset, it boosts performance by up to 3%.
arXiv Detail & Related papers (2023-07-18T13:10:11Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Contrastive Language-Image Pre-Training with Knowledge Graphs [33.211811772961234]
We propose a knowledge-based pre-training framework, dubbed Knowledge-CLIP, which injects semantic information into the widely used CLIP model.
Our model can semantically align the representations in vision and language with higher quality, and enhance the reasoning ability across scenarios and modalities.
arXiv Detail & Related papers (2022-10-17T09:49:22Z)
- Robust Cross-Modal Representation Learning with Progressive Self-Distillation [7.676408770854477]
CLIP's vision-language learning objective does not effectively account for the noisy many-to-many correspondences found in web-harvested image-captioning datasets.
We introduce a novel training framework based on cross-modal contrastive learning that uses progressive self-distillation and soft image-text alignments to more efficiently learn robust representations from noisy data.
arXiv Detail & Related papers (2022-04-10T03:28:18Z)
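To make the "soft image-text alignments" mentioned in the entry above concrete, here is a toy contrastive loss whose targets blend the usual one-hot diagonal with a softened distribution produced by a teacher copy of the embeddings (self-distillation). The mixing weight, temperature, and teacher construction are placeholders; the paper's progressive schedule and precise formulation are not reproduced.

```python
# Toy sketch of soft image-text alignment targets for self-distillation.
# The "teacher" is just a perturbed copy of the same embeddings here; in the
# paper it is the model itself, applied progressively during training.
import numpy as np

rng = np.random.default_rng(3)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def softmax_rows(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_alignment_loss(img, txt, teacher_img, teacher_txt, alpha=0.5, temperature=0.07):
    """Cross-entropy against targets mixing one-hot pairs with teacher-softened pairs."""
    logits = l2n(img) @ l2n(txt).T / temperature
    teacher_logits = l2n(teacher_img) @ l2n(teacher_txt).T / temperature
    targets = (1 - alpha) * np.eye(len(logits)) + alpha * softmax_rows(teacher_logits)
    log_probs = logits - logits.max(axis=1, keepdims=True)
    log_probs = log_probs - np.log(np.exp(log_probs).sum(axis=1, keepdims=True))
    return float(np.mean(-np.sum(targets * log_probs, axis=1)))

img, txt = rng.normal(size=(8, 64)), rng.normal(size=(8, 64))
teacher_img = img + 0.05 * rng.normal(size=img.shape)
teacher_txt = txt + 0.05 * rng.normal(size=txt.shape)
print("soft-target loss:", soft_alignment_loss(img, txt, teacher_img, teacher_txt))
```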
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS uses vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms prior state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)
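As a rough illustration of "text-to-pixel alignment" from the CRIS entry above, the toy below scores every spatial position of an image feature map against a sentence embedding and thresholds the result into a mask. The feature map, embedding, and threshold are random stand-ins; the actual CRIS decoder and training objective are described in the paper.

```python
# Toy text-to-pixel alignment (not the CRIS architecture): cosine similarity
# between a sentence embedding and each position of a pixel-level feature map,
# thresholded into a binary segmentation mask.
import numpy as np

rng = np.random.default_rng(4)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

H, W, DIM = 16, 16, 64
pixel_features = l2n(rng.normal(size=(H, W, DIM)))   # stand-in visual feature map
text_embedding = l2n(rng.normal(size=(DIM,)))        # stand-in referring-expression embedding

similarity = pixel_features @ text_embedding          # (H, W) text-to-pixel scores
mask = similarity > similarity.mean()                 # crude threshold for illustration
print("predicted mask pixels:", int(mask.sum()), "of", H * W)
```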
- VL-LTR: Learning Class-wise Visual-Linguistic Representation for Long-Tailed Visual Recognition [61.75391989107558]
We present a visual-linguistic long-tailed recognition framework, termed VL-LTR.
Our method learns visual representations from images and corresponding linguistic representations from noisy class-level text descriptions.
Notably, our method achieves 77.2% overall accuracy on ImageNet-LT, which significantly outperforms the previous best method by over 17 points.
arXiv Detail & Related papers (2021-11-26T16:24:03Z)
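The VL-LTR entry above describes learning class-level linguistic representations from noisy text descriptions and using them alongside visual features. The toy below builds one text-side prototype per class by averaging several (random stand-in) description embeddings and classifies images by similarity to those prototypes; the actual VL-LTR architecture and training recipe are in the paper.

```python
# Toy sketch: class-wise text prototypes from noisy class-level descriptions,
# used as a nearest-prototype classifier for images. All embeddings are random
# stand-ins; nothing here reproduces VL-LTR's actual model.
import numpy as np

rng = np.random.default_rng(5)

def l2n(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

NUM_CLASSES, DESCS_PER_CLASS, DIM = 5, 3, 64

# Several noisy description embeddings per class, averaged into one prototype.
desc_emb = rng.normal(size=(NUM_CLASSES, DESCS_PER_CLASS, DIM))
class_prototypes = l2n(desc_emb.mean(axis=1))         # (NUM_CLASSES, DIM)

def classify(image_emb):
    """Assign each image to the class whose linguistic prototype is most similar."""
    return (l2n(image_emb) @ class_prototypes.T).argmax(axis=1)

images = rng.normal(size=(8, DIM))                     # stand-in image embeddings
print("predicted classes:", classify(images))
```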
This list is automatically generated from the titles and abstracts of the papers on this site.