ProtoCLIP: Prototypical Contrastive Language Image Pretraining
- URL: http://arxiv.org/abs/2206.10996v4
- Date: Tue, 21 Nov 2023 04:18:38 GMT
- Title: ProtoCLIP: Prototypical Contrastive Language Image Pretraining
- Authors: Delong Chen, Zhao Wu, Fan Liu, Zaiquan Yang, Huaxi Huang, Ying Tan,
and Erjin Zhou
- Abstract summary: Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is introduced to enhance the representation grouping effect that emerges during contrastive pretraining.
ProtoCLIP sets up prototype-level discrimination between image and text spaces, which efficiently transfers higher-level structural knowledge.
ProtoCLIP is trained with an online episodic training strategy, which allows it to be scaled up to unlimited amounts of data.
- Score: 12.067061175987075
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contrastive Language Image Pretraining (CLIP) has received widespread
attention, since its learned representations can be transferred well to various
downstream tasks. During the training process of the CLIP model, the InfoNCE
objective aligns positive image-text pairs and separates negative ones. We show
an underlying representation grouping effect during this process: the InfoNCE
objective indirectly groups semantically similar representations together via
randomly emerged within-modal anchors. Based on this understanding, in this
paper, Prototypical Contrastive Language Image Pretraining (ProtoCLIP) is
introduced to enhance such grouping by boosting its efficiency and increasing
its robustness against the modality gap. Specifically, ProtoCLIP sets up
prototype-level discrimination between image and text spaces, which efficiently
transfers higher-level structural knowledge. Further, Prototypical Back
Translation (PBT) is proposed to decouple representation grouping from
representation alignment, resulting in effective learning of meaningful
representations under a large modality gap. PBT also enables us to introduce
additional external teachers with richer prior language knowledge. ProtoCLIP is
trained with an online episodic training strategy, which allows it to be scaled
up to unlimited amounts of data. We train ProtoCLIP on Conceptual Captions and
achieve a +5.81% ImageNet linear probing improvement and a +2.01% ImageNet
zero-shot classification improvement. On the larger YFCC-15M dataset, ProtoCLIP
matches the performance of CLIP with 33% of the training time. Code is
available at https://github.com/megvii-research/protoclip.
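The prototype-level discrimination described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical PyTorch example, not the released implementation: the helper names, the cluster count k, the temperature tau, and the use of plain k-means are all illustrative assumptions. It instantiates the general idea of cross-modal prototype-level discrimination: each modality is clustered within the current episode, and every image is classified against the text prototypes using the cluster assignment of its paired caption as the target (and symmetrically for text), so structural knowledge is transferred at the prototype level rather than only at the instance level.
```python
# Minimal sketch of prototype-level cross-modal discrimination in the spirit
# of ProtoCLIP. Not the authors' code; names and hyperparameters are assumed.
import torch
import torch.nn.functional as F


def kmeans_prototypes(feats, k, iters=10):
    """Plain k-means on L2-normalised features; returns (centroids, assignments).
    Assumes the episode batch has at least k examples. Clustering runs on
    detached features, so no gradient flows through it."""
    feats = F.normalize(feats.detach(), dim=-1)
    centroids = feats[torch.randperm(feats.size(0))[:k]].clone()
    for _ in range(iters):
        assign = (feats @ centroids.t()).argmax(dim=-1)   # nearest prototype
        for c in range(k):
            mask = assign == c
            if mask.any():
                centroids[c] = F.normalize(feats[mask].mean(dim=0), dim=0)
    return centroids, assign


def proto_contrastive_loss(img_feats, txt_feats, k=16, tau=0.1):
    """The i-th image is classified against the text prototypes with the
    cluster of its paired caption as the target class, and symmetrically
    for text against the image prototypes."""
    img_feats = F.normalize(img_feats, dim=-1)
    txt_feats = F.normalize(txt_feats, dim=-1)

    txt_protos, txt_assign = kmeans_prototypes(txt_feats, k)
    img_protos, img_assign = kmeans_prototypes(img_feats, k)

    logits_i2t = img_feats @ txt_protos.t() / tau   # image -> text prototypes
    logits_t2i = txt_feats @ img_protos.t() / tau   # text  -> image prototypes
    return 0.5 * (F.cross_entropy(logits_i2t, txt_assign) +
                  F.cross_entropy(logits_t2i, img_assign))


# Example with random features standing in for encoder outputs of one episode:
images = torch.randn(256, 512)
captions = torch.randn(256, 512)
print(proto_contrastive_loss(images, captions))
```
In the paper, Prototypical Back Translation further decouples grouping from alignment so that targets remain meaningful under a large modality gap and external teachers can be introduced; the direct cross-modal assignment above is only a simplified stand-in for that mechanism.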
Related papers
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP [84.90129481336659]
We study transferable representation learning underlying CLIP and demonstrate how features from different modalities get aligned.
Inspired by our analysis, we propose a new CLIP-type approach, which achieves better performance than CLIP and other state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2023-10-02T06:41:30Z)
- Improving CLIP Training with Language Rewrites [57.935517901210225]
We introduce Language augmented CLIP (LaCLIP) to enhance CLIP training through language rewrites.
We show that LaCLIP significantly improves the transfer performance without computation or memory overhead during training.
Specifically for ImageNet zero-shot accuracy, LaCLIP outperforms CLIP by 8.2% on CC12M and 2.4% on LAION-400M.
arXiv Detail & Related papers (2023-05-31T17:59:04Z)
- Iterative Prompt Learning for Unsupervised Backlit Image Enhancement [86.90993077000789]
We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT.
We show that the open-world CLIP prior aids in distinguishing between backlit and well-lit images.
Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved.
arXiv Detail & Related papers (2023-03-30T17:37:14Z)
- SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation [26.079055078561986]
We propose a CLIP-based model named SegCLIP for the topic of open-vocabulary segmentation.
The main idea is to gather patches with learnable centers to semantic regions through training on text-image pairs.
Experimental results show that our model achieves comparable or superior segmentation accuracy.
arXiv Detail & Related papers (2022-11-27T12:38:52Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning to achieve text-to-pixel alignment.
Our proposed framework significantly outperforms the previous state of the art without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z)