LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models
- URL: http://arxiv.org/abs/2312.00674v1
- Date: Fri, 1 Dec 2023 15:54:55 GMT
- Title: LightCLIP: Learning Multi-Level Interaction for Lightweight
Vision-Language Models
- Authors: Ying Nie, Wei He, Kai Han, Yehui Tang, Tianyu Guo, Fanyi Du, Yunhe
Wang
- Abstract summary: We propose a multi-level interaction paradigm for training lightweight CLIP models.
An auxiliary fusion module injecting unmasked image embedding into masked text embedding is proposed.
- Score: 45.672539931681065
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language pre-training like CLIP has shown promising performance on
various downstream tasks such as zero-shot image classification and image-text
retrieval. Most existing CLIP-like works adopt relatively large
image encoders like ResNet50 and ViT, while the lightweight counterparts are
rarely discussed. In this paper, we propose a multi-level interaction paradigm
for training lightweight CLIP models. Firstly, to mitigate the problem that
some image-text pairs are not in strict one-to-one correspondence, we improve
the conventional global instance-level alignment objective by softening the
label of negative samples progressively. Secondly, a relaxed bipartite matching
based token-level alignment objective is introduced for finer-grained alignment
between image patches and textual words. Moreover, based on the observation
that the accuracy of the CLIP model does not increase correspondingly as the
parameters of the text encoder increase, an additional masked language
modeling (MLM) objective is leveraged to maximize the potential of the shortened text
encoder. In practice, an auxiliary fusion module injecting unmasked image
embedding into masked text embedding at different network stages is proposed
to enhance the MLM objective. Extensive experiments show that, without introducing
additional computational cost during inference, the proposed method achieves
higher performance on multiple downstream tasks.
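The abstract describes three training objectives: a progressively softened instance-level contrastive loss, a token-level alignment based on relaxed bipartite matching, and an image-conditioned MLM enhanced by an auxiliary fusion module. The PyTorch sketch below illustrates one plausible form of these ideas; the names (instance_level_loss, token_level_loss, MaskedTextFusion), the softening schedule, and the use of exact Hungarian matching as a stand-in for the paper's relaxed matcher are all assumptions, not the authors' implementation.

```python
# Illustrative sketch only: names, the softening schedule, and the matcher
# are assumptions; the paper's actual losses and fusion module may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment  # stand-in matcher


def instance_level_loss(img_emb, txt_emb, temperature=0.07, soften=0.0):
    """Global image-text contrastive loss with softened negative labels.

    `soften` is assumed to ramp from 0 (hard one-hot targets, i.e. vanilla
    CLIP) to a small positive value over training, spreading some probability
    mass onto negatives that are not strictly mismatched.
    """
    img_emb = F.normalize(img_emb, dim=-1)        # (B, D)
    txt_emb = F.normalize(txt_emb, dim=-1)        # (B, D)
    logits = img_emb @ txt_emb.t() / temperature  # (B, B)
    n = logits.size(0)
    targets = torch.full_like(logits, soften / max(n - 1, 1))
    targets.fill_diagonal_(1.0 - soften)
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(targets * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


def token_level_loss(patch_emb, word_emb, temperature=0.07):
    """Finer-grained patch-word alignment via bipartite matching.

    Hungarian matching on the cosine-similarity cost is used here as a simple
    stand-in for the paper's relaxed bipartite matching.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)    # (P, D) image patch tokens
    word_emb = F.normalize(word_emb, dim=-1)      # (W, D) text word tokens
    sim = patch_emb @ word_emb.t()                # (P, W)
    rows, cols = linear_sum_assignment((-sim).detach().cpu().numpy())
    rows = torch.as_tensor(rows, device=sim.device)
    cols = torch.as_tensor(cols, device=sim.device)
    # Treat each matched word index as the target "class" of its patch.
    return F.cross_entropy(sim[rows] / temperature, cols)


class MaskedTextFusion(nn.Module):
    """Auxiliary fusion sketch: inject unmasked image embeddings into the
    masked text stream via cross-attention to strengthen the MLM objective."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, masked_text_tokens, image_tokens):
        # masked_text_tokens: (B, W, D); image_tokens: (B, P, D)
        fused, _ = self.attn(masked_text_tokens, image_tokens, image_tokens)
        return self.norm(masked_text_tokens + fused)
```

In such a setup the total training loss would presumably be a weighted sum of the instance-level loss, the token-level loss, and the MLM cross-entropy computed on the fusion output; the fusion module acts as an auxiliary branch that can be dropped at inference, which is consistent with the abstract's claim of no additional inference cost.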
Related papers
- Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment [0.7499722271664144]
Contrastive Language and Image Pairing (CLIP) is a transformative method in multimedia retrieval.
CLIP typically trains two neural networks concurrently to generate joint embeddings for text and image pairs.
This paper addresses the challenge of optimizing CLIP models for various image-based similarity search scenarios.
arXiv Detail & Related papers (2024-09-03T14:33:01Z) - Diffusion Feedback Helps CLIP See Better [40.125318318373715]
Contrastive Language-Image Pre-training (CLIP) excels at abstracting open-world representations across domains and modalities.
CLIP has severe visual shortcomings: for example, it can hardly distinguish orientation, quantity, color, and structure.
We present a post-training approach for CLIP models, which largely overcomes its visual shortcomings via a self-supervised diffusion process.
arXiv Detail & Related papers (2024-07-29T17:00:09Z) - Enhancing Vision-Language Model with Unmasked Token Alignment [37.12838142681491]
This paper introduces Unmasked Token Alignment (UTA), a method that leverages existing CLIP models to further enhance their vision-language representations.
UTA trains a Vision Transformer (ViT) by aligning unmasked visual tokens to the corresponding image tokens from a frozen CLIP vision encoder, which automatically aligns the ViT model with the CLIP text encoder.
arXiv Detail & Related papers (2024-05-29T11:48:17Z) - Fine-tuning CLIP Text Encoders with Two-step Paraphrasing [83.3736789315201]
We introduce a straightforward fine-tuning approach to enhance the representations of CLIP models for paraphrases.
Our model, which we call ParaCLIP, exhibits significant improvements over baseline CLIP models across various tasks.
arXiv Detail & Related papers (2024-02-23T06:11:50Z) - Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z) - COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for
Cross-Modal Retrieval [59.15034487974549]
We propose a novel COllaborative Two-Stream vision-language pretraining model termed COTS for image-text retrieval.
Our COTS achieves the highest performance among all two-stream methods and comparable performance while being 10,800X faster in inference.
Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state-of-the-art on the widely-used MSR-VTT dataset.
arXiv Detail & Related papers (2022-04-15T12:34:47Z) - CRIS: CLIP-Driven Referring Image Segmentation [71.56466057776086]
We propose an end-to-end CLIP-Driven Referring Image Segmentation framework (CRIS).
CRIS resorts to vision-language decoding and contrastive learning for achieving the text-to-pixel alignment.
Our proposed framework significantly outperforms previous state-of-the-art methods without any post-processing.
arXiv Detail & Related papers (2021-11-30T07:29:08Z) - Learning Contrastive Representation for Semantic Correspondence [150.29135856909477]
We propose a multi-level contrastive learning approach for semantic matching.
We show that image-level contrastive learning is a key component to encourage the convolutional features to find correspondence between similar objects.
arXiv Detail & Related papers (2021-09-22T18:34:14Z)