CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention
- URL: http://arxiv.org/abs/2209.14169v1
- Date: Wed, 28 Sep 2022 15:22:11 GMT
- Title: CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention
- Authors: Ziyu Guo, Renrui Zhang, Longtian Qiu, Xianzheng Ma, Xupeng Miao,
Xuming He, Bin Cui
- Abstract summary: Contrastive Language-Image Pre-training has been shown to learn visual representations with great transferability.
Existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets.
We introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free Attention module.
- Score: 31.84299688413136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual
representations with great transferability, which achieves promising accuracy
for zero-shot classification. To further improve its downstream performance,
existing works propose additional learnable modules on top of CLIP and fine-tune
them on few-shot training sets. However, the resulting extra training cost and
data requirements severely hinder the efficiency of model deployment and
knowledge transfer. In this paper, we introduce a free-lunch enhancement
method, CALIP, to boost CLIP's zero-shot performance via a parameter-free
Attention module. Specifically, we guide visual and textual representations to
interact with each other and explore cross-modal informative features via
attention. As the pre-training has largely reduced the embedding distances
between two modalities, we discard all learnable parameters in the attention
and bidirectionally update the multi-modal features, enabling the whole process
to be parameter-free and training-free. In this way, the images are blended
with textual-aware signals and the text representations become visual-guided
for better adaptive zero-shot alignment. We evaluate CALIP on various
benchmarks of 14 datasets for both 2D image and 3D point cloud few-shot
classification, showing consistent zero-shot performance improvement over CLIP.
Based on that, we further insert a small number of linear layers into CALIP's
attention module and verify its robustness under few-shot settings, where it
also achieves leading performance compared to existing methods. These extensive
experiments demonstrate the superiority of our approach for efficient
enhancement of CLIP.
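The core mechanism above, bidirectional attention between pre-aligned CLIP features without any learnable projections, can be sketched roughly as follows. This is a minimal illustration under stated assumptions (the function name, the temperature `beta`, and the plain softmax weighting are not taken from the paper), not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def calip_style_parameter_free_attention(visual_feats, text_feats, beta=1.0):
    """Bidirectional cross-modal attention with no learnable weights.

    visual_feats: (N, D) spatial visual features from CLIP's image encoder
                  (e.g. one vector per patch or per projected point).
    text_feats:   (C, D) textual features, one per class prompt.
    """
    # L2-normalize so dot products behave like cosine similarities,
    # matching CLIP's pre-aligned embedding space.
    v = F.normalize(visual_feats, dim=-1)
    t = F.normalize(text_feats, dim=-1)

    # Raw cross-modal similarities serve directly as attention logits:
    # no query/key/value projections are learned.
    attn = v @ t.t()                                       # (N, C)

    # Visual features aggregate textual signals (text-aware visual features).
    v_updated = F.softmax(beta * attn, dim=-1) @ t         # (N, D)
    # Text features aggregate visual signals (visual-guided text features).
    t_updated = F.softmax(beta * attn.t(), dim=-1) @ v     # (C, D)

    return v_updated, t_updated
```

In a zero-shot classifier, one plausible use of the outputs is to blend the original global image-text similarity with similarities computed from the updated visual and textual features before the final softmax over classes.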
Related papers
- FALIP: Visual Prompt as Foveal Attention Boosts CLIP Zero-Shot Performance [7.041364616661048]
Foveal-Attention CLIP (FALIP) adjusts CLIP's attention by inserting foveal attention masks into the multi-head self-attention module; a minimal sketch of this masking idea appears after this list.
FALIP effectively boosts CLIP's zero-shot performance in tasks such as referring expression comprehension, image classification, and 3D point cloud recognition.
arXiv Detail & Related papers (2024-07-08T03:23:13Z)
- Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification [54.96876797812238]
We present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts.
The key to constructing a high-quality CODER lies in creating a large number of high-quality and diverse texts to match with images.
Experiment results across various datasets and models confirm CODER's effectiveness.
arXiv Detail & Related papers (2024-04-27T02:04:36Z)
- Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z)
- Boosting Visual-Language Models by Exploiting Hard Samples [126.35125029639168]
HELIP is a cost-effective strategy tailored to enhance the performance of existing CLIP models.
Our method allows for effortless integration with existing models' training pipelines.
On comprehensive benchmarks, HELIP consistently boosts existing models to achieve leading performance.
arXiv Detail & Related papers (2023-05-09T07:00:17Z)
- Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting [111.49781716597984]
We propose a multimodal prompt learning scheme that balances supervised and zero-shot performance within a single unified training procedure.
We can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting.
arXiv Detail & Related papers (2023-04-06T18:00:04Z)
- Non-Contrastive Learning Meets Language-Image Pre-Training [145.6671909437841]
We study the validity of non-contrastive language-image pre-training (nCLIP).
We introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics.
arXiv Detail & Related papers (2022-10-17T17:57:46Z)
- Masked Unsupervised Self-training for Zero-shot Image Classification [98.23094305347709]
Masked Unsupervised Self-Training (MUST) is a new approach that leverages two different and complementary sources of supervision: pseudo-labels and raw images.
MUST improves upon CLIP by a large margin and narrows the performance gap between unsupervised and supervised classification.
arXiv Detail & Related papers (2022-06-07T02:03:06Z)
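As noted in the FALIP entry above, steering CLIP's multi-head self-attention with a foveal mask can be pictured as an additive bias on the attention logits. The function below is a hedged illustration assuming a single head and an additive-bias formulation; the names, the bias value, and the masking scheme are assumptions rather than FALIP's actual design.

```python
import torch
import torch.nn.functional as F

def masked_self_attention(q, k, v, region_mask, bias=2.0):
    """Single-head self-attention with an additive foveal-style mask.

    q, k, v:     (T, D) token queries, keys, and values.
    region_mask: (T,) boolean tensor, True for tokens inside the prompted
                 (foveated) image region.
    bias:        how strongly attention is steered toward masked tokens;
                 the additive-bias formulation and its value are assumptions.
    """
    d = q.shape[-1]
    scores = q @ k.t() / d ** 0.5                        # (T, T) attention logits

    # Boost the columns (attended-to tokens) that fall inside the region,
    # so every query attends more strongly to the foveated area.
    scores = scores + bias * region_mask.float().unsqueeze(0)

    weights = F.softmax(scores, dim=-1)
    return weights @ v                                   # (T, D)
```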