IDEA: Image Description Enhanced CLIP-Adapter
- URL: http://arxiv.org/abs/2501.08816v2
- Date: Sun, 19 Jan 2025 02:34:44 GMT
- Title: IDEA: Image Description Enhanced CLIP-Adapter
- Authors: Zhipeng Ye, Feng Jiang, Qiufeng Wang, Kaizhu Huang, Jiaqi Huang
- Abstract summary: We propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. IDEA captures fine-grained features by leveraging both visual features and textual descriptions of images. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets.
- Score: 23.446016867479138
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: CLIP (Contrastive Language-Image Pre-training) has attained great success in pattern recognition and computer vision. Transferring CLIP to downstream tasks (e.g. zero- or few-shot classification) is a hot topic in multimodal learning. However, current studies primarily focus on either prompt learning for text or adapter tuning for vision, without fully exploiting the complementary information and correlations among image-text pairs. In this paper, we propose an Image Description Enhanced CLIP-Adapter (IDEA) method to adapt CLIP to few-shot image classification tasks. This method captures fine-grained features by leveraging both visual features and textual descriptions of images. IDEA is a training-free method for CLIP, and its performance is comparable to or even exceeds that of state-of-the-art models on multiple tasks. Furthermore, we introduce Trainable-IDEA (T-IDEA), which extends IDEA by adding two lightweight learnable components (i.e., a projector and a learnable latent space), further enhancing the model's performance and achieving SOTA results on 11 datasets. As one important contribution, we employ the Llama model and design a comprehensive pipeline to generate textual descriptions for images of 11 datasets, resulting in a total of 1,637,795 image-text pairs, named "IMD-11". Our code and data are released at https://github.com/FourierAI/IDEA.
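The abstract describes a training-free few-shot classifier that uses both visual features and textual descriptions of the support images. The sketch below is only an illustration of that general idea, not the authors' implementation: it borrows the cache-style affinity term exp(-beta * (1 - similarity)) known from Tip-Adapter, and the blending of image and description similarity through a `mix` weight is an assumption made here. All names (`few_shot_logits`, `predict`, `alpha`, `beta`, `mix`) and the toy 2-D embeddings are hypothetical; real use would take embeddings from CLIP's image and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def few_shot_logits(query, class_text, support_img, support_desc,
                    support_labels, alpha=1.0, beta=5.5, mix=0.5):
    """Training-free class scores for one query image embedding.

    ASSUMPTION: the score is a zero-shot term (query vs. class-name text
    embedding) plus a cache term built from the few-shot support set, where
    each support sample contributes to its own class according to a blend of
    image-image and image-description similarity. This is an illustrative
    formulation, not the formula from the paper.
    """
    # Zero-shot term: similarity between the query and each class prompt.
    logits = [cosine(query, text) for text in class_text]
    # Cache term: each labeled support sample votes for its class.
    for img, desc, label in zip(support_img, support_desc, support_labels):
        affinity = mix * cosine(query, img) + (1.0 - mix) * cosine(query, desc)
        logits[label] += alpha * math.exp(-beta * (1.0 - affinity))
    return logits


def predict(logits):
    """Index of the highest-scoring class."""
    return max(range(len(logits)), key=logits.__getitem__)


# Toy 2-class example with hand-made 2-D "embeddings" (hypothetical data).
class_text = [[1.0, 0.0], [0.0, 1.0]]
support_img = [[0.9, 0.1], [0.1, 0.9]]
support_desc = [[0.8, 0.2], [0.2, 0.8]]
support_labels = [0, 1]

logits_a = few_shot_logits([0.7, 0.3], class_text, support_img,
                           support_desc, support_labels)
logits_b = few_shot_logits([0.3, 0.7], class_text, support_img,
                           support_desc, support_labels)
print(predict(logits_a), predict(logits_b))
```

Because nothing is learned, adding a new labeled example only means appending its image and description embeddings to the support lists; `alpha`, `beta`, and `mix` would be chosen on a validation split.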
Related papers
- TULIP: Towards Unified Language-Image Pretraining [60.99500935831526]
We introduce TULIP, an open-source, drop-in replacement for existing CLIP-like models.
Our method leverages generative data augmentation, enhanced image-image and text-text contrastive learning, and image/text reconstruction regularization to learn fine-grained visual features.
Our approach, scaling to over 1B parameters, outperforms existing state-of-the-art (SOTA) models across benchmarks.
arXiv Detail & Related papers (2025-03-19T17:58:57Z)
- DiffCLIP: Few-shot Language-driven Multimodal Classifier [19.145645804307566]
DiffCLIP is a novel framework that extends Contrastive Language-Image Pretraining.
It conveys comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images.
DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP.
arXiv Detail & Related papers (2024-12-10T02:21:39Z)
- Ranking-aware adapter for text-driven image ordering with CLIP [76.80965830448781]
We propose an effective yet efficient approach that reframes the CLIP model into a learning-to-rank task. We introduce a lightweight adapter to augment CLIP for text-guided image ranking. Our approach provides a natural and generalized way of learning from visual differences across images.
arXiv Detail & Related papers (2024-12-09T18:51:05Z)
- TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives [65.82577305915643]
Contrastive Language-Image Pretraining (CLIP) models maximize the mutual information between text and visual modalities to learn representations.
We show that generating "hard" negative captions via in-context learning and corresponding negative images with text-to-image generators offers a solution.
We demonstrate that our method, named TripletCLIP, enhances the compositional capabilities of CLIP, resulting in an absolute improvement of over 9% on the SugarCrepe benchmark.
arXiv Detail & Related papers (2024-11-04T19:24:59Z)
- Enhancing Image Retrieval : A Comprehensive Study on Photo Search using the CLIP Mode [0.27195102129095]
Photo search has witnessed significant advancements with the introduction of the CLIP (Contrastive Language-Image Pretraining) model.
This abstract summarizes the foundational principles of CLIP and highlights its potential impact on advancing the field of photo search.
arXiv Detail & Related papers (2024-01-24T17:35:38Z)
- CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement [65.47237619200442]
Contrastive language image pretraining (CLIP) is a standard method for training vision-language models.
We augment CLIP training with task-specific vision models from model zoos to improve its visual representations.
This simple setup shows substantial improvements of up to 16.3% across different vision tasks.
arXiv Detail & Related papers (2023-10-21T20:20:13Z)
- GraphAdapter: Tuning Vision-Language Models With Dual Knowledge Graph [63.81641578763094]
Adapter-style efficient transfer learning (ETL) has shown excellent performance in the tuning of vision-language models (VLMs).
We propose an effective adapter-style tuning strategy, dubbed GraphAdapter, which performs the textual adapter by explicitly modeling the dual-modality structure knowledge.
In particular, the dual knowledge graph is established with two sub-graphs, i.e., a textual knowledge sub-graph, and a visual knowledge sub-graph, where the nodes and edges represent the semantics/classes and their correlations in two modalities, respectively.
arXiv Detail & Related papers (2023-09-24T12:56:40Z)
- Composed Image Retrieval using Contrastive Learning and Task-oriented CLIP-based Features [32.138956674478116]
Given a query composed of a reference image and a relative caption, the goal of Composed Image Retrieval is to retrieve images visually similar to the reference one.
We use features from the OpenAI CLIP model to tackle the considered task.
We train a Combiner network that learns to combine the image-text features integrating the bimodal information.
arXiv Detail & Related papers (2023-08-22T15:03:16Z)
- Self-Supervised Image Captioning with CLIP [0.0]
We introduce a self-supervised image captioning method.
After learning an initial signal from a small labeled dataset, our method transitions to self-supervised learning on unlabeled data.
Despite utilizing less than 2% of the labeled COCO dataset, our method delivers a performance comparable to state-of-the-art models trained on the complete dataset.
arXiv Detail & Related papers (2023-06-26T23:29:16Z)
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment [146.3128011522151]
We propose an Omni Crossmodal Learning method equipped with a Video Proxy mechanism on the basis of CLIP, namely CLIP-ViP.
Our approach improves the performance of CLIP on video-text retrieval by a large margin.
Our model also achieves SOTA results on a variety of datasets, including MSR-VTT, DiDeMo, LSMDC, and ActivityNet.
arXiv Detail & Related papers (2022-09-14T05:47:02Z)
- Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaption capability, existing methods proposed to fine-tune additional learnable modules.
We propose a training-free adaption method for CLIP to conduct few-shot classification, termed Tip-Adapter.
arXiv Detail & Related papers (2022-07-19T19:12:11Z)
- Unsupervised Prompt Learning for Vision-Language Models [12.259694415428026]
We propose an unsupervised prompt learning (UPL) framework to improve the zero-shot transfer of CLIP-like vision-language models.
An enhanced version of UPL is even on par with the 8-shot CoOp and the 8-shot TIP-Adapter on most datasets.
arXiv Detail & Related papers (2022-04-07T17:59:57Z)