APoLLo: Unified Adapter and Prompt Learning for Vision Language Models
- URL: http://arxiv.org/abs/2312.01564v1
- Date: Mon, 4 Dec 2023 01:42:09 GMT
- Title: APoLLo: Unified Adapter and Prompt Learning for Vision Language Models
- Authors: Sanjoy Chowdhury, Sayan Nag, Dinesh Manocha
- Abstract summary: We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
- Score: 58.9772868980283
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The choice of input text prompt plays a critical role in the performance of
Vision-Language Pretrained (VLP) models such as CLIP. We present APoLLo, a
unified multi-modal approach that combines Adapter and Prompt learning for
Vision-Language models. Our method is designed to substantially improve the
generalization capabilities of VLP models when they are fine-tuned in a
few-shot setting. We introduce trainable cross-attention-based adapter layers
in conjunction with vision and language encoders to strengthen the alignment
between the two modalities. We enforce consistency between the respective
encoder branches (receiving augmented inputs) to prevent overfitting in
downstream tasks. Our method is evaluated on three representative tasks:
generalization to novel classes, cross-dataset evaluation, and unseen domain
shifts. In practice, APoLLo achieves a relative gain of up to 6.03% over MaPLe
(SOTA) on novel classes across 10 diverse image recognition datasets.
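The abstract describes trainable cross-attention adapter layers that let one modality attend to the other. As a rough, minimal NumPy sketch of that idea (the function name, shapes, and single-head projection setup are illustrative assumptions, not APoLLo's actual implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_adapter(text_feats, image_feats, W_q, W_k, W_v):
    """One cross-attention adapter step: text tokens attend to image patches.

    text_feats:  (T, d) token features from the language encoder
    image_feats: (P, d) patch features from the vision encoder
    W_q, W_k, W_v: (d, d) trainable projections (the adapter's only parameters)
    """
    q = text_feats @ W_q                       # (T, d) queries from text
    k = image_feats @ W_k                      # (P, d) keys from image
    v = image_feats @ W_v                      # (P, d) values from image
    scores = q @ k.T / np.sqrt(q.shape[-1])    # (T, P) scaled similarities
    attn = softmax(scores, axis=-1)            # each text token attends over patches
    # Residual connection keeps the frozen backbone's features intact.
    return text_feats + attn @ v

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((4, d))             # 4 text tokens
image = rng.standard_normal((16, d))           # 16 image patches
W_q, W_k, W_v = (0.1 * rng.standard_normal((d, d)) for _ in range(3))
out = cross_attention_adapter(text, image, W_q, W_k, W_v)
print(out.shape)  # (4, 8): same shape as the text features
```

Because only the small projection matrices are trained while the encoders stay frozen, such adapters fit the few-shot fine-tuning setting the abstract targets.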
Related papers
- ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning [38.26304604660713]
ADEM-VL is an efficient vision-language method that tunes models based on pretrained large language models.
Our framework surpasses existing methods by an average accuracy of 0.77% on ScienceQA dataset.
arXiv Detail & Related papers (2024-10-23T11:31:06Z)
- PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter [21.45490901191175]
PaLM2-VAdapter employs a progressively aligned language model as the vision-language adapter.
Our method achieves these advancements with 30-70% fewer parameters than the state-of-the-art large vision-language models.
arXiv Detail & Related papers (2024-02-16T18:54:47Z)
- LaViP: Language-Grounded Visual Prompts [27.57227844809257]
We introduce a language-grounded visual prompting method to adapt the visual encoder of vision-language models for downstream tasks.
By capitalizing on language integration, we devise a parameter-efficient strategy to adjust the input of the visual encoder.
Our algorithm can operate even in black-box scenarios, showcasing adaptability in situations where access to the model's parameters is constrained.
arXiv Detail & Related papers (2023-12-18T05:50:10Z)
- Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z)
- SgVA-CLIP: Semantic-guided Visual Adapting of Vision-Language Models for Few-shot Image Classification [84.05253637260743]
We propose a new framework, named Semantic-guided Visual Adapting (SgVA), to extend vision-language pre-trained models.
SgVA produces discriminative task-specific visual features by comprehensively using a vision-specific contrastive loss, a cross-modal contrastive loss, and an implicit knowledge distillation.
State-of-the-art results on 13 datasets demonstrate that the adapted visual features can well complement the cross-modal features to improve few-shot image classification.
arXiv Detail & Related papers (2022-11-28T14:58:15Z)
- MaPLe: Multi-modal Prompt Learning [54.96069171726668]
We propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations.
Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes.
arXiv Detail & Related papers (2022-10-06T17:59:56Z)
- Single-Stream Multi-Level Alignment for Vision-Language Pretraining [103.09776737512078]
We propose a single stream model that aligns the modalities at multiple levels.
We achieve this using two novel tasks: symmetric cross-modality reconstruction and pseudo-labeled keyword prediction.
We demonstrate top performance on a set of Vision-Language downstream tasks such as zero-shot/fine-tuned image/text retrieval, referring expression, and VQA.
arXiv Detail & Related papers (2022-03-27T21:16:10Z)
- Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training [139.4566371416662]
Vision-Language Pre-training aims to learn multi-modal representations from image-text pairs.
CNNs have limitations in visual relation learning because the local receptive field is weak at modeling long-range dependencies.
arXiv Detail & Related papers (2021-06-25T08:04:25Z)
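A recurring objective across the papers above is contrastive alignment of image and text embeddings in a shared space. As a hedged illustration of that common building block (a CLIP-style symmetric contrastive loss; the function name and shapes are illustrative, not taken from any specific paper above):

```python
import numpy as np

def clip_style_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric image-text contrastive loss over a batch of matched pairs.

    img_emb, txt_emb: (B, d) embeddings where row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature   # (B, B) pairwise similarities
    labels = np.arange(len(img))         # matched pairs sit on the diagonal

    def cross_entropy(l):
        # Log-softmax over each row, then pick the diagonal (correct pair).
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Symmetric: image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(1)
imgs = rng.standard_normal((8, 16))
txts = imgs + 0.1 * rng.standard_normal((8, 16))  # roughly aligned pairs
loss = clip_style_contrastive_loss(imgs, txts)
```

The adapter- and prompt-learning methods listed above typically keep this alignment objective (or a variant of it) while training only a small number of extra parameters.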
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.