ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
- URL: http://arxiv.org/abs/2501.11175v1
- Date: Sun, 19 Jan 2025 21:25:53 GMT
- Title: ProKeR: A Kernel Perspective on Few-Shot Adaptation of Large Vision-Language Models
- Authors: Yassir Bendou, Amine Ouasfi, Vincent Gripon, Adnane Boukhayma
- Abstract summary: The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters. We propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space using CLIP as a base learner.
- Score: 8.66217922377209
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed-form solution and achieves state-of-the-art performance across 11 datasets in the standard few-shot adaptation benchmark.
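To make the abstract's kernel view concrete, the sketch below contrasts a Tip-Adapter-style cache (a local adapter) with a ProKeR-style proximal kernel ridge regression that stays close to the CLIP zero-shot predictor. This is a minimal NumPy sketch based only on the abstract: the RBF-style kernel, the use of softmax zero-shot probabilities as the base learner, and the hyperparameters `alpha`, `beta` and `lam` are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Row-wise softmax, used here to turn CLIP-style cosine logits into probabilities.
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def tip_adapter_logits(z, cache_keys, cache_values, clip_scores, alpha=1.0, beta=5.5):
    # Training-free cache ("local adapter"): affinity-weighted vote over the
    # few-shot cache, blended with the zero-shot CLIP scores.
    affinity = np.exp(-beta * (1.0 - z @ cache_keys.T))       # (n_test, n_shots)
    cache_logits = affinity @ cache_values                    # cache_values: one-hot labels
    return clip_scores + alpha * cache_logits

def proker_predict(z, train_feats, train_onehot, zero_shot, lam=1.0, beta=5.5):
    # Proximal kernel ridge regression sketch: fit the residual between the
    # few-shot labels and the zero-shot predictor in an RKHS, which keeps the
    # learned function close to CLIP (the base learner) and has a closed form.
    def kernel(a, b):
        # Assumed RBF-style kernel on L2-normalized features (illustrative choice).
        return np.exp(-beta * (1.0 - a @ b.T))
    K = kernel(train_feats, train_feats)                      # (n_shots, n_shots)
    residual = train_onehot - zero_shot(train_feats)          # what zero-shot CLIP misses
    coef = np.linalg.solve(K + lam * np.eye(K.shape[0]), residual)
    return zero_shot(z) + kernel(z, train_feats) @ coef

# Toy usage with random L2-normalized features (purely illustrative).
rng = np.random.default_rng(0)
d, n_shots, n_test, n_cls = 512, 16, 4, 10
l2 = lambda x: x / np.linalg.norm(x, axis=1, keepdims=True)
W_text = l2(rng.normal(size=(n_cls, d)))                      # stand-in for CLIP text embeddings
zero_shot = lambda x: softmax(100.0 * x @ W_text.T)           # zero-shot class probabilities
Xs = l2(rng.normal(size=(n_shots, d)))                        # few-shot image features
Xt = l2(rng.normal(size=(n_test, d)))                         # test image features
Y = np.eye(n_cls)[rng.integers(n_cls, size=n_shots)]          # one-hot few-shot labels
print(tip_adapter_logits(Xt, Xs, Y, zero_shot(Xt)).shape)     # (4, 10)
print(proker_predict(Xt, Xs, Y, zero_shot).shape)             # (4, 10)
```

The point of the sketch is the structural difference: the cache is a purely local, affinity-weighted vote over the few-shot examples, whereas the proximal kernel ridge regression solves a single closed-form system that corrects the zero-shot base learner globally wherever it disagrees with the few-shot labels.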
Related papers
- Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models [85.51753014478315]
We introduce AdaptPrune, a novel plug-and-play training-free pruning method.
It builds on conventional attention-based pruning by integrating spatial distance and token similarity with an adaptive NMS approach.
Our approach ensures a comprehensive evaluation of token importance and substantially refines the pruning decisions.
arXiv Detail & Related papers (2025-03-11T03:58:17Z) - SelaVPR++: Towards Seamless Adaptation of Foundation Models for Efficient Place Recognition [69.58329995485158]
Recent studies show that visual place recognition (VPR) methods using pre-trained visual foundation models can achieve promising performance.
We propose a novel method to realize seamless adaptation of foundation models to VPR.
In pursuit of higher efficiency and better performance, we propose an extension of the SelaVPR, called SelaVPR++.
arXiv Detail & Related papers (2025-02-23T15:01:09Z) - Local Methods with Adaptivity via Scaling [38.99428012275441]
This paper aims to merge local training techniques with adaptive approaches to develop efficient distributed learning methods.
We consider the classical Local SGD method and enhance it with a scaling feature.
In addition to theoretical analysis, we validate the performance of our methods in practice by training a neural network.
arXiv Detail & Related papers (2024-06-02T19:50:05Z) - CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification [3.594351309950969]
CapS-Adapter is an innovative method that harnesses both image and caption features to exceed existing state-of-the-art techniques in training-free scenarios.
Our method achieves outstanding zero-shot classification results across 19 benchmark datasets, improving accuracy by 2.19% over the previous leading method.
arXiv Detail & Related papers (2024-05-26T14:50:40Z) - Pay Attention to Your Neighbours: Training-Free Open-Vocabulary Semantic Segmentation [19.20874993309959]
Vision-language foundation models, such as CLIP, have showcased remarkable effectiveness in numerous zero-shot image-level tasks.
We propose a baseline for training-free open-vocabulary semantic segmentation (OVSS), termed Neighbour-Aware CLIP (NACLIP).
Our method enforces localization of patches in the self-attention of CLIP's vision transformer, which, despite being crucial for dense prediction tasks, has been overlooked in the OVSS literature.
arXiv Detail & Related papers (2024-04-12T01:08:04Z) - Meta-Adapter: An Online Few-shot Learner for Vision-Language Model [64.21017759533474]
Contrastive vision-language pre-training, known as CLIP, demonstrates remarkable potential in perceiving open-world visual concepts.
Few-shot learning methods based on CLIP typically require offline fine-tuning of the parameters on few-shot samples.
We propose the Meta-Adapter, a lightweight residual-style adapter, to refine the CLIP features guided by the few-shot samples in an online manner.
arXiv Detail & Related papers (2023-11-07T07:27:16Z) - Class Incremental Learning with Pre-trained Vision-Language Models [59.15538370859431]
We propose an approach to exploiting pre-trained vision-language models (e.g. CLIP) that enables further adaptation.
Experiments on several conventional benchmarks consistently show a significant margin of improvement over the current state-of-the-art.
arXiv Detail & Related papers (2023-10-31T10:45:03Z) - Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification [58.06983806317233]
Contrastive Vision-Language Pre-training, known as CLIP, has provided a new paradigm for learning visual representations using large-scale image-text pairs.
To enhance CLIP's adaptation capability, existing methods propose fine-tuning additional learnable modules.
We propose a training-free adaptation method for CLIP to conduct few-shot classification, termed Tip-Adapter.
arXiv Detail & Related papers (2022-07-19T19:12:11Z) - CLIP-Adapter: Better Vision-Language Models with Feature Adapters [84.88106370842883]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
CLIP-Adapter adopts an additional bottleneck layer to learn new features and performs residual-style feature blending.
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z) - Content-aware Directed Propagation Network with Pixel Adaptive Kernel Attention [20.0783340490331]
We propose a novel operation called pixel adaptive kernel attention (PAKA).
PAKA provides directivity to the filter weights by multiplying them with spatially varying attention computed from learnable features.
Our method is trainable in an end-to-end manner and applicable to any CNN-based models.
arXiv Detail & Related papers (2021-07-28T02:59:19Z)