One Last Attention for Your Vision-Language Model
- URL: http://arxiv.org/abs/2507.15480v2
- Date: Mon, 28 Jul 2025 04:47:15 GMT
- Title: One Last Attention for Your Vision-Language Model
- Authors: Liang Chen, Ghazi Shazan Ahmad, Tianjun Yao, Lingqiao Liu, Zhiqiang Shen,
- Abstract summary: We propose Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix. Experiments show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current state-of-the-art methods in most settings.
- Score: 42.872184600248914
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pretrained vision-language models (VLMs), such as CLIP, achieve remarkable zero-shot performance, yet their downstream potential hinges on effective fine-tuning. Most adaptation methods typically focus on refining representations from separate modalities (text or vision) but neglect the critical role of their fused representation in the decision-making process, i.e., the rational matrix that drives the final prediction. To bridge the gap, we propose a simple yet effective Rational Adaptation (RAda) to explicitly exploit the final fused representation during fine-tuning. RAda employs a learned mask, obtained from a lightweight attention layer attached at the end of a VLM, to dynamically calibrate the contribution of each element in the rational matrix, enabling targeted adjustments to the final cross-modal interactions without incurring costly modifications to intermediate features. Experiments in different settings (i.e., updating or freezing the pretrained encoders during adaptation, and test-time training that can only access unlabeled test data) show that RAda serves as a versatile fine-tuning technique, improving the baseline with minimal code and performing comparably against current state-of-the-art methods in most settings. Code is available at https://github.com/khufia/RAda/tree/main.
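As described above, the final CLIP prediction can be viewed as the row-sum of a "rational matrix" formed by the element-wise product of the image embedding and each class's text embedding, and RAda calibrates each element of that matrix with a mask derived from a lightweight attention layer. The following is a minimal NumPy sketch of that reading; the projections `W_q`/`W_k`, the sigmoid mask, and all sizes are illustrative assumptions, not the paper's exact architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d, C = 8, 3                            # embedding dim, number of classes (toy sizes)

z_img = rng.normal(size=(d,))          # image embedding from the vision encoder
T = rng.normal(size=(C, d))            # one text embedding per class

# Rational matrix: per-element contributions before they are summed into logits.
R = z_img[None, :] * T                 # shape (C, d); R.sum(axis=1) equals T @ z_img

# Hypothetical lightweight attention over R producing a calibration mask.
# W_q and W_k stand in for the learned projections mentioned in the abstract.
W_q = rng.normal(size=(d, d)) * 0.1
W_k = rng.normal(size=(d, d)) * 0.1
scores = (R @ W_q) @ (R @ W_k).T / np.sqrt(d)          # (C, C) attention scores
attn = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
mask = 1.0 / (1.0 + np.exp(-(attn @ R)))               # sigmoid -> mask in (0, 1)

logits = (mask * R).sum(axis=1)        # calibrated logits, one per class
print("calibrated logits:", logits)
```

With `mask` fixed to all ones, the logits reduce exactly to the standard similarity `T @ z_img`, so a layer like this only re-weights an existing decision rather than modifying intermediate features, consistent with the abstract's claim.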
Related papers
- Multimodal LLM-Guided Semantic Correction in Text-to-Image Diffusion [52.315729095824906]
MLLM Semantic-Corrected Ping-Pong-Ahead Diffusion (PPAD) is a novel framework that introduces a Multimodal Large Language Model (MLLM) as a semantic observer during inference. It performs real-time analysis on intermediate generations, identifies latent semantic inconsistencies, and translates feedback into controllable signals that actively guide the remaining denoising steps. Extensive experiments demonstrate PPAD's significant improvements.
arXiv Detail & Related papers (2025-05-26T14:42:35Z) - End-to-End Vision Tokenizer Tuning [73.3065542220568]
A vision tokenizer optimized for low-level reconstruction is agnostic to downstream tasks requiring varied representations and semantics. The vision tokenizer's loss can become the representation bottleneck for target tasks. We propose ETT, an end-to-end vision tokenizer tuning approach that enables joint optimization between vision tokenization and target autoregressive tasks.
arXiv Detail & Related papers (2025-05-15T17:59:39Z) - Post-pre-training for Modality Alignment in Vision-Language Foundation Models [12.110530026601968]
This paper presents CLIP-Refine, a post-pre-training method for CLIP models applied at a phase between pre-training and fine-tuning. It aims to align the feature space with one epoch of training on small image-text datasets without degrading zero-shot performance.
arXiv Detail & Related papers (2025-04-17T07:46:19Z) - Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations [2.992602379681373]
We show that multi-modal fine-tuning can achieve notable OoDD performance. We propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data.
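The regularizer described in this summary can be pictured as a paired-distance penalty on in-distribution (ID) data. Below is a hedged sketch of one plausible form, the mean squared distance between L2-normalized image/text embedding pairs; the paper's actual objective may differ.

```python
import numpy as np

def alignment_loss(img_emb, txt_emb):
    """Mean squared distance between paired, L2-normalized image and text
    embeddings -- an illustrative stand-in for a cross-modal alignment
    regularizer, not the paper's exact formulation."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    return float(np.mean(np.sum((img - txt) ** 2, axis=1)))
```

Because the embeddings are normalized first, this penalty is equivalent (up to a constant) to maximizing the cosine similarity of each image-text pair, which is the usual notion of alignment in CLIP-style models.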
arXiv Detail & Related papers (2025-03-24T16:00:21Z) - SPARC: Score Prompting and Adaptive Fusion for Zero-Shot Multi-Label Recognition in Vision-Language Models [74.40683913645731]
Zero-shot multi-label recognition (MLR) with Vision-Language Models (VLMs) faces significant challenges without training data, model tuning, or architectural modifications. Our work proposes a novel solution treating VLMs as black boxes, leveraging scores without training data or ground truth. Analysis of these prompt scores reveals VLM biases and "AND"/"OR" signal ambiguities, notably that maximum scores are surprisingly suboptimal compared to second-highest scores.
arXiv Detail & Related papers (2025-02-24T07:15:05Z) - Bringing Masked Autoencoders Explicit Contrastive Properties for Point Cloud Self-Supervised Learning [116.75939193785143]
Contrastive learning (CL) for Vision Transformers (ViTs) in image domains has achieved performance comparable to CL for traditional convolutional backbones.
In 3D point cloud pretraining with ViTs, masked autoencoder (MAE) modeling remains dominant.
arXiv Detail & Related papers (2024-07-08T12:28:56Z) - Spatio-Temporal Side Tuning Pre-trained Foundation Models for Video-based Pedestrian Attribute Recognition [58.79807861739438]
Existing pedestrian attribute recognition (PAR) algorithms are developed mainly for static images.
We propose to recognize human attributes from video frames, which makes full use of temporal information.
arXiv Detail & Related papers (2024-04-27T14:43:32Z) - Mind the Modality Gap: Towards a Remote Sensing Vision-Language Model via Cross-modal Alignment [4.682326604942316]
We focus on Contrastive Language-Image Pre-training (CLIP), a Vision-Language foundation model that achieves high accuracy across various image classification tasks. There are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. We propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP.
arXiv Detail & Related papers (2024-02-15T09:31:07Z) - ContrastMotion: Self-supervised Scene Motion Learning for Large-Scale LiDAR Point Clouds [21.6511040107249]
We propose a novel self-supervised motion estimator for LiDAR-based autonomous driving via BEV representation.
We predict scene motion via feature-level consistency between pillars in consecutive frames, which eliminates effects caused by noisy points and viewpoint changes in dynamic scenes.
arXiv Detail & Related papers (2023-04-25T05:46:24Z) - Black Box Few-Shot Adaptation for Vision-Language models [41.49584259596654]
Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners.
We describe a black-box method for V-L few-shot adaptation that operates on pre-computed image and text features.
We propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain.
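One concrete way to realize such a linear re-alignment on pre-computed features, consistent with the black-box setting, is a closed-form least-squares map from image features to their paired text features. This sketch illustrates the general idea only; LFA's actual procedure (e.g., any orthogonality constraint on the map) may differ.

```python
import numpy as np

def fit_linear_alignment(img_feats, txt_feats):
    """Fit a linear map W minimizing ||img_feats @ W - txt_feats||_F.

    Inputs are (n_samples, dim) arrays of pre-computed, paired features, so
    no access to model internals is needed. Illustrative sketch, not LFA's
    exact algorithm."""
    W, *_ = np.linalg.lstsq(img_feats, txt_feats, rcond=None)
    return W
```

At inference time, image features are mapped through `W` before computing similarities against class text features, which is why the approach needs only cached features rather than gradients through the encoders.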
arXiv Detail & Related papers (2023-04-04T12:42:29Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD).
During the adaptation stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.