DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
- URL: http://arxiv.org/abs/2603.01111v1
- Date: Sun, 01 Mar 2026 13:52:51 GMT
- Title: DeAR: Fine-Grained VLM Adaptation by Decomposing Attention Head Roles
- Authors: Yiming Ma, Hongkun Yang, Lionel Z. Wang, Bin Chen, Weizhi Xian, Jianzhi Teng,
- Abstract summary: We propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
- Score: 8.564506908883667
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompt learning is a dominant paradigm for adapting pre-trained Vision-Language Models (VLMs) to downstream tasks. However, existing methods often rely on a simplistic, layer-centric view, assuming shallow layers capture general features while deep layers handle task-specific knowledge. This assumption results in uncontrolled interactions between learnable tokens and original tokens. Task-specific knowledge can then degrade the model's core generalization, creating a trade-off between task adaptation and the preservation of zero-shot generalization. To address this, we challenge the layer-centric view and propose DeAR, a framework that achieves fine-grained VLM adaptation by Decomposing Attention head Roles. We posit that the functional specialization within VLMs occurs not between layers, but at the finer-grained level of individual attention heads in the deeper layers. Based on this insight, we introduce a novel metric, Concept Entropy, to systematically classify attention heads into distinct functional roles: Attribute, Generalization, and Mixed. Guided by these roles, we introduce specialized attribute tokens and a Role-Based Attention Mask mechanism to precisely control information flow, ensuring generalization heads remain isolated from task-specific knowledge. We further incorporate a Task-Adaptive Fusion Strategy for inference. Extensive experiments on fifteen datasets show that DeAR achieves a strong balance between task adaptation and generalization, outperforming previous methods across various tasks.
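The abstract names two mechanisms, an entropy-based classification of attention heads and a role-based attention mask, without defining either. The sketch below is only an illustration of how such a pipeline could look, not the authors' implementation. It assumes that a head's "concept distribution" is a probability vector over concepts, that the metric is plain Shannon entropy, that low entropy maps to the Attribute role and high entropy to the Generalization role, and that tokens are ordered as original tokens followed by learnable attribute tokens. The function names and the thresholds `low` and `high` are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def classify_head(concept_probs, low=0.5, high=1.2):
    """Assign a role to one attention head from its concept distribution.

    Assumption (not from the paper): low entropy means the head focuses on
    a few concepts ("attribute"), high entropy means it is spread across
    many ("generalization"); anything in between is "mixed".
    The thresholds are illustrative only.
    """
    h = shannon_entropy(concept_probs)
    if h < low:
        return "attribute"
    if h > high:
        return "generalization"
    return "mixed"


def role_based_mask(role, n_orig, n_attr):
    """Build a boolean attention mask (True = query may attend to key).

    Assumed token order: [original tokens..., attribute tokens...].
    """
    n = n_orig + n_attr
    mask = [[True] * n for _ in range(n)]
    if role == "generalization":
        # In generalization heads, original tokens are blocked from reading
        # the learnable attribute tokens, keeping them isolated from
        # task-specific information.
        for q in range(n_orig):
            for k in range(n_orig, n):
                mask[q][k] = False
    return mask


if __name__ == "__main__":
    heads = {
        "peaked": [1.0, 0.0, 0.0, 0.0],
        "uniform": [0.25, 0.25, 0.25, 0.25],
        "in_between": [0.7, 0.2, 0.1, 0.0],
    }
    for name, probs in heads.items():
        role = classify_head(probs)
        print(name, role, role_based_mask(role, n_orig=2, n_attr=1))
```

In this sketch only the generalization heads receive a restrictive mask; how the paper treats Attribute and Mixed heads, and how the Task-Adaptive Fusion Strategy combines outputs at inference, is not specified in the abstract and is left out here.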
Related papers
- Do All Individual Layers Help? An Empirical Study of Task-Interfering Layers in Vision-Language Models [51.754991950934375]
In a pretrained VLM, all layers are engaged by default to make predictions on downstream tasks. We find that intervening on a single layer, such as by zeroing its parameters, can improve the performance on certain tasks. We propose TaLo, a training-free, test-time adaptation method that dynamically identifies and bypasses the most interfering layer for a given task.
arXiv Detail & Related papers (2026-02-01T11:37:05Z)
- Beyond the final layer: Attentive multilayer fusion for vision transformers [45.627646781613386]
We show that task-relevant information is distributed across the network hierarchy rather than solely encoded in any of the last layers. We apply an attentive probing mechanism that dynamically fuses representations from all layers of a Vision Transformer. This mechanism learns to identify the most relevant layers for a target task and combines low-level structural cues with high-level semantic abstractions.
arXiv Detail & Related papers (2026-01-14T09:50:09Z)
- RECALL: REpresentation-aligned Catastrophic-forgetting ALLeviation via Hierarchical Model Merging [33.22889542330089]
Internal representations in large language models (LLMs) serve as reliable proxies of learned knowledge. We propose RECALL, a representation-aware model merging framework for continual learning without access to historical data.
arXiv Detail & Related papers (2025-10-23T12:17:37Z)
- Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting [70.83781268763215]
Vision-language models (VLMs) have achieved impressive performance across diverse multimodal tasks by leveraging large-scale pre-training. VLMs face unique challenges such as cross-modal feature drift, parameter interference due to shared architectures, and zero-shot capability erosion. This survey aims to serve as a comprehensive and diagnostic reference for researchers developing lifelong vision-language systems.
arXiv Detail & Related papers (2025-08-06T09:03:10Z)
- MMRL++: Parameter-Efficient and Interaction-Aware Representation Learning for Vision-Language Models [4.828668077793944]
Multi-Modal Representation Learning (MMRL) generates space tokens projected into both text and image encoders as representation tokens. MMRL++ is a parameter-efficient and interaction-aware extension that significantly reduces trainable parameters. Experiments on 15 datasets demonstrate that MMRL and MMRL++ consistently outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-05-15T08:43:53Z)
- Learning Task Representations from In-Context Learning [67.66042137487287]
Large language models (LLMs) have demonstrated remarkable proficiency in in-context learning (ICL). We introduce an automated formulation for encoding task information in ICL prompts as a function of attention heads. The proposed method successfully extracts task-specific information from in-context demonstrations and excels in both text and regression tasks.
arXiv Detail & Related papers (2025-02-08T00:16:44Z)
- Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach [77.65459419417533]
We propose an automated dataset expansion technique to support semantics-oriented DeepFake detection tasks. We also resort to the joint embedding of face images and labels (depicted by text descriptions) for prediction. Our method improves the generalizability of DeepFake detection and renders some degree of model interpretation by providing human-understandable explanations.
arXiv Detail & Related papers (2024-08-29T07:11:50Z)
- Fully Fine-tuned CLIP Models are Efficient Few-Shot Learners [8.707819647492467]
We explore capturing the task-specific information via meticulous refinement of entire Vision-Language Models (VLMs).
To mitigate these issues, we propose a framework named CLIP-CITE via designing a discriminative visual-text task.
arXiv Detail & Related papers (2024-07-04T15:22:54Z)
- Foundation Policies with Hilbert Representations [54.44869979017766]
We propose an unsupervised framework to pre-train generalist policies from unlabeled offline data.
Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment.
Our experiments show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion.
arXiv Detail & Related papers (2024-02-23T19:09:10Z)
- HCVP: Leveraging Hierarchical Contrastive Visual Prompt for Domain Generalization [69.33162366130887]
Domain Generalization (DG) endeavors to create machine learning models that excel in unseen scenarios by learning invariant features.
We introduce a novel method designed to supplement the model with domain-level and task-specific characteristics.
This approach aims to guide the model in more effectively separating invariant features from specific characteristics, thereby boosting the generalization.
arXiv Detail & Related papers (2024-01-18T04:23:21Z)
- Leveraging sparse and shared feature activations for disentangled representation learning [112.22699167017471]
We propose to leverage knowledge extracted from a diversified set of supervised tasks to learn a common disentangled representation.
We validate our approach on six real world distribution shift benchmarks, and different data modalities.
arXiv Detail & Related papers (2023-04-17T01:33:24Z)
- Distribution Matching for Heterogeneous Multi-Task Learning: a Large-scale Face Study [75.42182503265056]
Multi-Task Learning has emerged as a methodology in which multiple tasks are jointly learned by a shared learning algorithm.
We deal with heterogeneous MTL, simultaneously addressing detection, classification & regression problems.
We build FaceBehaviorNet, the first framework for large-scale face analysis, by jointly learning all facial behavior tasks.
arXiv Detail & Related papers (2021-05-08T22:26:52Z)
This list is automatically generated from the titles and abstracts of the papers in this site.