Head Pursuit: Probing Attention Specialization in Multimodal Transformers
- URL: http://arxiv.org/abs/2510.21518v1
- Date: Fri, 24 Oct 2025 14:41:47 GMT
- Title: Head Pursuit: Probing Attention Specialization in Multimodal Transformers
- Authors: Lorenzo Basile, Valentino Maiorca, Diego Doimo, Francesco Locatello, Alberto Cazzaniga
- Abstract summary: We study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output.
- Score: 32.218423952797444
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Language and vision-language models have shown impressive performance across a wide range of tasks, but their internal mechanisms remain only partly understood. In this work, we study how individual attention heads in text-generative models specialize in specific semantic or visual attributes. Building on an established interpretability method, we reinterpret the practice of probing intermediate activations with the final decoding layer through the lens of signal processing. This lets us analyze multiple samples in a principled way and rank attention heads based on their relevance to target concepts. Our results show consistent patterns of specialization at the head level across both unimodal and multimodal transformers. Remarkably, we find that editing as few as 1% of the heads, selected using our method, can reliably suppress or enhance targeted concepts in the model output. We validate our approach on language tasks such as question answering and toxicity mitigation, as well as vision-language tasks including image classification and captioning. Our findings highlight an interpretable and controllable structure within attention layers, offering simple tools for understanding and editing large-scale generative models.
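To make the idea of probing head activations with the final decoding layer concrete, here is a minimal sketch. It is not the authors' method: the abstract does not spell out their signal-processing formulation, so this uses a plain logit-lens-style score as a stand-in. The function name `rank_heads`, the tensor shapes, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def rank_heads(head_outputs, unembed, concept_token_ids):
    """Rank heads by how strongly they promote a set of concept tokens.

    head_outputs: (n_samples, n_heads, d_model) array holding each head's
        contribution to the residual stream (assumed to be extracted already).
    unembed: (d_model, vocab_size) final decoding (unembedding) matrix.
    concept_token_ids: vocabulary indices standing in for the target concept.
    """
    # Project every head output into vocabulary space.
    logits = head_outputs @ unembed  # (n_samples, n_heads, vocab_size)
    # Average logit on the concept tokens, relative to the whole vocabulary.
    concept = logits[..., concept_token_ids].mean(axis=-1)
    baseline = logits.mean(axis=-1)
    # Aggregate over samples so the score reflects many inputs, not one.
    scores = (concept - baseline).mean(axis=0)  # (n_heads,)
    order = np.argsort(-scores)  # head indices, most relevant first
    return order, scores


# Synthetic data (hypothetical): head 2 is built to write toward the concept.
rng = np.random.default_rng(0)
n_samples, n_heads, d_model, vocab_size = 32, 8, 16, 50
unembed = rng.normal(size=(d_model, vocab_size))
concept_token_ids = [3, 7, 11]
head_outputs = 0.1 * rng.normal(size=(n_samples, n_heads, d_model))
head_outputs[:, 2, :] += unembed[:, concept_token_ids].mean(axis=1)

order, scores = rank_heads(head_outputs, unembed, concept_token_ids)
# Keep roughly 1% of the heads, with a floor of one head for tiny models.
top_k = order[: max(1, n_heads // 100)]
```

In a real model the selected heads in `top_k` would be the candidates for editing, for example by scaling their outputs down to suppress a concept or up to enhance it. How the paper performs that edit is not described in the abstract.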
Related papers
- VersaViT: Enhancing MLLM Vision Backbones via Task-Guided Optimization [87.26383908243878]
We show that vision encoders within Multimodal Large Language Models exhibit deficiencies in their dense feature representations. We propose VersaViT, a well-rounded vision transformer that instantiates a novel multi-task framework for collaborative post-training.
arXiv Detail & Related papers (2026-02-10T16:08:19Z)
- Towards Understanding Multimodal Fine-Tuning: Spatial Features [25.349396112139214]
Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model. We present the first mechanistic analysis of VLM adaptation using stage-wise model diffing.
arXiv Detail & Related papers (2026-02-06T18:48:18Z)
- UniPixel: Unified Object Referring and Segmentation for Pixel-Level Visual Reasoning [83.68366772745689]
We propose UniPixel, a large multi-modal model capable of flexibly comprehending visual prompt inputs and generating mask-grounded responses. Specifically, UniPixel processes visual prompts and generates relevant masks on demand, and performs subsequent reasoning conditioning on these intermediate pointers during inference. The effectiveness of our approach has been verified on 10 benchmarks across a diverse set of tasks, including pixel-level referring/segmentation and object-centric understanding in images/videos.
arXiv Detail & Related papers (2025-09-22T17:59:40Z)
- Perception Encoder: The best visual embeddings are not at the output of the network [70.86738083862099]
We introduce Perception Encoder (PE), a vision encoder for image and video understanding trained via simple vision-language learning. We find that contrastive vision-language training alone can produce strong, general embeddings for all of these downstream tasks. Together, our PE family of models achieves best-in-class results on a wide variety of tasks.
arXiv Detail & Related papers (2025-04-17T17:59:57Z)
- Multi-Faceted Multimodal Monosemanticity [42.64636740703632]
We take a data-driven approach to analyze interpretable, monosemantic features extracted from deep multimodal models. Specifically, we investigate CLIP, a prominent visual-language representation model trained on massive image-text pairs. We develop a set of multi-modal interpretability tools and measures designed to disentangle and analyze features learned from CLIP.
arXiv Detail & Related papers (2025-02-16T14:51:07Z)
- The Geometry of Prompting: Unveiling Distinct Mechanisms of Task Adaptation in Language Models [40.128112851978116]
We study how different prompting methods affect the geometry of representations in language models. Our analysis highlights the critical role of input distribution samples and label semantics in few-shot in-context learning. Our work contributes to the theoretical understanding of large language models and lays the groundwork for developing more effective, representation-aware prompting strategies.
arXiv Detail & Related papers (2025-02-11T23:09:50Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- Unveiling Visual Perception in Language Models: An Attention Head Analysis Approach [33.20992355312175]
Recent advancements in Multimodal Large Language Models (MLLMs) have demonstrated remarkable progress in visual understanding. This paper aims to address this question with systematic investigation across 4 model families and 4 model scales. Our analysis reveals a strong correlation between the behavior of these attention heads, the distribution of attention weights, and their concentration on visual tokens within the input.
arXiv Detail & Related papers (2024-12-24T02:31:24Z)
- Picking the Underused Heads: A Network Pruning Perspective of Attention Head Selection for Fusing Dialogue Coreference Information [50.41829484199252]
Transformer-based models with the multi-head self-attention mechanism are widely used in natural language processing.
We investigate the attention head selection and manipulation strategy for feature injection from a network pruning perspective.
arXiv Detail & Related papers (2023-12-15T05:27:24Z)
- Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? [0.06299766708197882]
We create a new task targeted at evaluating understanding of predicate-noun dependencies in a controlled setup.
We evaluate a range of state-of-the-art models and find that their performance on the task varies considerably.
This study highlights that targeted and controlled evaluations are a crucial step for a precise and rigorous test of the multimodal knowledge of vision-and-language models.
arXiv Detail & Related papers (2022-10-21T16:07:00Z)
- Analyzing the Limits of Self-Supervision in Handling Bias in Language [52.26068057260399]
We evaluate how well language models capture the semantics of four tasks for bias: diagnosis, identification, extraction and rephrasing.
Our analyses indicate that language models are capable of performing these tasks to widely varying degrees across different bias dimensions, such as gender and political affiliation.
arXiv Detail & Related papers (2021-12-16T05:36:08Z)
- Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models [5.866941279460248]
We propose a visual analytics approach to understanding fine-tuning in attention-based language models.
Our visualization, Attention Flows, is designed to support users in querying, tracing, and comparing attention within layers, across layers, and amongst attention heads in Transformer-based language models.
arXiv Detail & Related papers (2020-09-03T19:56:30Z)
- Behind the Scene: Revealing the Secrets of Pre-trained Vision-and-Language Models [65.19308052012858]
Recent Transformer-based large-scale pre-trained models have revolutionized vision-and-language (V+L) research.
We present VALUE, a set of meticulously designed probing tasks to decipher the inner workings of multimodal pre-training.
Key observations: Pre-trained models exhibit a propensity for attending over text rather than images during inference.
arXiv Detail & Related papers (2020-05-15T01:06:54Z)