Towards Understanding Multimodal Fine-Tuning: Spatial Features
- URL: http://arxiv.org/abs/2602.08713v1
- Date: Fri, 06 Feb 2026 18:48:18 GMT
- Title: Towards Understanding Multimodal Fine-Tuning: Spatial Features
- Authors: Lachin Naghashyar, Hunar Batra, Ashkan Khakzar, Philip Torr, Ronald Clark, Christian Schroeder de Witt, Constantin Venhoff,
- Abstract summary: Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model.<n>We present the first mechanistic analysis of VLM adaptation using stage-wise model diffing.
- Score: 25.349396112139214
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Contemporary Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model, fine-tuned for visual-text inputs. Yet despite these gains, it remains unclear how language backbone representations adapt during multimodal training and when vision-specific capabilities emerge. In this work, we present the first mechanistic analysis of VLM adaptation. Using stage-wise model diffing, a technique that isolates representational changes introduced during multimodal fine-tuning, we reveal how a language model learns to "see". We first identify vision-preferring features that emerge or reorient during fine-tuning. We then show that a selective subset of these features reliably encodes spatial relations, revealed through controlled shifts to spatial prompts. Finally, we trace the causal activation of these features to a small group of attention heads. Our findings show that stage-wise model diffing reveals when and where spatially grounded multimodal features arise. It also provides a clearer view of modality fusion by showing how visual grounding reshapes features that were previously text-only. This methodology enhances the interpretability of multimodal training and provides a foundation for understanding and refining how pretrained language models acquire vision-grounded capabilities.
Related papers
- Beyond Language Modeling: An Exploration of Multimodal Pretraining [125.34714978184638]
We provide empirical clarity through controlled, from-scratch pretraining experiments.<n>We adopt the Transfusion framework, using next-token prediction for language and diffusion for vision.<n>We demonstrate that the MoE architecture harmonizes this scaling asymmetry by providing the high model capacity required by language.
arXiv Detail & Related papers (2026-03-03T18:58:00Z) - Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs)<n>This paper presents a comprehensive analysis of attention patterns in efficient VLMs.<n>We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z) - How Visual Representations Map to Language Feature Space in Multimodal LLMs [9.880509106657009]
We study the mechanism by which vision-language models (VLMs) achieve alignment of visual and linguistic representations.<n>By keeping the language model frozen, we ensure it maintains its original language representations without adaptation to visual data.<n>We reveal the layer-wise progression through which visual representations gradually align with language feature representations, converging in middle-to-later layers.
arXiv Detail & Related papers (2025-06-13T17:34:05Z) - Perceiving Beyond Language Priors: Enhancing Visual Comprehension and Attention in Multimodal Models [1.9253106218929117]
Multimodal Large Language Models (MLLMs) often fail to fully leverage visual input, defaulting to strong language priors.<n>Our approach first provides insights into how MLLMs internally build visual understanding of image regions and then introduces techniques to amplify this capability.<n>We demonstrate the superior multimodal understanding of our resultant model through a detailed upstream analysis quantifying its ability to predict visually-dependent tokens as well as 10 pt boost on visually challenging tasks.
arXiv Detail & Related papers (2025-05-08T20:04:27Z) - Multi-Faceted Multimodal Monosemanticity [42.64636740703632]
We take a data-driven approach to analyze interpretable, monosemantic features extracted from deep multimodal models.<n>Specifically, we investigate CLIP, a prominent visual-language representation model trained on massive image-text pairs.<n>We develop a set of multi-modal interpretability tools and measures designed to disentangle and analyze features learned from CLIP.
arXiv Detail & Related papers (2025-02-16T14:51:07Z) - Analyzing Finetuning Representation Shift for Multimodal LLMs Steering [56.710375516257876]
We propose to map hidden states to interpretable visual and textual concepts.<n>This enables us to more efficiently compare certain semantic dynamics, such as the shift from an original and fine-tuned model.<n>We also demonstrate the use of shift vectors to capture these concepts changes.
arXiv Detail & Related papers (2025-01-06T13:37:13Z) - Improving In-Context Learning in Diffusion Models with Visual
Context-Modulated Prompts [83.03471704115786]
We introduce improved Prompt Diffusion (iPromptDiff) in this study.
iPromptDiff integrates an end-to-end trained vision encoder that converts visual context into an embedding vector.
We show that a diffusion-based vision foundation model, when equipped with this visual context-modulated text guidance and a standard ControlNet structure, exhibits versatility and robustness across a variety of training tasks.
arXiv Detail & Related papers (2023-12-03T14:15:52Z) - Expedited Training of Visual Conditioned Language Generation via
Redundancy Reduction [61.16125290912494]
$textEVL_textGen$ is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z) - SINC: Self-Supervised In-Context Learning for Vision-Language Tasks [64.44336003123102]
We propose a framework to enable in-context learning in large language models.
A meta-model can learn on self-supervised prompts consisting of tailored demonstrations.
Experiments show that SINC outperforms gradient-based methods in various vision-language tasks.
arXiv Detail & Related papers (2023-07-15T08:33:08Z) - Localization vs. Semantics: Visual Representations in Unimodal and
Multimodal Models [57.08925810659545]
We conduct a comparative analysis of the visual representations in existing vision-and-language models and vision-only models.
Our empirical observations suggest that vision-and-language models are better at label prediction tasks.
We hope our study sheds light on the role of language in visual learning, and serves as an empirical guide for various pretrained models.
arXiv Detail & Related papers (2022-12-01T05:00:18Z) - DiMBERT: Learning Vision-Language Grounded Representations with
Disentangled Multimodal-Attention [101.99313208598569]
Vision-and-language (V-L) tasks require the system to understand both vision content and natural language.
We propose DiMBERT (short for Disentangled Multimodal-Attention BERT), which applies separated attention spaces for vision and language.
We show that DiMBERT sets new state-of-the-art performance on three tasks.
arXiv Detail & Related papers (2022-10-28T23:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.