Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
- URL: http://arxiv.org/abs/2601.08151v1
- Date: Tue, 13 Jan 2026 02:26:21 GMT
- Title: Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
- Authors: Shezheng Song, Shasha Li, Jie Yu,
- Abstract summary: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding. We perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. We introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts.
- Score: 7.511262066889113
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal Large Language Models (MLLMs) have achieved remarkable progress in vision-language understanding, yet how they internally integrate visual and textual information remains poorly understood. To bridge this gap, we perform a systematic layer-wise masking analysis across multiple architectures, revealing how visual-text fusion evolves within MLLMs. The results show that fusion emerges at several specific layers rather than being uniformly distributed across the network, and certain models exhibit a late-stage "review" phenomenon where visual signals are reactivated before output generation. In addition, we analyze layer-wise attention evolution and observe persistent high-attention noise on irrelevant regions, along with gradually increasing attention on text-aligned areas. Guided by these insights, we introduce a training-free contrastive attention framework that models the transformation between early fusion and final layers to highlight meaningful attention shifts. Extensive experiments across various MLLMs and benchmarks validate our analysis and demonstrate that the proposed approach improves multimodal reasoning performance. Code will be released.
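The abstract does not spell out the exact form of the contrastive operation, so the following is a minimal sketch of the general idea only: compare a query token's attention over visual tokens at an early fusion layer against the final layer, and up-weight the regions whose attention grows with depth while suppressing persistent noise. The log-ratio contrast, the function name `contrastive_visual_attention`, the scaling factor `alpha`, and the toy data are all hypothetical illustrations, not the authors' released code.

```python
# Sketch of a training-free contrastive attention reweighting between an early
# fusion layer and the final layer (assumed log-ratio formulation, not the paper's
# exact method).
import numpy as np

def contrastive_visual_attention(attn_early, attn_final, alpha=1.0, eps=1e-8):
    """Highlight attention shifts between an early fusion layer and the final layer.

    attn_early, attn_final: arrays of shape (num_visual_tokens,), attention weights
    from a text/query token to the visual tokens at the two layers (each sums to ~1).
    Returns a renormalized distribution that up-weights regions whose attention
    increased with depth and suppresses persistent high-attention noise.
    """
    attn_early = np.asarray(attn_early, dtype=np.float64)
    attn_final = np.asarray(attn_final, dtype=np.float64)

    # Log-ratio contrast: positive where attention grew from the early to the final
    # layer, non-positive where it merely persisted or decayed (likely noise).
    contrast = np.log(attn_final + eps) - np.log(attn_early + eps)

    # Reweight the final-layer attention by the exponentiated contrast and renormalize.
    refined = attn_final * np.exp(alpha * contrast)
    return refined / refined.sum()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy example with 16 visual tokens: token 5 gains attention with depth,
    # the rest stay roughly flat.
    early = rng.dirichlet(np.ones(16))
    final = early.copy()
    final[5] += 0.3          # text-aligned region that emerges in later layers
    final /= final.sum()

    refined = contrastive_visual_attention(early, final)
    print("token 5 before/after:", round(final[5], 3), round(refined[5], 3))
```

In this toy run the token whose attention increases across depth receives an even larger share after contrasting, which is the qualitative behavior the abstract describes (amplifying meaningful attention shifts rather than raw final-layer attention).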
Related papers
- Stateful Cross-layer Vision Modulation [19.730096071316876]
Multimodal large language models (MLLMs) widely adopt multi-layer visual feature fusion to enhance visual representation. Existing approaches typically perform static concatenation or weighted aggregation after visual encoding, without intervening in the representation formation process itself. We propose a cross-layer memory-modulated vision framework (SCVM) to address these limitations.
arXiv Detail & Related papers (2026-02-28T13:57:19Z)
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- Multimodal Information Fusion for Chart Understanding: A Survey of MLLMs -- Evolution, Limitations, and Cognitive Enhancement [25.08967298618286]
Multimodal Large Language Models (MLLMs) are transforming chart information fusion. This survey aims to equip researchers and practitioners with a structured understanding of how MLLMs are transforming chart information fusion.
arXiv Detail & Related papers (2026-02-08T12:59:50Z)
- Towards Understanding Multimodal Fine-Tuning: Spatial Features [25.349396112139214]
Vision-Language Models (VLMs) achieve strong performance on a wide range of tasks by pairing a vision encoder with a pre-trained language model. We present the first mechanistic analysis of VLM adaptation using stage-wise model diffing.
arXiv Detail & Related papers (2026-02-06T18:48:18Z)
- PENDULUM: A Benchmark for Assessing Sycophancy in Multimodal Large Language Models [43.767942065379366]
Sycophancy is a tendency of AI models to agree with user input at the expense of factual accuracy or in contradiction of visual evidence. We introduce a comprehensive evaluation benchmark, PENDULUM, comprising approximately 2,000 human-curated Visual Question Answering pairs. We observe substantial variability in model robustness and a pronounced susceptibility to sycophantic and hallucinatory behavior.
arXiv Detail & Related papers (2025-12-22T12:49:12Z)
- Unleashing the Intrinsic Visual Representation Capability of Multimodal Large Language Models [58.91911788912665]
We propose Latent Visual Reconstruction (LaVer), a novel training framework that facilitates MLLMs in learning more discriminative visual representations. Our method offers direct visual activation to MLLMs, which exhibit increased visual attention allocation, indicating enhanced utilization of visual information.
arXiv Detail & Related papers (2025-12-06T04:20:13Z)
- Attention Guided Alignment in Efficient Vision-Language Models [56.20286899428444]
Large Vision-Language Models (VLMs) rely on effective multimodal alignment between pre-trained vision encoders and Large Language Models (LLMs). This paper presents a comprehensive analysis of attention patterns in efficient VLMs. We introduce Attention-Guided Efficient Vision-Language Models (AGE-VLM), a novel framework that enhances visual grounding through interleaved cross-attention layers.
arXiv Detail & Related papers (2025-11-21T21:36:48Z)
- Multimodal Language Models See Better When They Look Shallower [54.5303326937134]
Multimodal large language models (MLLMs) typically extract visual features from the final layers of a pretrained Vision Transformer (ViT). We present the first comprehensive study of visual layer selection for MLLMs, analyzing representation similarity across ViT layers. We find that while deep layers excel in semantic-rich tasks like OCR, shallow and middle layers significantly outperform them on fine-grained visual tasks.
arXiv Detail & Related papers (2025-04-30T09:07:10Z)
- Cross-modal Information Flow in Multimodal Large Language Models [14.853197288189579]
We investigate the information flow between different modalities -- language and vision -- in multimodal large language models (MLLMs). We find that there are two distinct stages in the process of integrating the two modalities. Our findings provide a new and comprehensive perspective on the spatial and functional aspects of image and language processing in MLLMs.
arXiv Detail & Related papers (2024-11-27T18:59:26Z)
- Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders [89.41055673919895]
This study explores the design space for MLLMs using a mixture of vision encoders and resolutions. We discover that simply concatenating visual tokens from a set of complementary vision encoders is as effective as more complex mixing architectures or strategies. The resulting family of MLLMs, Eagle, surpasses other leading open-source models on major MLLM benchmarks.
arXiv Detail & Related papers (2024-08-28T17:59:31Z)
- From Redundancy to Relevance: Information Flow in LVLMs Across Reasoning Tasks [33.476693301050275]
We conduct experiments with truncation strategies across various LVLMs for visual question answering and image captioning tasks.
By exploring the information flow from the perspective of visual representation contribution, we observe that it tends to converge in shallow layers but diversify in deeper layers.
arXiv Detail & Related papers (2024-06-04T13:52:54Z)
- Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models [121.83413400686139]
This paper proposes to improve the visual perception ability of MLLMs through a mixture-of-experts knowledge enhancement mechanism.
We introduce a novel method that incorporates multi-task encoders and visual tools into the existing MLLMs training and inference pipeline.
arXiv Detail & Related papers (2024-01-06T02:02:34Z)