Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
- URL: http://arxiv.org/abs/2505.05464v1
- Date: Thu, 08 May 2025 17:56:23 GMT
- Title: Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
- Authors: Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
- Abstract summary: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers.
- Score: 32.70038648928894
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities combine and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects the parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanisms of perception and reasoning and how merging affects them. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
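As a concrete illustration of training-free merging, here is a minimal sketch that linearly interpolates a VLM's language-backbone parameters with those of a reasoning LLM, assuming both derive from the same base architecture. The function name and the coefficient `alpha` are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal sketch of training-free cross-modal model merging: linearly
# interpolate matching parameters of a VLM's language backbone and a
# reasoning-tuned LLM that share the same architecture. Hypothetical setup.
import torch

def merge_state_dicts(vlm_lm_sd, llm_sd, alpha=0.5):
    """Return (1 - alpha) * VLM weights + alpha * LLM weights, per parameter."""
    merged = {}
    for name, vlm_param in vlm_lm_sd.items():
        if name in llm_sd and llm_sd[name].shape == vlm_param.shape:
            merged[name] = (1 - alpha) * vlm_param + alpha * llm_sd[name].to(vlm_param.dtype)
        else:
            # Parameters without an LLM counterpart (e.g., the vision tower
            # or the projector) are kept from the VLM unchanged.
            merged[name] = vlm_param
    return merged
```

Sweeping `alpha` trades off the VLM's original behavior (alpha = 0) against the reasoning LLM's weights (alpha = 1); a layer-wise variant of this interpolation would support the kind of per-layer analysis described in the abstract.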
Related papers
- Mechanistic Indicators of Understanding in Large Language Models [2.752171077382186]
We argue that Large Language Models (LLMs) develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. First, conceptual understanding emerges when a model forms "features" as directions in latent space, learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a "circuit" connecting these facts.
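To make "features as directions in latent space" concrete, here is a toy sketch of reading off a feature's strength as a projection; the direction below is a random stand-in, not a feature learned from a real model.

```python
# Toy sketch: if a feature is a direction in latent space, the degree to
# which an activation expresses it is the activation's scalar projection
# onto that (unit-norm) direction. The vectors here are random stand-ins.
import torch

d_model = 4096
hidden_state = torch.randn(d_model)            # an activation from some layer
feature_direction = torch.randn(d_model)
feature_direction /= feature_direction.norm()  # normalize to unit length

feature_strength = hidden_state @ feature_direction  # scalar projection
```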
arXiv Detail & Related papers (2025-07-07T20:26:31Z)
- From Black Boxes to Transparent Minds: Evaluating and Enhancing the Theory of Mind in Multimodal Large Language Models [17.235722538085263]
This study adopts an approach based on internal mechanisms to provide an interpretability-driven assessment of Theory of Mind (ToM) in multimodal large language models (MLLMs). We first construct a multimodal ToM test dataset, GridToM, which incorporates diverse belief-testing tasks and perceptual information from multiple perspectives. Next, our analysis shows that attention heads in multimodal large models can distinguish cognitive information across perspectives, providing evidence of ToM capabilities.
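One hedged way to test whether a single attention head separates perspective-dependent information is a linear probe on its outputs; shapes, labels, and the training loop below are illustrative assumptions, not GridToM's actual protocol.

```python
# Sketch: train a linear probe on one attention head's outputs to see
# whether it linearly separates two perspectives (hypothetical data).
import torch
import torch.nn as nn

n_samples, d_head = 256, 128
head_outputs = torch.randn(n_samples, d_head)  # one head's output per example
labels = torch.randint(0, 2, (n_samples,))     # e.g., agent's vs. observer's belief

probe = nn.Linear(d_head, 2)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
for _ in range(200):
    optimizer.zero_grad()
    loss_fn(probe(head_outputs), labels).backward()
    optimizer.step()

accuracy = (probe(head_outputs).argmax(-1) == labels).float().mean()
```

High probe accuracy on held-out data (the sketch omits a train/test split for brevity) would indicate the head linearly encodes perspective information.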
arXiv Detail & Related papers (2025-06-17T06:27:42Z)
- From Generation to Generalization: Emergent Few-Shot Learning in Video Diffusion Models [65.0487600936788]
Video Diffusion Models (VDMs) have emerged as powerful generative tools capable of synthesizing high-quality content. We argue that VDMs naturally acquire structured representations and an implicit understanding of the visual world. Our method transforms each task into a visual transition, enabling the training of LoRA weights on short input sequences.
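As a reference point for the adaptation mechanism mentioned above, here is a minimal LoRA sketch: a frozen base linear layer plus trainable low-rank factors. The rank and scaling values are illustrative assumptions, not this paper's configuration.

```python
# Minimal LoRA sketch: keep the pretrained weight frozen and learn a
# low-rank update B @ A, scaled by alpha / rank. Hypothetical hyperparameters.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze the pretrained weight
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: no change at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```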
arXiv Detail & Related papers (2025-06-08T20:52:34Z)
- Embodied-R: Collaborative Framework for Activating Embodied Spatial Reasoning in Foundation Models via Reinforcement Learning [58.86928947970342]
Embodied-R is a framework combining large-scale Vision-Language Models for perception and small-scale Language Models for reasoning. After training on only 5k embodied video samples, Embodied-R with a 3B LM matches state-of-the-art multimodal reasoning models. Embodied-R also exhibits emergent thinking patterns such as systematic analysis and contextual integration.
arXiv Detail & Related papers (2025-04-17T06:16:11Z)
- A Survey on Mechanistic Interpretability for Multi-Modal Foundation Models [74.48084001058672]
The rise of foundation models has transformed machine learning research. Multimodal foundation models (MMFMs) pose unique interpretability challenges beyond unimodal frameworks. This survey explores two key aspects: (1) the adaptation of LLM interpretability methods to multimodal models and (2) understanding the mechanistic differences between unimodal language models and cross-modal systems.
arXiv Detail & Related papers (2025-02-22T20:55:26Z)
- Analyzing Fine-tuning Representation Shift for Multimodal LLMs Steering alignment [53.90425382758605]
We show how fine-tuning alters the internal structure of a model to specialize in new multimodal tasks. Our work sheds light on how multimodal representations evolve through fine-tuning and offers a new perspective for interpreting model adaptation in multimodal tasks.
arXiv Detail & Related papers (2025-01-06T13:37:13Z)
- Large Multi-modal Models Can Interpret Features in Large Multi-modal Models [45.509307983813336]
We first apply a Sparse Autoencoder (SAE) to disentangle the representations into human-understandable features.
We then present an automatic interpretation framework to interpret the open-semantic features learned in the SAE by the LMMs themselves.
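For reference, here is a minimal sparse-autoencoder sketch of the kind used to disentangle activations into features; the dimensions and sparsity penalty are illustrative assumptions.

```python
# Minimal SAE sketch: an overcomplete encoder with ReLU features and an
# L1 sparsity penalty on feature activations. Hypothetical dimensions.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=4096, d_features=16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, activations):
        features = torch.relu(self.encoder(activations))  # non-negative, sparse
        return self.decoder(features), features

def sae_loss(reconstruction, activations, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 term that encourages sparse features.
    mse = (reconstruction - activations).pow(2).mean()
    return mse + l1_coeff * features.abs().mean()
```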
arXiv Detail & Related papers (2024-11-22T14:41:36Z)
- Unconstrained Model Merging for Enhanced LLM Reasoning [42.079040543428036]
We explore the potential of merging multiple expert models into a single large language model.
We propose an unconstrained model merging framework that accommodates both homogeneous and heterogeneous model architectures.
Across 7 benchmarks and 9 reasoning-optimized LLMs, we reveal the key finding that reasoning capabilities emerge from merging.
arXiv Detail & Related papers (2024-10-17T16:04:07Z)
- Fact: Teaching MLLMs with Faithful, Concise and Transferable Rationales [102.54274021830207]
We introduce Fact, a novel paradigm designed to generate multimodal rationales that are faithful, concise, and transferable for teaching MLLMs.
We filter for rationales that can be transferred from programming paradigms to end-to-end paradigms, guaranteeing transferability.
Our approach also reduces hallucinations owing to its high correlation between images and text.
arXiv Detail & Related papers (2024-04-17T07:20:56Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual words, which maps visual features to probability distributions over the Large Multi-modal Model's vocabulary.
We further explore the distribution of visual features in the semantic space within the LMM and the possibility of using text embeddings to represent visual information.
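A hedged sketch of the core mapping: project visual features into the language model's hidden space and pass them through its output head to obtain distributions over the vocabulary. All dimensions and layers below are assumptions, not the paper's exact design.

```python
# Sketch: turn visual patch features into probability distributions over
# an LMM's vocabulary via a projection and the LM's output head.
import torch
import torch.nn as nn

d_visual, d_model, vocab_size = 1024, 4096, 32000   # hypothetical sizes
project = nn.Linear(d_visual, d_model)    # align vision features to LM space
lm_head = nn.Linear(d_model, vocab_size)  # stands in for the LM's output head

visual_features = torch.randn(1, 576, d_visual)     # e.g., ViT patch features
logits = lm_head(project(visual_features))
visual_words = torch.softmax(logits, dim=-1)        # distribution per patch
```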
arXiv Detail & Related papers (2024-03-12T14:58:52Z)