What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
- URL: http://arxiv.org/abs/2510.01719v2
- Date: Tue, 07 Oct 2025 15:17:04 GMT
- Title: What MLLMs Learn about When they Learn about Multimodal Reasoning: Perception, Reasoning, or their Integration?
- Authors: Jiwan Chung, Neel Joshi, Pratyusha Sharma, Youngjae Yu, Vibhav Vineet
- Abstract summary: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry.
We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning.
- Score: 46.836858357488296
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multimodal reasoning models have recently shown promise on challenging domains such as olympiad-level geometry, yet their evaluation remains dominated by aggregate accuracy, a single score that obscures where and how models are improving. We introduce MathLens, a benchmark designed to disentangle the subskills of multimodal reasoning while preserving the complexity of textbook-style geometry problems. The benchmark separates performance into three components: Perception: extracting information from raw inputs, Reasoning: operating on available information, and Integration: selecting relevant perceptual evidence and applying it within reasoning. To support each test, we provide annotations: visual diagrams, textual descriptions to evaluate reasoning in isolation, controlled questions that require both modalities, and probes for fine-grained perceptual skills, all derived from symbolic specifications of the problems to ensure consistency and robustness. Our analysis reveals that different training approaches have uneven effects: First, reinforcement learning chiefly strengthens perception, especially when supported by textual supervision, while textual SFT indirectly improves perception through reflective reasoning. Second, reasoning improves only in tandem with perception. Third, integration remains the weakest capacity, with residual errors concentrated there once other skills advance. Finally, robustness diverges: RL improves consistency under diagram variation, whereas multimodal SFT reduces it through overfitting. We will release all data and experimental logs.
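The three-way split described in the abstract lends itself to a simple evaluation harness. Below is a minimal sketch, assuming a hypothetical `model.ask(image, prompt)` interface and illustrative field names; MathLens's actual schema is derived from symbolic problem specifications and may differ.

```python
# Minimal sketch of a three-way subskill evaluation in the spirit of MathLens.
# The item fields, probe format, and model.ask interface are illustrative
# assumptions, not the benchmark's actual schema.
from dataclasses import dataclass, field

@dataclass
class GeometryItem:
    diagram: str           # path to the rendered diagram
    text_description: str  # textual restatement (reasoning-only test)
    question: str          # controlled question requiring both modalities
    answer: str
    perception_probes: list = field(default_factory=list)  # [(question, answer)]

def evaluate_subskills(model, items):
    """Report perception, reasoning, and integration separately
    instead of one aggregate accuracy."""
    totals = {"perception": 0.0, "reasoning": 0.0, "integration": 0.0}
    for item in items:
        # Perception: fine-grained probes about the raw diagram.
        hits = [model.ask(image=item.diagram, prompt=q) == a
                for q, a in item.perception_probes]
        totals["perception"] += sum(hits) / max(1, len(hits))
        # Reasoning: solve from the textual description alone, no image.
        text_prompt = item.text_description + "\n" + item.question
        totals["reasoning"] += model.ask(image=None, prompt=text_prompt) == item.answer
        # Integration: select and apply perceptual evidence within reasoning.
        totals["integration"] += model.ask(image=item.diagram, prompt=item.question) == item.answer
    return {k: v / len(items) for k, v in totals.items()}
```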
Related papers
- Can Unified Generation and Understanding Models Maintain Semantic Equivalence Across Different Output Modalities? [61.533560295383786]
Unified Multimodal Large Language Models (U-MLLMs) integrate understanding and generation within a single architecture.
We observe that U-MLLMs fail to maintain semantic equivalence when required to render the same results in the image modality.
We introduce VGUBench, a framework to decouple reasoning logic from generation fidelity.
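A cross-modal equivalence check of this kind can be sketched as follows; the `umllm` methods and prompt wording are hypothetical placeholders, not VGUBench's actual protocol.

```python
# Minimal sketch of testing semantic equivalence across output modalities.
# All umllm.* methods are assumed interfaces for illustration.
def semantically_equivalent(umllm, task: str) -> bool:
    text_answer = umllm.generate_text(task)    # answer rendered as text
    answer_image = umllm.generate_image(task)  # same answer rendered as an image
    # Read the generated image back into text, then judge equivalence,
    # decoupling reasoning correctness from generation fidelity.
    readback = umllm.describe_image(answer_image)
    verdict = umllm.generate_text(
        f"Do A and B state the same result? Answer yes or no.\n"
        f"A: {text_answer}\nB: {readback}")
    return verdict.strip().lower().startswith("yes")
```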
arXiv Detail & Related papers (2026-02-27T06:23:56Z) - CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation [6.356820150960838]
We introduce two complementary approaches inspired by test-time scaling to stabilize vision-language models.
CASHEW is an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces.
CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence.
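The aggregation idea can be sketched as an inference-time loop; the prompts and hyperparameters below are illustrative, and the paper's GSPO training stage is not reproduced.

```python
# Minimal sketch of iterative trajectory aggregation (CASHEW-style inference).
# model.generate is an assumed interface; prompts are illustrative.
def aggregate_trajectories(model, image, question, n_samples=8, n_rounds=2):
    traces = [model.generate(image, question, temperature=1.0)
              for _ in range(n_samples)]
    fused = traces[0]
    for _ in range(n_rounds):
        # Fuse candidates into one higher-quality trace, keeping only steps
        # grounded in the visual evidence.
        fuse_prompt = (question
                       + "\n\nCandidate solutions:\n" + "\n---\n".join(traces)
                       + "\n\nMerge these into a single corrected solution.")
        fused = model.generate(image, fuse_prompt, temperature=0.3)
        # Re-seed the next round with the fused trace plus fresh samples.
        traces = [fused] + [model.generate(image, question, temperature=1.0)
                            for _ in range(n_samples - 1)]
    return fused
```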
arXiv Detail & Related papers (2026-01-12T21:24:45Z) - Ground What You See: Hallucination-Resistant MLLMs via Caption Feedback, Diversity-Aware Sampling, and Conflict Regularization [38.469173375694076]
This paper systematically analyzes the root causes of hallucinations in Multimodal Large Language Models (MLLMs).
It identifies three critical factors: (1) an over-reliance on chained visual reasoning, where inaccurate initial descriptions anchor subsequent inferences to incorrect premises; (2) insufficient exploration diversity during policy optimization, leading the model to generate overly confident but erroneous outputs; and (3) destructive conflicts between training samples, where NTK similarity causes false associations and unstable parameter updates.
Experimental results demonstrate that our proposed method significantly reduces hallucination rates and effectively enhances the inference accuracy of MLLMs.
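Of the three fixes named in the title, diversity-aware sampling is the easiest to sketch: reject rollouts that are near-duplicates of ones already kept. The overlap measure and threshold below are illustrative stand-ins, not the paper's method.

```python
# Minimal sketch of diversity-aware sampling for policy optimization:
# keep only rollouts that differ enough from those already collected.
def diverse_rollouts(model, prompt, n_draws=16, n_keep=8, max_overlap=0.7):
    def token_overlap(a: str, b: str) -> float:
        ta, tb = set(a.split()), set(b.split())
        return len(ta & tb) / max(1, min(len(ta), len(tb)))

    kept = []
    for _ in range(n_draws):
        candidate = model.generate(prompt, temperature=1.0)
        if all(token_overlap(candidate, k) <= max_overlap for k in kept):
            kept.append(candidate)
        if len(kept) == n_keep:
            break
    return kept
```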
arXiv Detail & Related papers (2026-01-09T07:59:18Z) - Beyond Generation: Multi-Hop Reasoning for Factual Accuracy in Vision-Language Models [0.0]
Visual Language Models (VLMs) are powerful generative tools but often produce factually inaccurate outputs.
This work introduces a framework for knowledge-guided reasoning in VLMs, leveraging structured knowledge graphs for multi-hop verification.
We evaluate the framework using hierarchical, triple-based, and bullet-point based knowledge representations, analyzing their effectiveness in factual accuracy and logical inference.
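The triple-based variant can be sketched as a graph lookup with bounded multi-hop search; relation semantics are ignored here for brevity, which the paper's richer representations would not do.

```python
# Minimal sketch of triple-based claim verification against a knowledge graph.
def verify_claims(claims, knowledge_graph, max_hops=2):
    """Check each (subject, relation, object) claim against the KG,
    allowing multi-hop paths up to max_hops."""
    edges = {}  # subject -> list of (relation, object)
    for s, r, o in knowledge_graph:
        edges.setdefault(s, []).append((r, o))

    def reachable(src, dst, hops):
        if hops == 0:
            return False
        for _, nxt in edges.get(src, []):
            if nxt == dst or reachable(nxt, dst, hops - 1):
                return True
        return False

    results = {}
    for s, r, o in claims:
        direct = (r, o) in edges.get(s, [])
        results[(s, r, o)] = direct or reachable(s, o, max_hops)
    return results

kg = [("eiffel_tower", "located_in", "paris"), ("paris", "capital_of", "france")]
print(verify_claims([("eiffel_tower", "located_in", "france")], kg))
# {('eiffel_tower', 'located_in', 'france'): True}  -- via the 2-hop path
```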
arXiv Detail & Related papers (2025-11-25T17:34:32Z) - Compose and Fuse: Revisiting the Foundational Bottlenecks in Multimodal Reasoning [49.17801010041155]
Multimodal large language models (MLLMs) promise enhanced reasoning by integrating diverse inputs such as text, vision, and audio.
Yet cross-modal reasoning remains underexplored, with conflicting reports on whether added modalities help or harm performance.
We categorize multimodal reasoning into six interaction patterns, varying how facts are distributed across modalities and logically combined.
arXiv Detail & Related papers (2025-09-28T08:46:11Z) - VisioMath: Benchmarking Figure-based Mathematical Reasoning in LMMs [31.007061220012954]
We present VisioMath, a curated benchmark of 1,800 high-quality K-12 mathematics problems in which all candidate answers are diagrams with subtle visual similarities.
A comprehensive evaluation of state-of-the-art LMMs, covering both leading closed-source systems and widely adopted open-source models, reveals a consistent decline in accuracy as inter-image similarity increases.
We explore three alignment-oriented strategies, spanning training-free approaches and finetuning, to achieve substantial accuracy gains.
arXiv Detail & Related papers (2025-06-07T09:24:13Z) - Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning.
Our approach demonstrates significant performance gains across various multimodal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
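One simple way to exploit a difficulty prior during RL fine-tuning is to weight the sampling of training problems by it. The variance-style weighting below is an illustrative choice, not the paper's scheme.

```python
# Minimal sketch of difficulty-prior-weighted sampling for RL fine-tuning.
# Each problem carries a pass_rate estimated from prior rollouts.
import random

def difficulty_weighted_batch(problems, batch_size=32):
    # Items the policy always solves (pass_rate ~ 1) or never solves
    # (pass_rate ~ 0) yield little learning signal; weight the middle.
    weights = [p["pass_rate"] * (1.0 - p["pass_rate"]) + 1e-3 for p in problems]
    return random.choices(problems, weights=weights, k=batch_size)

batch = difficulty_weighted_batch(
    [{"id": i, "pass_rate": i / 10} for i in range(11)], batch_size=4)
```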
arXiv Detail & Related papers (2025-05-19T15:43:10Z) - Reason-RFT: Reinforcement Fine-Tuning for Visual Reasoning [19.28434717501445]
Visual reasoning abilities play a crucial role in understanding complex multimodal data.
Existing methods improve VLM reasoning via Chain-of-Thought supervised fine-tuning.
We propose Reason-RFT, a novel reinforcement fine-tuning framework.
arXiv Detail & Related papers (2025-03-26T17:38:06Z) - VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning [63.0285363282581]
Multimodal Large Language Models (MLLMs) have become a powerful tool for integrating visual and textual information.
We introduce VOILA, a benchmark designed to evaluate MLLMs' perceptual understanding and abstract relational reasoning.
We reveal that current MLLMs struggle to comprehend inter-image relationships and exhibit limited capabilities in high-level relational reasoning.
arXiv Detail & Related papers (2025-02-25T23:36:19Z) - STEP: Enhancing Video-LLMs' Compositional Reasoning by Spatio-Temporal Graph-guided Self-Training [87.58996020705258]
Video Large Language Models (Video-LLMs) have recently shown strong performance in basic video understanding tasks.
Video-LLMs struggle with compositional reasoning that requires multi-step spatio-temporal inference across object relations, interactions, and events.
We propose STEP, a novel graph-guided self-training method that enables Video-LLMs to generate reasoning-rich finetuning data from any raw videos to improve themselves.
arXiv Detail & Related papers (2024-11-29T11:54:55Z) - REX: Reasoning-aware and Grounded Explanation [30.392986232906107]
We develop a new type of multi-modal explanations that explain the decisions by traversing the reasoning process and grounding keywords in the images.
Second, we identify the critical need to tightly couple important components across the visual and textual modalities for explaining the decisions.
Third, we propose a novel explanation generation method that explicitly models the pairwise correspondence between words and regions of interest.
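Pairwise word-region correspondence can be sketched as a softmax over similarity scores between word and region embeddings; the embeddings below are random stand-ins, and REX's actual model couples this alignment with explanation generation.

```python
# Minimal sketch of word-region correspondence via normalized similarities.
import numpy as np

def word_region_alignment(word_embs: np.ndarray, region_embs: np.ndarray):
    """word_embs: (W, d); region_embs: (R, d). Returns a (W, R) matrix of
    alignment probabilities, one distribution over regions per word."""
    scores = word_embs @ region_embs.T           # raw dot-product similarity
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum(axis=1, keepdims=True)

# Each keyword in an explanation can be grounded in its highest-scoring region.
words, regions = np.random.randn(5, 64), np.random.randn(3, 64)
print(word_region_alignment(words, regions).argmax(axis=1))
```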
arXiv Detail & Related papers (2022-03-11T17:28:42Z)