Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
- URL: http://arxiv.org/abs/2511.17487v1
- Date: Fri, 21 Nov 2025 18:43:01 GMT
- Title: Downscaling Intelligence: Exploring Perception and Reasoning Bottlenecks in Small Multimodal Models
- Authors: Mark Endo, Serena Yeung-Levy
- Abstract summary: We study how reduced large language model (LLM) capacity affects multimodal capabilities. LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks.
- Score: 13.301879353093398
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling up multimodal models has enabled remarkable advances in visual understanding and reasoning, but practical demands call for smaller, efficient systems. In this work, we conduct a principled analysis of downscaling intelligence in multimodal models, examining how reduced large language model (LLM) capacity affects multimodal capabilities. Our initial findings reveal an interesting trend: LLM downscaling disproportionately affects visual capabilities, rather than abilities inherited from the LLM. We then examine whether this drop mainly reflects the expected decline in visual reasoning or a more fundamental loss of perceptual abilities. Isolating the effect of LLM downscaling on perception, we find performance still drops sharply, often matching or exceeding the impact on reasoning. To address this bottleneck, we introduce visual extraction tuning, which explicitly trains the model to extract instruction-relevant visual details consistently across tasks. With these extracted visual details, we then apply step-by-step reasoning to generate answers. Together, these components form our Extract+Think approach, setting a new standard for efficiency and performance in this space.
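The two-stage flow described in the abstract (extract instruction-relevant visual details, then reason step by step over them) can be sketched roughly as follows. This is only an illustrative outline, not the authors' implementation: the function name `extract_then_think`, the `Extractor` and `Reasoner` callables, and the prompt wording are all assumptions introduced here.

```python
from typing import Callable

# Hypothetical interfaces; neither name comes from the paper.
Extractor = Callable[[bytes, str], str]  # (image, instruction) -> visual details as text
Reasoner = Callable[[str], str]          # prompt -> answer text


def extract_then_think(image: bytes, instruction: str,
                       extract: Extractor, reason: Reasoner) -> str:
    """Run perception first, then text-only step-by-step reasoning."""
    # Stage 1: ask the perception model for instruction-relevant details.
    details = extract(image, instruction)
    # Stage 2: reason over the extracted text rather than the raw image.
    # The prompt template below is an assumption made for illustration.
    prompt = (
        f"Visual details:\n{details}\n\n"
        f"Question: {instruction}\n"
        "Think step by step, then give the final answer."
    )
    return reason(prompt)
```

In practice `extract` and `reason` would wrap model calls; keeping them as plain callables makes it possible to swap either stage independently.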
Related papers
- Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs [55.61018839017648]
Chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks. Existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. We propose SAYO, a visual reasoning model trained with a reinforcement learning framework that introduces a region-level visual attention-based reward.
arXiv Detail & Related papers (2026-02-09T03:33:23Z)
- Unleashing Perception-Time Scaling to Multimodal Reasoning Models [60.578179197783754]
Recent advances in inference-time scaling have substantially enhanced the reasoning capabilities of Large Vision-Language Models (LVLMs). Inspired by this success, similar strategies have been applied to multimodal reasoning, yet their impact on visual perception remains unclear. We propose Perception-Time Scaling (PTS), a novel paradigm that encourages token-rich perception and decomposes complex perception problems into intermediate tractable sub-problems.
arXiv Detail & Related papers (2025-10-10T03:17:52Z)
- Agentic Jigsaw Interaction Learning for Enhancing Visual Perception and Reasoning in Vision-Language Models [63.69856480318313]
AGILE formulates jigsaw solving as an interactive process, enabling the model to progressively engage with the environment. We show that AGILE substantially boosts performance on jigsaw tasks of varying complexity. We also demonstrate strong generalization across 9 general vision tasks, achieving an average improvement of 3.1%.
arXiv Detail & Related papers (2025-10-01T17:58:05Z)
- Training-free Uncertainty Guidance for Complex Visual Tasks with MLLMs [61.64185573373394]
We propose a training-free framework that uses an MLLM's intrinsic uncertainty as a proactive guidance signal. We introduce a unified mechanism that scores candidate visual inputs by response uncertainty, enabling the model to autonomously focus on the most salient data. Our work validates that harnessing intrinsic uncertainty is a powerful, general strategy for enhancing fine-grained multimodal performance.
arXiv Detail & Related papers (2025-10-01T09:20:51Z)
- More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models [17.431298099935344]
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Recent research has sought to extend reasoning to Vision-Language Models (VLMs). Our study uncovers the dual nature of multimodal reasoning, leading to recognition failures on otherwise basic visual questions. We propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories.
arXiv Detail & Related papers (2025-09-30T06:37:47Z)
- Revisiting Visual Understanding in Multimodal Reasoning through a Lens of Image Perturbation [38.740533834549716]
We show that language-only models can achieve comparable or even better performance than MLLMs that consume raw visual inputs. Motivated by this, we propose a simple visual perturbation framework that enhances perceptual robustness without requiring algorithmic modifications. Our findings highlight the critical role of visual perturbation in multimodal mathematical reasoning.
arXiv Detail & Related papers (2025-06-11T13:39:46Z)
- Perception-R1: Advancing Multimodal Reasoning Capabilities of MLLMs via Visual Perception Reward [77.34936657745578]
We propose Perception-R1, which introduces a novel visual perception reward that explicitly encourages MLLMs to perceive the visual content accurately. We show that Perception-R1 achieves state-of-the-art performance on most benchmarks using only 1,442 training samples.
arXiv Detail & Related papers (2025-06-08T16:48:42Z)
- FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance [9.782362715017596]
We introduce FOLDER, a simple yet effective plug-and-play module designed to reduce the length of the visual token sequence. We analyze the information loss introduced by different reduction strategies and develop FOLDER to preserve key information while removing visual redundancy. FOLDER achieves comparable or even better performance than the original models, while dramatically reducing complexity by removing up to 70% of visual tokens.
arXiv Detail & Related papers (2025-01-05T03:28:45Z)
- Multi-modal Auto-regressive Modeling via Visual Words [96.25078866446053]
We propose the concept of visual tokens, which maps the visual features to probability distributions over Large Multi-modal Models' vocabulary.
We further explore the distribution of visual features in the semantic space within LMM and the possibility of using text embeddings to represent visual information.
arXiv Detail & Related papers (2024-03-12T14:58:52Z)
This list is automatically generated from the titles and abstracts of the papers on this site.