Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning
Abstract Overview
This paper argues that a central problem in multimodal reasoning is not only how visual information is represented, but when visual evidence is introduced during reasoning. The authors analyze limitations of two dominant paradigms: pre-reasoning visual-to-text conversion can discard fine-grained evidence, while unified vision-language reasoning can become dominated by linguistic priors and weaken image grounding. They propose CSMR, a framework in which an LLM maintains an explicit reasoning state, decides when to query an independent visual perception module, and determines when to stop reasoning. The approach is evaluated in zero-shot settings across multiple multimodal reasoning benchmarks and is supported by ablation and hallucination analyses.
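The control flow described above (keep a reasoning state, decide whether to query the perception module, decide when to stop) can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ReasoningState`, `run_scheduler`, the `decide`/`perceive` callables, the `must_stop` flag, and the `max_steps` budget are all hypothetical names introduced here, and the toy policy stands in for the LLM and the visual module.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class ReasoningState:
    """Hypothetical explicit reasoning state: the question plus evidence gathered so far."""
    question: str
    evidence: List[Tuple[str, str]] = field(default_factory=list)  # (query, observation)


# A decision is ("query", visual_query_text) or ("stop", final_answer).
Decision = Tuple[str, str]


def run_scheduler(
    question: str,
    decide: Callable[[ReasoningState, bool], Decision],
    perceive: Callable[[str], str],
    max_steps: int = 5,
) -> Tuple[str, ReasoningState]:
    """Alternate between deciding and perceiving until the policy stops or the budget runs out."""
    state = ReasoningState(question)
    for _ in range(max_steps):
        action, payload = decide(state, False)
        if action == "stop":
            return payload, state
        # The policy asked for visual evidence: query the separate perception module.
        state.evidence.append((payload, perceive(payload)))
    # Budget exhausted: ask the policy for a final answer with must_stop=True.
    _, payload = decide(state, True)
    return payload, state


# Toy stand-ins (assumptions) for the LLM policy and the visual perception module.
def toy_decide(state: ReasoningState, must_stop: bool) -> Decision:
    if not state.evidence and not must_stop:
        return ("query", "What color is the traffic light?")
    if state.evidence:
        return ("stop", f"The light is {state.evidence[-1][1]}.")
    return ("stop", "unknown")


def toy_perceive(query: str) -> str:
    return "red"


answer, final_state = run_scheduler("Should the car stop?", toy_decide, toy_perceive)
print(answer, len(final_state.evidence))
```

In this sketch the policy sees only the accumulated state, so the timing of each visual query depends on what has already been observed rather than on a fixed up-front description of the image.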
Novelty
The work’s main novelty is a cognitive scheduling formulation for multimodal reasoning, where a language model explicitly controls the timing of visual evidence acquisition instead of relying on one-shot textualization or tightly coupled multimodal fusion. It is also distinctive in pairing this design with an empirical analysis that attributes failures in unified multimodal reasoning to language-dominant attention and insufficient visual-faithfulness constraints.
Results
In the Qwen2-based zero-shot setting, CSMR achieves the strongest results reported in the paper on all three benchmarks: 45.7 accuracy on M3CoT, 78.2 accuracy on ScienceQA, and 34.3 ROUGE-L on LLaVA-W. Ablations show that performance drops when either dynamic querying or flexible termination is removed. The hallucination analysis, run on a 200-example M3CoT subset, reports a 9-percentage-point increase in non-hallucinated samples over DDCoT, which corresponds to 18 more samples.
Key Points
- The paper identifies two failure modes in existing multimodal reasoning: evidence loss from static visual-to-text conversion and weakened visual grounding from language-dominant unified representations.
- CSMR decouples reasoning and perception, using an LLM-based cognitive core to issue reasoning-state-driven visual queries and to terminate once sufficient evidence has been collected.
- Empirical results and ablations indicate that iterative visual querying and flexible termination contribute materially to accuracy, to efficiency relative to DDCoT, and to reduced hallucination.