Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
- URL: http://arxiv.org/abs/2512.17306v1
- Date: Fri, 19 Dec 2025 07:44:43 GMT
- Title: Deep But Reliable: Advancing Multi-turn Reasoning for Thinking with Images
- Authors: Wenhao Yang, Yu Xia, Jinlong Huang, Shiyin Lu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, Yuanyu Wan, Lijun Zhang
- Abstract summary: We propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Based on a high-resolution image dataset, we construct high-difficulty and verifiable visual question-answer pairs. In the SFT stage, we collect tool trajectories as cold-start data, guiding a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern.
- Score: 53.373427633330515
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in large Vision-Language Models (VLMs) have exhibited strong reasoning capabilities on complex visual tasks by thinking with images in their Chain-of-Thought (CoT), achieved by actively invoking tools to analyze visual inputs rather than merely perceiving them. However, existing models often struggle to reflect on and correct their own incorrect reasoning trajectories. To address this limitation, we propose DRIM, a model that enables deep but reliable multi-turn reasoning when thinking with images in its multimodal CoT. Our pipeline comprises three stages: data construction, cold-start SFT, and RL. Based on a high-resolution image dataset, we construct high-difficulty, verifiable visual question-answer pairs, where solving each task requires multi-turn tool calls to reach the correct answer. In the SFT stage, we collect tool-call trajectories as cold-start data to instill a multi-turn reasoning pattern. In the RL stage, we introduce redundancy-penalized policy optimization, which incentivizes the model to develop a self-reflective reasoning pattern. The basic idea is to judge reasoning trajectories and penalize those that produce incorrect answers without sufficient multi-scale exploration. Extensive experiments demonstrate that DRIM achieves superior performance on visual understanding benchmarks.
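The redundancy penalty described in the abstract can be pictured as reward shaping on verifiable trajectories. Below is a minimal Python sketch of that idea; the `Trajectory` format, the zoom-based exploration proxy, and all thresholds are illustrative assumptions, not DRIM's actual implementation.

```python
# Hypothetical sketch only: trajectory format, exploration proxy, and
# thresholds are assumptions based on the abstract, not the paper's code.
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Trajectory:
    answer_correct: bool  # outcome of the verifiable answer check
    tool_calls: List[Dict] = field(default_factory=list)  # e.g. {"op": "zoom", "scale": 2.0}

def multiscale_exploration(traj: Trajectory) -> int:
    """Proxy for 'multi-scale exploration': count the distinct zoom scales
    the model inspected before answering (an assumed criterion)."""
    return len({round(c.get("scale", 1.0), 1)
                for c in traj.tool_calls if c.get("op") == "zoom"})

def shaped_reward(traj: Trajectory, min_scales: int = 2, penalty: float = -0.5) -> float:
    """Redundancy-penalized reward: correct answers earn full reward; wrong
    answers reached *without* sufficient multi-scale exploration are penalized
    harder than wrong answers that at least explored the image properly."""
    if traj.answer_correct:
        return 1.0
    if multiscale_exploration(traj) < min_scales:
        return penalty
    return 0.0
```

In an RL loop over sampled trajectories, `shaped_reward` could stand in for the plain 0/1 correctness signal, discouraging confident answers that skip genuine visual exploration.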
Related papers
- See It, Say It, Sorted: An Iterative Training-Free Framework for Visually-Grounded Multimodal Reasoning in LVLMs [24.90876091319589]
We present an iterative, training-free, plug-and-play framework for visually-grounded multimodal reasoning. Our key idea is to supervise each reasoning step at test time with visual evidence. Our method achieves 16.5%-29.5% improvements on TreeBench and 13.7% RH-AUC gains on RH-Bench.
arXiv Detail & Related papers (2026-02-25T02:13:59Z)
- ThinkRL-Edit: Thinking in Reinforcement Learning for Reasoning-Centric Image Editing [33.888289858260706]
Reinforcement learning (RL) has been investigated for improving the quality of image editing. RL faces three key challenges: (1) limited reasoning exploration confined to denoising, (2) biased reward fusion, and (3) unstable VLM-based instruction rewards. We propose ThinkRL-Edit, a reasoning-centric RL framework that decouples visual reasoning from image synthesis.
arXiv Detail & Related papers (2026-01-06T23:43:00Z)
- Monet: Reasoning in Latent Visual Space Beyond Images and Language [55.424507246294326]
"Thinking with images" has emerged as an effective paradigm for advancing visual reasoning. Existing methods fall short of human-like abstract visual thinking. We introduce Monet, a training framework that enables multimodal large language models to reason directly within the latent visual space.
arXiv Detail & Related papers (2025-11-26T13:46:39Z)
- More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models [17.431298099935344]
Reasoning has emerged as a pivotal capability in Large Language Models (LLMs). Recent research has sought to extend reasoning to Vision-Language Models (VLMs). Our study uncovers the dual nature of multimodal reasoning: extended reasoning can lead to recognition failures on otherwise basic visual questions. We propose Vision-Anchored Policy Optimization (VAPO), a simple yet effective method that explicitly steers the reasoning process toward visually grounded trajectories.
arXiv Detail & Related papers (2025-09-30T06:37:47Z)
- DeFacto: Counterfactual Thinking with Images for Enforcing Evidence-Grounded and Faithful Reasoning [11.952788515297913]
DeFacto is a counterfactual reasoning framework that jointly enforces accurate answering and faithful reasoning. We develop a pipeline that automatically localizes question-relevant evidence and constructs positive, counterfactual, and random variants (see the sketch after this list). Experiments on diverse benchmarks demonstrate that DeFacto substantially improves both answer accuracy and reasoning faithfulness.
arXiv Detail & Related papers (2025-09-25T08:58:10Z)
- Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search [85.201906907271]
Mini-o3 is a system that executes deep, multi-turn reasoning spanning tens of steps. Our recipe for reproducing OpenAI o3-style behaviors comprises three key components. Extensive experiments demonstrate that Mini-o3 produces rich reasoning patterns and deep thinking paths.
arXiv Detail & Related papers (2025-09-09T17:54:21Z)
- Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning [95.44766931218896]
Multi-modal large language models (MLLMs) still lag behind text-based models in reasoning. We introduce Perception-Reasoning Decoupling, which modularizes the MLLM's reasoning component and makes it easily replaceable. We propose a novel reinforcement learning algorithm called Visual Perception Optimization (VPO) to align the MLLM's perceptual output with the final reasoning task.
arXiv Detail & Related papers (2025-06-05T02:28:07Z)
- Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning [37.194825644787294]
We train visual language models (VLMs) to perform reasoning on image data through reinforcement learning and visual question-answer pairs. Our model, named Visionary-R1, outperforms strong multimodal models on multiple visual reasoning benchmarks.
arXiv Detail & Related papers (2025-05-20T17:58:35Z)
- Landscape of Thoughts: Visualizing the Reasoning Process of Large Language Models [58.64449765678416]
We introduce landscape of thoughts (LoT), a tool for inspecting the reasoning trajectories produced by different reasoning methods on any multiple-choice dataset. LoT distinguishes between strong and weak models, correct and incorrect answers, as well as different reasoning tasks. We showcase this advantage by adapting LoT into a lightweight verifier that evaluates the correctness of trajectories.
arXiv Detail & Related papers (2025-03-28T06:09:51Z)
- OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles [91.88062410741833]
We introduce OpenVLThinker, one of the first open-source large vision-language models (LVLMs) to exhibit sophisticated chain-of-thought reasoning. We show that OpenVLThinker-7B consistently advances performance across six benchmarks demanding mathematical and general reasoning.
arXiv Detail & Related papers (2025-03-21T17:52:43Z)
- Mitigating Visual Forgetting via Take-along Visual Conditioning for Multi-modal Long CoT Reasoning [53.790502697674754]
We propose Take-along Visual Conditioning (TVC), a strategy that shifts image input to critical reasoning stages. TVC helps the model retain attention to the visual components throughout the reasoning. Our approach achieves state-of-the-art performance on average across five mathematical reasoning benchmarks.
arXiv Detail & Related papers (2025-03-17T16:45:12Z)
- Retrieval-Based Interleaved Visual Chain-of-Thought in Real-World Driving Scenarios [69.00444996464662]
We propose RIV-CoT, a Retrieval-Based Interleaved Visual Chain-of-Thought method that enables vision-language models to reason using visual crops corresponding to relevant entities. Our experiments demonstrate that RIV-CoT improves answer accuracy by 3.1% and reasoning accuracy by 4.6% over vanilla CoT prompting.
arXiv Detail & Related papers (2025-01-08T18:31:16Z)
- Cantor: Inspiring Multimodal Chain-of-Thought of MLLM [83.6663322930814]
We argue that converging visual context acquisition and logical reasoning is pivotal for tackling visual reasoning tasks.
We propose an innovative multimodal CoT framework, termed Cantor, characterized by a perception-decision architecture.
Our experiments demonstrate the efficacy of the proposed framework, showing significant improvements in multimodal CoT performance.
arXiv Detail & Related papers (2024-04-24T17:59:48Z)
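For the DeFacto entry above (referenced as "see the sketch after this list"), here is a minimal Python sketch of how positive, counterfactual, and random image variants could be constructed from a localized evidence box. The box format, black-out masking strategy, and Pillow-based implementation are assumptions for illustration, not the paper's code.

```python
# Hypothetical illustration only: box format and masking strategy are assumed.
import random
from PIL import Image

def make_variants(image: Image.Image, evidence_box: tuple[int, int, int, int]):
    """evidence_box = (left, top, right, bottom) around question-relevant evidence."""
    image = image.convert("RGB")
    w, h = image.size
    left, top, right, bottom = evidence_box

    # Positive variant: keep only the evidence region, mask everything else.
    positive = Image.new("RGB", (w, h))  # black canvas
    positive.paste(image.crop(evidence_box), (left, top))

    # Counterfactual variant: mask out the evidence region, keep the rest.
    counterfactual = image.copy()
    counterfactual.paste((0, 0, 0), evidence_box)

    # Random variant: mask an arbitrary region of the same size elsewhere.
    bw, bh = right - left, bottom - top
    rx, ry = random.randint(0, w - bw), random.randint(0, h - bh)
    randomized = image.copy()
    randomized.paste((0, 0, 0), (rx, ry, rx + bw, ry + bh))

    return positive, counterfactual, randomized
```

Contrasting a model's answers across the three variants is what lets such a pipeline check whether the stated evidence actually drives the prediction.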