Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
- URL: http://arxiv.org/abs/2511.22586v1
- Date: Thu, 27 Nov 2025 16:19:34 GMT
- Title: Revisiting the Necessity of Lengthy Chain-of-Thought in Vision-centric Reasoning Generalization
- Authors: Yifan Du, Kun Zhou, Yingqian Min, Yue Ling, Wayne Xin Zhao, Youbin Wu,
- Abstract summary: We study how different Chain-of-Thought (CoT) designs affect the acquisition of generalizable visual reasoning ability. We compare three representative CoT formats: Language CoT, Grounding CoT, and Visual CoT. Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling.
- Score: 55.6995787502694
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study how different Chain-of-Thought (CoT) designs affect the acquisition of the generalizable visual reasoning ability in vision-language models (VLMs). While CoT data, especially long or visual CoT such as "think with image", has been widely used to supervise intermediate reasoning, it remains unclear why specific CoT designs help and which ones truly support generalizable reasoning. To systematically evaluate this, we focus on a controlled maze-solving benchmark where reasoning rules are fully visual, difficulty can be tuned by grid size, and all the intermediate steps can be automatically generated. Using Qwen2.5-VL-7B under a standard SFT-then-RL pipeline, we compare three representative CoT formats: Language CoT, Grounding CoT (with spatial coordinate trajectories), and Visual CoT (with image manipulations). Our experiments reveal that visual and longer CoT mainly accelerate convergence but do not lift the final performance ceiling; concise CoT containing only essential grounding steps outperforms longer traces; and, strikingly, CoT retaining only the minimal grounding results generalizes best across different maze sizes. We further validate these insights on other vision-centric tasks. These findings highlight a "short is long" effect and provide practical guidance for constructing more generalizable SFT datasets for visual reasoning.
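The benchmark itself is not reproduced here, but its key properties (tunable grid size, fully visual rules, automatically generated intermediate steps) are easy to picture with a minimal sketch. The Python below is an illustration of that setup, not the paper's released code; `make_maze`, `bfs_path`, `to_cot`, and the CoT renderings are assumptions made for exposition.

```python
# Illustrative sketch only (not the paper's released benchmark code):
# a tiny maze task whose intermediate steps are generated automatically
# and can be rendered as different CoT formats.
import random
from collections import deque

def make_maze(n: int, wall_prob: float = 0.25, seed: int = 0):
    """Random n x n grid; 0 = free, 1 = wall; start and goal stay free.
    A sample may be unsolvable; callers can resample on failure."""
    rng = random.Random(seed)
    grid = [[int(rng.random() < wall_prob) for _ in range(n)] for _ in range(n)]
    grid[0][0] = grid[n - 1][n - 1] = 0
    return grid

def bfs_path(grid):
    """Shortest path from (0,0) to (n-1,n-1), or None if unsolvable."""
    n = len(grid)
    prev = {(0, 0): None}
    queue = deque([(0, 0)])
    while queue:
        r, c = queue.popleft()
        if (r, c) == (n - 1, n - 1):
            path, cur = [], (r, c)
            while cur is not None:
                path.append(cur)
                cur = prev[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < n and 0 <= nc < n and grid[nr][nc] == 0 and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                queue.append((nr, nc))
    return None

def to_cot(path, fmt="grounding_minimal"):
    """Render one solution either as a concise coordinate-only trace
    (the format the paper finds generalizes best) or as a verbose
    Language CoT narrating every move."""
    if fmt == "grounding_minimal":
        return " -> ".join(f"({r},{c})" for r, c in path)
    names = {(1, 0): "down", (-1, 0): "up", (0, 1): "right", (0, -1): "left"}
    steps = [f"Cell ({r1},{c1}) is free, so move {names[(r1 - r0, c1 - c0)]}."
             for (r0, c0), (r1, c1) in zip(path, path[1:])]
    return " ".join(steps)
```

Because the solver's trace is produced programmatically, the same path can be re-rendered at any CoT granularity, which is what makes it possible to compare long language traces against minimal coordinate-only grounding under otherwise identical supervision.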
Related papers
- On the Out-of-Distribution Generalization of Reasoning in Multimodal LLMs for Simple Visual Planning Tasks [56.98385132295952]
We evaluate how well chain-of-thought approaches generalize on a simple planning task. We find that reasoning traces which combine multiple text formats yield the best (and non-trivial) OOD generalization. Purely text-based models consistently outperform those utilizing image-based inputs.
arXiv Detail & Related papers (2026-02-17T09:51:40Z) - FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation [11.18316873483782]
Vision-and-Language Navigation (VLN) requires an embodied agent to jointly understand multimodal instructions and visual-spatial context. Recent works demonstrate the potential of Chain-of-Thought (CoT) reasoning for improving interpretability and long-horizon planning. We propose FantasyVLN, a unified implicit reasoning framework that preserves the benefits of CoT reasoning without explicit token overhead.
arXiv Detail & Related papers (2026-01-20T13:54:10Z) - Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? [18.16727716373833]
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with an algorithm that constrains reasoning length while providing dense accuracy-oriented feedback.
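The summary leaves the exact training algorithm unspecified, but the idea of pairing an accuracy reward with a reasoning-length constraint is easy to sketch. The penalty shape, token budget, and coefficient below are illustrative assumptions, not ReFine-RFT's actual formulation:

```python
# Hypothetical shaped reward pairing accuracy feedback with a
# reasoning-length constraint; the overflow penalty, budget, and
# lambda are illustrative assumptions, not ReFine-RFT's definition.
def shaped_reward(pred: str, target: str, cot_tokens: int,
                  budget: int = 256, lam: float = 0.1) -> float:
    acc = 1.0 if pred.strip() == target.strip() else 0.0
    # Penalize only tokens beyond the budget, capped so the penalty
    # never exceeds lam regardless of trace length.
    overflow = max(0, cot_tokens - budget) / budget
    return acc - lam * min(overflow, 1.0)
```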
arXiv Detail & Related papers (2026-01-11T17:07:47Z) - CIR-CoT: Towards Interpretable Composed Image Retrieval via End-to-End Chain-of-Thought Reasoning [93.05917922306196]
Composed Image Retrieval (CIR) aims to find a target image from a reference image and a modification text. CIR-CoT is the first end-to-end retrieval-oriented MLLM designed to integrate explicit Chain-of-Thought (CoT) reasoning.
arXiv Detail & Related papers (2025-10-09T09:41:45Z) - Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision [30.155319213322013]
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs). We propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning.
arXiv Detail & Related papers (2025-08-07T17:45:17Z) - Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.81815833343026]
We introduce Point-RFT, a multimodal reasoning framework explicitly designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: first, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to corresponding visual elements. On ChartQA, our approach improves accuracy from 70.88% (language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning relying solely on text-based CoT.
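As a rough picture of what a rationale "explicitly grounded to corresponding visual elements" might look like as a training record, consider the hypothetical example below; the field names and coordinate convention are assumptions, not Point-RFT's released schema:

```python
# Hypothetical visually grounded CoT training record; field names and
# the pixel (x, y) point convention are illustrative, not Point-RFT's
# actual 71K-example schema.
example = {
    "image": "chart_00042.png",
    "question": "Which month had the highest revenue?",
    "rationale": [
        {"step": "Locate the tallest bar.", "point": [412, 188]},
        {"step": "Read the x-axis label under that bar.", "point": [412, 356]},
    ],
    "answer": "July",
}
```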
arXiv Detail & Related papers (2025-05-26T08:54:14Z) - Reasoning by Superposition: A Theoretical Perspective on Chain of Continuous Thought [64.43689151961054]
We prove that a two-layer transformer with $D$ steps of continuous CoTs can solve the directed graph reachability problem. In our construction, each continuous thought vector is a superposition state that encodes multiple search frontiers simultaneously.
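The superposition idea has a simple linear-algebra analogue: encode a set of frontier nodes as one multi-hot vector, and a single adjacency multiply expands every frontier node in parallel, so $D$ steps decide reachability within distance $D$. The sketch below illustrates that intuition only; it is not the paper's transformer construction:

```python
# Multi-hot "superposition" frontier for directed reachability: one
# adjacency multiply expands all frontier nodes at once. This mimics
# the role the paper assigns to a continuous thought vector; it is not
# the paper's two-layer transformer construction.
import numpy as np

def reachable(adj: np.ndarray, source: int, target: int, d_steps: int) -> bool:
    frontier = np.zeros(adj.shape[0])
    frontier[source] = 1.0  # superposed state encoding the set {source}
    for _ in range(d_steps):
        # adj[i, j] = 1 means an edge i -> j; keep visited nodes in the state.
        frontier = np.minimum(frontier + adj.T @ frontier, 1.0)
    return bool(frontier[target] > 0)

# Directed chain 0 -> 1 -> 2; node 3 is unreachable from 0.
A = np.zeros((4, 4))
A[0, 1] = A[1, 2] = 1.0
assert reachable(A, 0, 2, d_steps=2) and not reachable(A, 0, 3, d_steps=4)
```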
arXiv Detail & Related papers (2025-05-18T18:36:53Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems. While longer CoTs are often presumed superior, this paper argues that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)