Thinking with Images via Self-Calling Agent
- URL: http://arxiv.org/abs/2512.08511v2
- Date: Thu, 11 Dec 2025 13:21:56 GMT
- Title: Thinking with Images via Self-Calling Agent
- Authors: Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
- Abstract summary: Self-Calling Chain-of-Thought (sCoT) is a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours.
- Score: 43.48244527974193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. To enhance optimization, sCoT employs group-relative policy optimization (GRPO) to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
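The abstract describes two mechanisms worth unpacking. First, the self-calling pattern: a main agent reasons in plain text, decomposes the visual question into atomic subtasks, and delegates each one to a parameter-sharing replica of itself running in an isolated context. Below is a minimal Python sketch of that decompose-and-delegate loop; every name in it (`call_model`, `solve_with_self_calling`, the prompt formats) is a hypothetical illustration rather than the authors' API, which lives in the linked repository.

```python
from typing import Callable, List

# Hypothetical stand-in for one forward pass of the shared policy model.
# In sCoT, the main agent and all subagents share a single set of parameters.
LLMCall = Callable[[str], str]

def solve_with_self_calling(question: str, image_crops: List[str],
                            call_model: LLMCall) -> str:
    """Main agent: plan in language only, delegate visual subtasks to
    parameter-sharing subagents, then aggregate their answers."""
    # 1. Language-only CoT: the plan never interleaves images into the
    #    main agent's own reasoning trace.
    plan = call_model(
        f"Decompose into atomic visual subtasks, one per line:\n{question}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Self-calling: each subtask runs in a fresh, isolated context, so a
    #    subagent sees only its own subtask and the relevant image region.
    sub_answers = [
        call_model(f"[image: {crop}]\nSubtask: {task}\nAnswer:")
        for task, crop in zip(subtasks, image_crops)
    ]

    # 3. The main agent folds the subagents' findings into a final answer.
    findings = "\n".join(f"- {t}: {a}" for t, a in zip(subtasks, sub_answers))
    return call_model(
        f"Question: {question}\nSubtask findings:\n{findings}\nFinal answer:")
```

Second, the training signal: group-relative policy optimization (GRPO) dispenses with a learned critic by standardizing each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of that advantage computation, assuming one scalar reward per rollout:

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardize each rollout's reward against its sampling group's
    mean and standard deviation (the group-relative baseline in GRPO)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0.0:  # all rollouts tied: no preference signal in this group
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Rollouts that beat their group's average receive positive advantages and are upweighted during the policy update; how sCoT shapes the rewards themselves is detailed in the paper and repository.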
Related papers
- Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? [18.16727716373833]
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with an algorithm that constrains reasoning length while providing dense accuracy-oriented feedback.
arXiv Detail & Related papers (2026-01-11T17:07:47Z)
- ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better [59.29940512530982]
We propose ChainV, a framework that dynamically integrates visual hints into the reasoning process. Our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks.
arXiv Detail & Related papers (2025-11-21T10:11:17Z)
- Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding [23.138205646078536]
Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks. We find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance on visual grounding tasks. We propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union rewards.
arXiv Detail & Related papers (2025-11-17T21:22:50Z)
- Improving Chain-of-Thought Efficiency for Autoregressive Image Generation [55.57836819892392]
We introduce ShortCoTI, a lightweight optimization framework for image generation. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich.
arXiv Detail & Related papers (2025-10-07T05:40:43Z)
- Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision [30.155319213322013]
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs). We propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning.
arXiv Detail & Related papers (2025-08-07T17:45:17Z)
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs using multimodal knowledge. We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning. Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning [45.517215214938844]
The chain-of-thought technique has been well received in multi-modal tasks.
We propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning.
arXiv Detail & Related papers (2024-04-06T07:39:44Z)
- Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.