Thinking with Images via Self-Calling Agent
- URL: http://arxiv.org/abs/2512.08511v2
- Date: Thu, 11 Dec 2025 13:21:56 GMT
- Title: Thinking with Images via Self-Calling Agent
- Authors: Wenxi Yang, Yuzhong Zhao, Fang Wan, Qixiang Ye
- Abstract summary: Self-Calling Chain-of-Thought (sCoT) is a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours.
- Score: 43.48244527974193
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Thinking-with-images paradigms have showcased remarkable visual reasoning capability by integrating visual information as dynamic elements into the Chain-of-Thought (CoT). However, optimizing interleaved multimodal CoT (iMCoT) through reinforcement learning remains challenging, as it relies on scarce high-quality reasoning data. In this study, we propose Self-Calling Chain-of-Thought (sCoT), a novel visual reasoning paradigm that reformulates iMCoT as a language-only CoT with self-calling. Specifically, a main agent decomposes the complex visual reasoning task into atomic subtasks and invokes its virtual replicas, i.e., parameter-sharing subagents, to solve them in isolated contexts. sCoT enjoys substantial training effectiveness and efficiency, as it requires no explicit interleaving between modalities. To enhance optimization, sCoT employs group-relative policy optimization (GRPO) to reinforce effective reasoning behavior. Experiments on HR-Bench 4K show that sCoT improves overall reasoning performance by up to $1.9\%$ with $\sim 75\%$ fewer GPU hours compared to strong baseline approaches. Code is available at https://github.com/YWenxi/think-with-images-through-self-calling.
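The abstract describes two mechanisms worth unpacking. First, the self-calling pattern: a main agent reasons in plain text, decomposes the visual question into atomic subtasks, and delegates each one to a parameter-sharing replica of itself running in an isolated context. Below is a minimal Python sketch of that decompose-and-delegate loop; every name in it (`call_model`, `solve_with_self_calling`, the prompt formats) is a hypothetical illustration rather than the authors' API, which lives in the linked repository.

```python
from typing import Callable, List

# Hypothetical stand-in for one forward pass of the shared policy model.
# In sCoT, the main agent and all subagents share a single set of parameters.
LLMCall = Callable[[str], str]

def solve_with_self_calling(question: str, image_crops: List[str],
                            call_model: LLMCall) -> str:
    """Main agent: plan in language only, delegate visual subtasks to
    parameter-sharing subagents, then aggregate their answers."""
    # 1. Language-only CoT: the plan never interleaves images into the
    #    main agent's own reasoning trace.
    plan = call_model(
        f"Decompose into atomic visual subtasks, one per line:\n{question}")
    subtasks = [line.strip() for line in plan.splitlines() if line.strip()]

    # 2. Self-calling: each subtask runs in a fresh, isolated context, so a
    #    subagent sees only its own subtask and the relevant image region.
    sub_answers = [
        call_model(f"[image: {crop}]\nSubtask: {task}\nAnswer:")
        for task, crop in zip(subtasks, image_crops)
    ]

    # 3. The main agent folds the subagents' findings into a final answer.
    findings = "\n".join(f"- {t}: {a}" for t, a in zip(subtasks, sub_answers))
    return call_model(
        f"Question: {question}\nSubtask findings:\n{findings}\nFinal answer:")
```

Second, the training signal: group-relative policy optimization (GRPO) dispenses with a learned critic by standardizing each rollout's reward against the other rollouts sampled for the same prompt. A minimal sketch of that advantage computation, assuming one scalar reward per rollout:

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float]) -> List[float]:
    """Standardize each rollout's reward against its sampling group's
    mean and standard deviation (the group-relative baseline in GRPO)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0.0:  # all rollouts tied: no preference signal in this group
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Rollouts that beat their group's average receive positive advantages and are upweighted during the policy update; how sCoT shapes the rewards themselves is detailed in the paper and repository.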
Related papers
- Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification? [18.16727716373833]
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC). We propose ReFine-RFT, a framework that combines ensemble rewards with an algorithm that constrains reasoning length while providing dense accuracy-oriented feedback.
arXiv Detail & Related papers (2026-01-11T17:07:47Z)
- ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better [59.29940512530982]
We propose ChainV, a framework that dynamically integrates visual hints into the reasoning process. Our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks.
arXiv Detail & Related papers (2025-11-21T10:11:17Z)
- Start Small, Think Big: Curriculum-based Relative Policy Optimization for Visual Grounding [23.138205646078536]
Chain-of-Thought (CoT) prompting has recently shown significant promise across various NLP and computer vision tasks. We find that reinforcement learning (RL)-based fine-tuned CoT reasoning can paradoxically degrade performance on visual grounding tasks. We propose Curriculum-based Relative Policy Optimization (CuRPO), a novel training strategy that leverages CoT length and generalized Intersection over Union rewards.
arXiv Detail & Related papers (2025-11-17T21:22:50Z)
- Improving Chain-of-Thought Efficiency for Autoregressive Image Generation [55.57836819892392]
We introduce ShortCoTI, a lightweight optimization framework for image generation. ShortCoTI rewards more concise prompts with an adaptive function that scales according to an estimated difficulty for each task. Our method eliminates verbose explanations and repetitive refinements, producing reasoning prompts that are both concise and semantically rich.
arXiv Detail & Related papers (2025-10-07T05:40:43Z)
- Uni-cot: Towards Unified Chain-of-Thought Reasoning Across Text and Vision [30.155319213322013]
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs). We propose Uni-CoT, a Unified Chain-of-Thought framework that enables coherent and grounded multimodal reasoning.
arXiv Detail & Related papers (2025-08-07T17:45:17Z)
- PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
- Will Pre-Training Ever End? A First Step Toward Next-Generation Foundation MLLMs via Self-Improving Systematic Cognition [89.50068130832635]
Self-Improving cognition (SIcog) is a self-learning framework for constructing next-generation foundation MLLMs using multimodal knowledge. We propose Chain-of-Description for step-by-step visual understanding and integrate structured Chain-of-Thought (CoT) reasoning to support in-depth multimodal reasoning. Experiments demonstrate SIcog's effectiveness in developing MLLMs with enhanced multimodal cognition.
arXiv Detail & Related papers (2025-03-16T00:25:13Z)
- Soft-Prompting with Graph-of-Thought for Multi-modal Representation Learning [45.517215214938844]
The chain-of-thought technique has been well received in multi-modal tasks.
We propose a novel Aggregation-Graph-of-Thought (AGoT) mechanism for soft-prompt tuning in multi-modal representation learning.
arXiv Detail & Related papers (2024-04-06T07:39:44Z)
- Multimodal Chain-of-Thought Reasoning in Language Models [94.70184390935661]
We propose Multimodal-CoT that incorporates language (text) and vision (images) modalities into a two-stage framework.
Experimental results on ScienceQA and A-OKVQA benchmark datasets show the effectiveness of our proposed approach.
arXiv Detail & Related papers (2023-02-02T07:51:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.