CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
- URL: http://arxiv.org/abs/2512.17312v1
- Date: Fri, 19 Dec 2025 07:52:23 GMT
- Title: CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
- Authors: Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao
- Abstract summary: We introduce CodeDance, which explores executable code as a general solver for visual reasoning. CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts. We show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models.
- Score: 47.30236915430168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a balanced and adaptive tool-call reward, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism for executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
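To make the idea concrete, here is a minimal, hypothetical sketch (ours, not the paper's released code) of what one executable visual-reasoning step could look like: the model emits Python that crops a region for closer inspection, renders a bounding box as a visual artifact, and computes an intermediate quantity that later steps can check. The file name, function name, and coordinates are all illustrative assumptions.

```python
# Hypothetical sketch of one executable visual-reasoning step in the spirit of
# CodeDance (illustrative names and values, not the paper's implementation):
# crop a region of interest, render the box as a visual artifact, and return
# an intermediate, checkable quantity for subsequent reasoning steps.
from PIL import Image, ImageDraw

def zoom_and_annotate(image_path: str, box: tuple):
    """Crop `box` from the image and return the crop plus an annotated copy."""
    img = Image.open(image_path).convert("RGB")
    crop = img.crop(box)                       # zoomed view for inspection
    annotated = img.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=3)
    return crop, annotated

if __name__ == "__main__":
    # "scene.jpg" and the box are placeholders for model-chosen arguments.
    crop, annotated = zoom_and_annotate("scene.jpg", (120, 80, 360, 240))
    area_ratio = (crop.width * crop.height) / (annotated.width * annotated.height)
    print(f"inspected region covers {100 * area_ratio:.1f}% of the image")
```

The point, per the abstract, is that both the rendered artifact (the drawn box) and the intermediate number are inspectable, so each reasoning step can be verified rather than trusted.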
Related papers
- ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization [62.03035862528452]
ForgeryVCR is a framework that materializes imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks.
arXiv Detail & Related papers (2026-02-15T11:14:47Z) - ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents [16.06309106596998]
ToolTok is a novel paradigm of multi-step pathfinding for GUI agents. We devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. We construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding.
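As a rough reading of the summary's "learnable token embeddings" idea, the sketch below (our own assumption-laden toy, not ToolTok's code) appends one trainable embedding per GUI tool to an ordinary token embedding table, so tool selection can reuse the next-token machinery.

```python
# Toy sketch of tool tokenization (illustrative, not ToolTok's implementation):
# reserve ids beyond the text vocabulary for GUI tools, each backed by its own
# learnable embedding row.
import torch
import torch.nn as nn

TOOLS = ["click", "scroll", "type", "drag"]  # hypothetical tool set

class ToolAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.text = nn.Embedding(vocab_size, dim)    # ordinary token table
        self.tools = nn.Embedding(len(TOOLS), dim)   # learnable tool tokens

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        vocab_size = self.text.num_embeddings
        is_tool = ids >= vocab_size                  # ids past the vocab are tools
        out = self.text(ids.clamp(max=vocab_size - 1))
        if is_tool.any():
            out[is_tool] = self.tools(ids[is_tool] - vocab_size)
        return out

emb = ToolAwareEmbedding(vocab_size=1000, dim=64)
ids = torch.tensor([5, 42, 1000, 1003])  # last two ids select "click" and "drag"
print(emb(ids).shape)                    # torch.Size([4, 64])
```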
arXiv Detail & Related papers (2026-01-30T08:38:05Z) - MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z) - Latent Implicit Visual Reasoning [59.39913238320798]
We propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks.
arXiv Detail & Related papers (2025-12-24T14:59:49Z) - SIT-Graph: State Integrated Tool Graph for Multi-Turn Agents [35.85800795225018]
State Integrated Tool Graph (SIT-Graph) is inspired by human decision-making that integrates episodic and procedural memory. At inference time, SIT-Graph enables a human-like balance between episodic recall and procedural execution. Experiments across multiple stateful multi-turn tool-use benchmarks show that SIT-Graph consistently outperforms strong memory- and graph-based baselines.
arXiv Detail & Related papers (2025-12-08T08:27:24Z) - Thinking with Programming Vision: Towards a Unified View for Thinking with Images [23.596757163808906]
We show that even state-of-the-art MLLMs are surprisingly brittle, suffering significant performance degradation on images with simple orientation changes or natural corruptions. We propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation.
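The "code as a universal interface" idea can be illustrated with a tiny sandbox loop, sketched below under invented names: a model-generated snippet (hard-coded here as a stand-in) executes against a namespace exposing image operations. This is our reading of the summary, not CodeVision's actual framework, and the exec-based sandbox shown is not real isolation.

```python
# Minimal code-as-tool sketch (illustrative only): run a model-generated
# snippet against a namespace that exposes basic image operations.
from PIL import Image

def run_visual_code(snippet: str, image: Image.Image) -> Image.Image:
    namespace = {
        "img": image,
        "rotate": lambda im, deg: im.rotate(deg, expand=True),
        "grayscale": lambda im: im.convert("L"),
    }
    exec(snippet, {"__builtins__": {}}, namespace)  # toy sandbox, NOT real isolation
    return namespace["result"]

model_output = "result = grayscale(rotate(img, 90))"  # stand-in for generated code
img = Image.new("RGB", (64, 32), "white")
print(run_visual_code(model_output, img).size)        # (32, 64) after rotation
```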
arXiv Detail & Related papers (2025-12-03T12:44:15Z) - CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions [32.14674040685995]
We introduce CoSineVerifier, a tool-augmented verifier that leverages external rubrics to perform precise computations and symbolic simplifications. Experiments conducted on STEM subjects, general QA, and long-form reasoning tasks demonstrate the strong generalization of CoSineVerifier.
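A toy version of tool-augmented verification, under our own assumptions rather than CoSineVerifier's rubric pipeline, is to check a candidate answer against a reference by symbolic simplification, so algebraically equivalent forms are accepted:

```python
# Toy tool-augmented verification sketch (not CoSineVerifier's pipeline):
# two expressions match if their symbolic difference simplifies to zero.
import sympy as sp

def verify(candidate: str, reference: str) -> bool:
    return sp.simplify(sp.sympify(candidate) - sp.sympify(reference)) == 0

print(verify("2*(x + 1)", "2*x + 2"))  # True  (equivalent forms)
print(verify("x + 1", "x - 1"))        # False (difference is 2)
```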
arXiv Detail & Related papers (2025-12-01T03:08:43Z) - CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization [11.951768962241713]
We show that high final-answer accuracy often hides unfaithful visual reasoning. We introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization.
arXiv Detail & Related papers (2025-11-24T19:48:46Z) - RECODE: Reasoning Through Code Generation for Visual Question Answering [68.86938437188964]
We propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
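To make "derendering" concrete under invented data: suppose a perception step has read three bar heights off a chart; the sketch below then emits executable matplotlib code that re-renders the chart, so the model's reading of the figure can be verified by execution. All values and names here are our illustrative assumptions, not RECODE's system.

```python
# Illustrative derendering sketch (not RECODE's code): turn values recovered
# from a chart image (hard-coded here) into an executable re-rendering program.
recovered = {"2021": 3.0, "2022": 4.5, "2023": 6.0}  # assumed perception output

def derender_to_code(values: dict) -> str:
    labels, heights = list(values), list(values.values())
    return (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({labels!r}, {heights!r})\n"
        "plt.savefig('rerendered.png')\n"
    )

code = derender_to_code(recovered)
print(code)  # inspect the generated program
exec(code)   # re-render; compare against the source chart to verify
```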
arXiv Detail & Related papers (2025-10-15T17:05:37Z) - Visual Jigsaw Post-Training Improves MLLMs [58.29961336087896]
We introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in multimodal large language models (MLLMs). Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding.
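The ordering task is easy to make concrete. The sketch below (our own construction with an assumed 2x2 grid, not the paper's code) partitions an image into patches, shuffles them, and records the permutation the model would have to verbalize to restore the original order.

```python
# Illustrative Visual Jigsaw-style example construction (our sketch):
# split an image into a grid, shuffle the patches, and keep the permutation
# as the supervision target the model must produce in natural language.
import random
from PIL import Image

def make_jigsaw(img: Image.Image, grid: int = 2):
    w, h = img.width // grid, img.height // grid
    patches = [img.crop((c * w, r * h, (c + 1) * w, (r + 1) * h))
               for r in range(grid) for c in range(grid)]
    order = list(range(len(patches)))
    random.shuffle(order)
    return [patches[i] for i in order], order  # shuffled patches + target

img = Image.new("RGB", (128, 128), "white")
patches, target = make_jigsaw(img)
print(len(patches), target)  # e.g. 4 [2, 0, 3, 1]
```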
arXiv Detail & Related papers (2025-09-29T17:59:57Z) - Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z) - CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning [107.81733977430517]
CausalVLR (Causal Visual-Linguistic Reasoning) is an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods.
These methods are included in the toolbox as PyTorch implementations running on NVIDIA computing systems.
arXiv Detail & Related papers (2023-06-30T08:17:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.