CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
- URL: http://arxiv.org/abs/2512.17312v1
- Date: Fri, 19 Dec 2025 07:52:23 GMT
- Title: CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning
- Authors: Qi Song, Honglin Li, Yingchen Yu, Haoyi Zhou, Lin Yang, Song Bai, Qi She, Zilong Huang, Yunqing Zhao
- Abstract summary: We introduce CodeDance, which explores executable code as a general solver for visual reasoning. CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts. We show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models.
- Score: 47.30236915430168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent releases such as o3 highlight human-like "thinking with images" reasoning that combines structured tool use with stepwise verification, yet most open-source approaches still rely on text-only chains, rigid visual schemas, or single-step pipelines, limiting flexibility, interpretability, and transferability on complex tasks. We introduce CodeDance, which explores executable code as a general solver for visual reasoning. Unlike fixed-schema calls (e.g., only predicting bounding-box coordinates), CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts (e.g., boxes, lines, plots) that support transparent, self-checkable reasoning. To guide this process, we introduce a balanced and adaptive tool-call reward, which balances exploration with efficiency and mitigates tool overuse. Interestingly, beyond the expected capabilities taught by atomic supervision, we empirically observe novel emergent behaviors during RL training: CodeDance demonstrates novel tool invocations, unseen compositions, and cross-task transfer. These behaviors arise without task-specific fine-tuning, suggesting a general and scalable mechanism for executable visual reasoning. Extensive experiments across reasoning benchmarks (e.g., visual search, math, chart QA) show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models such as GPT-4o and larger open-source models.
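To make the idea concrete, here is a minimal, hypothetical sketch (ours, not the paper's released code) of what one executable visual-reasoning step could look like: the model emits Python that crops a region for closer inspection, renders a bounding box as a visual artifact, and computes an intermediate quantity that later steps can check. The file name, function name, and coordinates are all illustrative assumptions.

```python
# Hypothetical sketch of one executable visual-reasoning step in the spirit of
# CodeDance (illustrative names and values, not the paper's implementation):
# crop a region of interest, render the box as a visual artifact, and return
# an intermediate, checkable quantity for subsequent reasoning steps.
from PIL import Image, ImageDraw

def zoom_and_annotate(image_path: str, box: tuple):
    """Crop `box` from the image and return the crop plus an annotated copy."""
    img = Image.open(image_path).convert("RGB")
    crop = img.crop(box)                       # zoomed view for inspection
    annotated = img.copy()
    ImageDraw.Draw(annotated).rectangle(box, outline="red", width=3)
    return crop, annotated

if __name__ == "__main__":
    # "scene.jpg" and the box are placeholders for model-chosen arguments.
    crop, annotated = zoom_and_annotate("scene.jpg", (120, 80, 360, 240))
    area_ratio = (crop.width * crop.height) / (annotated.width * annotated.height)
    print(f"inspected region covers {100 * area_ratio:.1f}% of the image")
```

The point, per the abstract, is that both the rendered artifact (the drawn box) and the intermediate number are inspectable, so each reasoning step can be verified rather than trusted.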
Related papers
- ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization [62.03035862528452]
ForgeryVCR is a framework that materializes imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks.
arXiv Detail & Related papers (2026-02-15T11:14:47Z) - ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents [16.06309106596998]
ToolTok is a novel paradigm of multi-step pathfinding for GUI agents. We devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. We construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding.
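As a rough reading of the summary's "learnable token embeddings" idea, the sketch below (our own assumption-laden toy, not ToolTok's code) appends one trainable embedding per GUI tool to an ordinary token embedding table, so tool selection can reuse the next-token machinery.

```python
# Toy sketch of tool tokenization (illustrative, not ToolTok's implementation):
# reserve ids beyond the text vocabulary for GUI tools, each backed by its own
# learnable embedding row.
import torch
import torch.nn as nn

TOOLS = ["click", "scroll", "type", "drag"]  # hypothetical tool set

class ToolAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, dim: int):
        super().__init__()
        self.text = nn.Embedding(vocab_size, dim)    # ordinary token table
        self.tools = nn.Embedding(len(TOOLS), dim)   # learnable tool tokens

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        vocab_size = self.text.num_embeddings
        is_tool = ids >= vocab_size                  # ids past the vocab are tools
        out = self.text(ids.clamp(max=vocab_size - 1))
        if is_tool.any():
            out[is_tool] = self.tools(ids[is_tool] - vocab_size)
        return out

emb = ToolAwareEmbedding(vocab_size=1000, dim=64)
ids = torch.tensor([5, 42, 1000, 1003])  # last two ids select "click" and "drag"
print(emb(ids).shape)                    # torch.Size([4, 64])
```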
arXiv Detail & Related papers (2026-01-30T08:38:05Z) - MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z) - Latent Implicit Visual Reasoning [59.39913238320798]
We propose a task-agnostic mechanism that trains LMMs to discover and use visual reasoning tokens without explicit supervision. Our approach outperforms direct fine-tuning and achieves state-of-the-art results on a diverse range of vision-centric tasks.
arXiv Detail & Related papers (2025-12-24T14:59:49Z) - SIT-Graph: State Integrated Tool Graph for Multi-Turn Agents [35.85800795225018]
State Integrated Tool Graph (SIT-Graph) is inspired by human decision-making that integrates episodic and procedural memory. At inference time, SIT-Graph enables a human-like balance between episodic recall and procedural execution. Experiments across multiple stateful multi-turn tool-use benchmarks show that SIT-Graph consistently outperforms strong memory- and graph-based baselines.
arXiv Detail & Related papers (2025-12-08T08:27:24Z) - Thinking with Programming Vision: Towards a Unified View for Thinking with Images [23.596757163808906]
We show that even state-of-the-art MLLMs are surprisingly brittle, suffering significant performance degradation on images with simple orientation changes or natural corruptions. We propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation.
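The "code as a universal interface" idea can be illustrated with a tiny sandbox loop, sketched below under invented names: a model-generated snippet (hard-coded here as a stand-in) executes against a namespace exposing image operations. This is our reading of the summary, not CodeVision's actual framework, and the exec-based sandbox shown is not real isolation.

```python
# Minimal code-as-tool sketch (illustrative only): run a model-generated
# snippet against a namespace that exposes basic image operations.
from PIL import Image

def run_visual_code(snippet: str, image: Image.Image) -> Image.Image:
    namespace = {
        "img": image,
        "rotate": lambda im, deg: im.rotate(deg, expand=True),
        "grayscale": lambda im: im.convert("L"),
    }
    exec(snippet, {"__builtins__": {}}, namespace)  # toy sandbox, NOT real isolation
    return namespace["result"]

model_output = "result = grayscale(rotate(img, 90))"  # stand-in for generated code
img = Image.new("RGB", (64, 32), "white")
print(run_visual_code(model_output, img).size)        # (32, 64) after rotation
```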
arXiv Detail & Related papers (2025-12-03T12:44:15Z) - CoSineVerifier: Tool-Augmented Answer Verification for Computation-Oriented Scientific Questions [32.14674040685995]
We introduce CoSineVerifier, a tool-augmented verifier that leverages external rubrics to perform precise computations and symbolic simplifications. Experiments conducted on STEM subjects, general QA, and long-form reasoning tasks demonstrate the strong generalization of CoSineVerifier.
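A toy version of tool-augmented verification, under our own assumptions rather than CoSineVerifier's rubric pipeline, is to check a candidate answer against a reference by symbolic simplification, so algebraically equivalent forms are accepted:

```python
# Toy tool-augmented verification sketch (not CoSineVerifier's pipeline):
# two expressions match if their symbolic difference simplifies to zero.
import sympy as sp

def verify(candidate: str, reference: str) -> bool:
    return sp.simplify(sp.sympify(candidate) - sp.sympify(reference)) == 0

print(verify("2*(x + 1)", "2*x + 2"))  # True  (equivalent forms)
print(verify("x + 1", "x - 1"))        # False (difference is 2)
```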
arXiv Detail & Related papers (2025-12-01T03:08:43Z) - CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization [11.951768962241713]
We show that high final-answer accuracy often hides unfaithful visual reasoning. We introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization.
arXiv Detail & Related papers (2025-11-24T19:48:46Z) - RECODE: Reasoning Through Code Generation for Visual Question Answering [68.86938437188964]
We propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
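To make "derendering" concrete under invented data: suppose a perception step has read three bar heights off a chart; the sketch below then emits executable matplotlib code that re-renders the chart, so the model's reading of the figure can be verified by execution. All values and names here are our illustrative assumptions, not RECODE's system.

```python
# Illustrative derendering sketch (not RECODE's code): turn values recovered
# from a chart image (hard-coded here) into an executable re-rendering program.
recovered = {"2021": 3.0, "2022": 4.5, "2023": 6.0}  # assumed perception output

def derender_to_code(values: dict) -> str:
    labels, heights = list(values), list(values.values())
    return (
        "import matplotlib.pyplot as plt\n"
        f"plt.bar({labels!r}, {heights!r})\n"
        "plt.savefig('rerendered.png')\n"
    )

code = derender_to_code(recovered)
print(code)  # inspect the generated program
exec(code)   # re-render; compare against the source chart to verify
```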
arXiv Detail & Related papers (2025-10-15T17:05:37Z) - Visual Jigsaw Post-Training Improves MLLMs [58.29961336087896]
We introduce Visual Jigsaw, a generic self-supervised post-training framework designed to strengthen visual understanding in multimodal large language models (MLLMs). Visual Jigsaw is formulated as a general ordering task: visual inputs are partitioned, shuffled, and the model must reconstruct the visual information by producing the correct permutation in natural language. Extensive experiments demonstrate substantial improvements in fine-grained perception, temporal reasoning, and 3D spatial understanding.
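The ordering task is easy to make concrete. The sketch below (our own construction with an assumed 2x2 grid, not the paper's code) partitions an image into patches, shuffles them, and records the permutation the model would have to verbalize to restore the original order.

```python
# Illustrative Visual Jigsaw-style example construction (our sketch):
# split an image into a grid, shuffle the patches, and keep the permutation
# as the supervision target the model must produce in natural language.
import random
from PIL import Image

def make_jigsaw(img: Image.Image, grid: int = 2):
    w, h = img.width // grid, img.height // grid
    patches = [img.crop((c * w, r * h, (c + 1) * w, (r + 1) * h))
               for r in range(grid) for c in range(grid)]
    order = list(range(len(patches)))
    random.shuffle(order)
    return [patches[i] for i in order], order  # shuffled patches + target

img = Image.new("RGB", (128, 128), "white")
patches, target = make_jigsaw(img)
print(len(patches), target)  # e.g. 4 [2, 0, 3, 1]
```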
arXiv Detail & Related papers (2025-09-29T17:59:57Z) - Instance-Aware Graph Prompt Learning [71.26108600288308]
We introduce Instance-Aware Graph Prompt Learning (IA-GPL) in this paper.
The process involves generating intermediate prompts for each instance using a lightweight architecture.
Experiments conducted on multiple datasets and settings showcase the superior performance of IA-GPL compared to state-of-the-art baselines.
arXiv Detail & Related papers (2024-11-26T18:38:38Z) - CausalVLR: A Toolbox and Benchmark for Visual-Linguistic Causal Reasoning [107.81733977430517]
CausalVLR (Causal Visual-Linguistic Reasoning) is an open-source toolbox containing a rich set of state-of-the-art causal relation discovery and causal inference methods.
These methods are included in the toolbox as PyTorch implementations running on NVIDIA computing systems.
arXiv Detail & Related papers (2023-06-30T08:17:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.