CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
- URL: http://arxiv.org/abs/2511.19661v1
- Date: Mon, 24 Nov 2025 19:48:46 GMT
- Title: CodeV: Code with Images for Faithful Visual Reasoning via Tool-Aware Policy Optimization
- Authors: Xinhai Hou, Shaoyuan Xu, Manan Biyani, Mayan Li, Jia Liu, Todd C. Hollon, Bryan Wang,
- Abstract summary: We show that high final-answer accuracy often hides unfaithful visual reasoning. We introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization.
- Score: 11.951768962241713
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic vision-language models are increasingly trained to "think with images" by calling image operations. However, we show that high final-answer accuracy often hides unfaithful visual reasoning: models may invoke tools on irrelevant regions or ignore tool outputs entirely, yet still guess the correct answer. In this work, we first propose a faithfulness evaluation protocol that measures whether intermediate visual tool outputs (e.g., crops) actually contain the queried evidence. This reveals that recent visual agents achieve high final-answer accuracy but exhibit low rates of faithful tool-use on visual search benchmarks. We then introduce CodeV, a code-based visual agent trained with Tool-Aware Policy Optimization (TAPO). TAPO is a process-level RL framework that augments GRPO with dense rewards defined directly on visual tool inputs and outputs, rather than on chain-of-thought tokens, making supervision easier to verify and less susceptible to reward hacking. CodeV represents visual tools as executable Python code, and TAPO assigns step-wise rewards based solely on the question and tool output, encouraging both necessary and evidence-consistent tool use. In a two-stage SFT+RL pipeline, CodeV achieves competitive or superior accuracy while substantially increasing faithful tool-use rates on related visual search benchmarks. Beyond visual search, CodeV attains strong performance on a range of multimodal reasoning and math benchmarks, suggesting that explicitly supervising intermediate tool behavior is crucial for building trustworthy, agentic visual reasoning systems.
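The abstract packs several mechanisms into a few sentences, so the sketch below makes them concrete. It is an illustrative reconstruction, not the authors' released code: the `Box` representation, the IoU-based faithfulness check with its 0.5 threshold, the `Trajectory` container, and the reward weights `w_final`/`w_step` are all assumptions introduced here for exposition.

```python
import statistics
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union between two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def faithful_tool_use(crop: Box, evidence_box: Box, thresh: float = 0.5) -> bool:
    """A crop counts as faithful if it sufficiently overlaps the annotated
    evidence region; the 0.5 threshold is an assumed value."""
    return iou(crop, evidence_box) >= thresh


@dataclass
class Trajectory:
    crops: List[Box]      # boxes produced by the agent's executed crop code
    answer_correct: bool  # verifiable final-answer signal


def tapo_reward(traj: Trajectory, evidence_box: Box,
                w_final: float = 1.0, w_step: float = 0.5) -> float:
    """Dense TAPO-style reward: final-answer correctness plus a step-wise
    term judged only on tool outputs (here, crop/evidence overlap), never
    on chain-of-thought tokens. The weights are illustrative."""
    r_final = w_final * float(traj.answer_correct)
    r_step = 0.0
    if traj.crops:
        n_faithful = sum(faithful_tool_use(c, evidence_box) for c in traj.crops)
        r_step = w_step * n_faithful / len(traj.crops)
    return r_final + r_step


def grpo_advantages(rewards: List[float]) -> List[float]:
    """Group-relative advantages as in GRPO: normalize each sampled
    trajectory's reward by its group's mean and standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0
    return [(r - mu) / sigma for r in rewards]
```

On a group of sampled trajectories, one would score each with `tapo_reward` and feed the scores to `grpo_advantages`; defining the step reward on tool outputs rather than on reasoning tokens is what makes the supervision easy to verify and hard to game.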
Related papers
- ForgeryVCR: Visual-Centric Reasoning via Efficient Forensic Tools in MLLMs for Image Forgery Detection and Localization [62.03035862528452]
ForgeryVCR is a framework that materializes imperceptible traces into explicit visual intermediates via Visual-Centric Reasoning. ForgeryVCR achieves state-of-the-art (SOTA) performance in both detection and localization tasks.
arXiv Detail & Related papers (2026-02-15T11:14:47Z) - GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents [39.807839972627015]
We present GUI-Eyes, a reinforcement learning framework for active visual perception in GUI tasks. We introduce a progressive perception strategy that decomposes decision-making into coarse exploration and fine-grained grounding. On the ScreenSpot-Pro benchmark, GUI-Eyes-3B achieves 44.8% grounding accuracy using only 3k labeled samples.
arXiv Detail & Related papers (2026-01-14T14:27:28Z) - CodeDance: A Dynamic Tool-integrated MLLM for Executable Visual Reasoning [47.30236915430168]
We introduce CodeDance, which explores executable code as a general solver for visual reasoning. CodeDance defines, composes, and executes code to orchestrate multiple tools, compute intermediate results, and render visual artifacts. We show that CodeDance not only consistently outperforms schema-driven and text-only baselines, but also surpasses advanced closed models.
arXiv Detail & Related papers (2025-12-19T07:52:23Z) - AdaTooler-V: Adaptive Tool-Use for Images and Videos [36.66944857910871]
AdaTooler-V is an MLLM that performs adaptive tool-use by determining whether a visual problem truly requires tools. AdaTooler-V-7B achieves an accuracy of 89.8% on the high-resolution benchmark V*, surpassing the commercial proprietary models GPT-4o and Gemini 1.5 Pro.
arXiv Detail & Related papers (2025-12-18T18:59:55Z) - Thinking with Programming Vision: Towards a Unified View for Thinking with Images [23.596757163808906]
We show that even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions. We propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation.
arXiv Detail & Related papers (2025-12-03T12:44:15Z) - RECODE: Reasoning Through Code Generation for Visual Question Answering [68.86938437188964]
We propose to leverage derendering -- the process of reverse-engineering visuals into executable code -- as a new modality for verifiable visual reasoning. Our work demonstrates that grounding visual perception in executable code provides a new path toward more accurate and verifiable multimodal reasoning.
arXiv Detail & Related papers (2025-10-15T17:05:37Z) - Reinforced Visual Perception with Tools [66.79840157663237]
We introduce a novel RL algorithm based on GRPO, designed to train models to reason with a suite of four visual tools. We show that our method achieves state-of-the-art performance on several perception-heavy benchmarks. Our ReVPT-3B and ReVPT-7B outperform the instruct models by 9.03% and 9.44% on CV-Bench.
arXiv Detail & Related papers (2025-09-01T17:57:49Z) - VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection [47.259066449806866]
VisTA is a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. We show that VisTA achieves substantial performance gains over training-free baselines. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
arXiv Detail & Related papers (2025-05-26T17:59:17Z) - VTool-R1: VLMs Learn to Think with Images via Reinforcement Learning on Multimodal Tool Use [33.83255323522487]
We introduce VTool-R1, the first framework that trains vision-language models to generate multimodal chains of thought. VTool-R1 integrates Python-based visual editing tools into the Reinforcement Learning Finetuning process.
arXiv Detail & Related papers (2025-05-25T18:23:39Z) - Visual Agentic Reinforcement Fine-Tuning [73.37007472426299]
This work highlights the effectiveness of Visual Agentic Reinforcement Fine-Tuning (Visual-ARFT) for enabling flexible and adaptive reasoning abilities in Large Vision-Language Models (LVLMs). With Visual-ARFT, open-source LVLMs gain the ability to browse websites for real-time information updates and write code to manipulate and analyze input images through cropping, rotation, and other image processing techniques. Our experimental results demonstrate that Visual-ARFT outperforms its baseline by +18.6% F1 / +13.0% EM on MAT-Coding and +10.3% F1 / +8.7% EM on MAT-Search.
arXiv Detail & Related papers (2025-05-20T11:59:25Z) - OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. We propose a novel reinforcement learning framework V-ToolRL to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z) - Acting Less is Reasoning More! Teaching Model to Act Efficiently [87.28134636548705]
Tool-integrated reasoning augments large language models with the ability to invoke external tools to solve tasks. Current approaches typically optimize only for final correctness without considering the efficiency or necessity of external tool use. We propose a framework that encourages models to produce accurate answers with minimal tool calls (a minimal sketch of such a reward follows this list). Our approach reduces tool calls by up to 68.3% and improves tool productivity by up to 215.4%, while maintaining comparable answer accuracy.
arXiv Detail & Related papers (2025-04-21T05:40:05Z)
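The last entry above optimizes for accuracy with minimal tool calls. As a rough illustration only (the paper's actual reward is not given in this summary), such an objective can be expressed as a correctness term minus a per-call cost; the penalty weight `lam` is an assumed hyperparameter:

```python
def efficient_tool_reward(answer_correct: bool, num_tool_calls: int,
                          lam: float = 0.1) -> float:
    """Correctness-first reward with a per-call cost that discourages
    unnecessary tool use; lam = 0.1 is an illustrative value."""
    return (1.0 if answer_correct else 0.0) - lam * num_tool_calls
```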
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.