DeepEyesV2: Toward Agentic Multimodal Model
- URL: http://arxiv.org/abs/2511.05271v2
- Date: Mon, 10 Nov 2025 15:43:16 GMT
- Title: DeepEyesV2: Toward Agentic Multimodal Model
- Authors: Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu,
- Abstract summary: Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. We introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks.
- Score: 3.775371242454792
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
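The two-stage pipeline described in the abstract (a supervised cold-start stage that establishes tool-use patterns, followed by a reinforcement learning stage that refines when tools are invoked) can be sketched as a toy simulation. Note that everything here is a hypothetical illustration: the scalar `tool_use_rate` "policy", the reward function, and all function names are invented for exposition, not taken from the paper, whose actual training uses a full multimodal LLM with real tool execution.

```python
# Illustrative sketch of a two-stage pipeline: cold-start imitation, then
# RL refinement of tool invocation. All names and the toy scalar "policy"
# are hypothetical; this is not the paper's implementation.
import random

def cold_start_sft(dataset):
    """Stage 1: imitate curated tool-use traces so the policy picks up the
    *pattern* of invoking tools (here reduced to a frequency estimate)."""
    tool_calls = sum(1 for example in dataset if example["uses_tool"])
    return tool_calls / len(dataset)  # imitated rate of tool use

def rl_refine(policy_rate, reward_fn, steps=1000, lr=0.05, seed=0):
    """Stage 2: adjust *when* to invoke tools using a scalar reward,
    a stand-in for the RL stage described in the abstract."""
    rng = random.Random(seed)
    for _ in range(steps):
        invoked = rng.random() < policy_rate
        # Nudge the invocation rate toward the higher-reward action.
        advantage = reward_fn(invoked) - reward_fn(not invoked)
        policy_rate += lr * advantage * (1 if invoked else -1)
        policy_rate = min(max(policy_rate, 0.01), 0.99)  # keep exploration
    return policy_rate

# Curated data where tool use is beneficial in 60% of examples.
data = [{"uses_tool": i % 5 < 3} for i in range(100)]
rate = cold_start_sft(data)                              # cold start
rate = rl_refine(rate, lambda used: 1.0 if used else 0.2)  # RL refinement
```

Under this toy reward (tool use always helps), the refinement stage pushes the invocation rate toward its upper bound; with a context-dependent reward it would instead settle at a selective rate, mirroring the context-conditional tool invocation the abstract reports.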
Related papers
- ToolTok: Tool Tokenization for Efficient and Generalizable GUI Agents [16.06309106596998]
ToolTok is a novel paradigm of multi-step pathfinding for GUI agents. We devise tools aligned with human interaction habits and represent each tool using learnable token embeddings. We construct an easy-to-hard curriculum consisting of three tasks: token definition question-answering, pure text-guided tool selection, and simplified visual pathfinding.
arXiv Detail & Related papers (2026-01-30T08:38:05Z) - AdaReasoner: Dynamic Tool Orchestration for Iterative Visual Reasoning [66.24374176797075]
We introduce AdaReasoner, a family of multimodal models that learn tool use as a general reasoning skill rather than as tool-specific or explicitly supervised behavior. AdaReasoner is enabled by (i) a scalable data curation pipeline exposing models to long-horizon, multi-step tool interactions; (ii) Tool-GRPO, a reinforcement learning algorithm that prioritizes tool selection and sequencing based on end-task success; and (iii) an adaptive learning mechanism that dynamically regulates tool usage.
arXiv Detail & Related papers (2026-01-26T16:04:43Z) - MindWatcher: Toward Smarter Multimodal Tool-Integrated Reasoning [55.221850286246]
We introduce MindWatcher, a tool-integrated reasoning agent with interleaved thinking and multimodal chain-of-thought (CoT) reasoning. MindWatcher can autonomously decide whether and how to invoke diverse tools and coordinate their use. A large-scale, high-quality local image retrieval database, covering eight categories including cars, animals, and plants, endows the model with robust object recognition.
arXiv Detail & Related papers (2025-12-29T12:16:12Z) - Code-in-the-Loop Forensics: Agentic Tool Use for Image Forgery Detection [59.04089915447622]
ForenAgent is an interactive IFD framework that enables MLLMs to autonomously generate, execute, and refine Python-based low-level tools around the detection objective. Inspired by human reasoning, we design a dynamic reasoning loop comprising global perception, local focusing, iterative probing, and holistic adjudication. Experiments show that ForenAgent exhibits emergent tool-use competence and reflective reasoning on challenging IFD tasks.
arXiv Detail & Related papers (2025-12-18T08:38:44Z) - Training Multi-Image Vision Agents via End2End Reinforcement Learning [51.81337984526068]
We propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning. By leveraging a multi-agent system, we generate challenging and visually-rich multi-image QA pairs. We develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content.
arXiv Detail & Related papers (2025-12-05T10:02:38Z) - SpaceTools: Tool-Augmented Spatial Reasoning via Double Interactive RL [33.692408134748696]
Vision Language Models (VLMs) demonstrate strong qualitative visual understanding, but struggle with metrically precise spatial reasoning. We introduce Double Interactive Reinforcement Learning (DIRL), a two-phase training framework where VLMs learn to coordinate multiple tools. Our model, SpaceTools, with tool-augmented spatial reasoning ability, achieves state-of-the-art performance on spatial understanding benchmarks.
arXiv Detail & Related papers (2025-12-03T18:50:04Z) - WebSeer: Training Deeper Search Agents through Reinforcement Learning with Self-Reflection [51.10348385624784]
We present WebSeer, a more intelligent search agent trained via reinforcement learning enhanced with a self-reflection mechanism. Our approach substantially extends tool-use chains and improves answer accuracy.
arXiv Detail & Related papers (2025-10-21T16:52:00Z) - Tool-Augmented Policy Optimization: Synergizing Reasoning and Adaptive Tool Use with Reinforcement Learning [29.280386584974455]
Recent advances in large language models (LLMs) have popularized test-time scaling, where models generate additional reasoning tokens before producing final answers. These approaches have demonstrated significant performance improvements on benchmarks involving mathematical reasoning. We propose Tool-Augmented Policy Optimization (TAPO), a novel reinforcement learning framework that integrates multi-hop reasoning with adaptive tool-calling capabilities.
arXiv Detail & Related papers (2025-10-08T14:04:27Z) - Adaptive Tool Generation with Models as Tools and Reinforcement Learning [3.592245101862886]
MTR is a simulation-first training framework for tool-augmented reasoning. It learns from complete ReAct traces with schema-validated, simulated observations. MTR attains Exact Match (EM) scores competitive with live-API systems.
arXiv Detail & Related papers (2025-10-08T09:48:50Z) - Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z) - VisualToolAgent (VisTA): A Reinforcement Learning Framework for Visual Tool Selection [47.259066449806866]
VisTA is a new reinforcement learning framework that empowers visual agents to dynamically explore, select, and combine tools from a diverse library based on empirical performance. We show that VisTA achieves substantial performance gains over training-free baselines. These results highlight VisTA's ability to enhance generalization, adaptively utilize diverse tools, and pave the way for flexible, experience-driven visual reasoning systems.
arXiv Detail & Related papers (2025-05-26T17:59:17Z) - OpenThinkIMG: Learning to Think with Images via Visual Tool Reinforcement Learning [57.89304342666846]
We introduce OpenThinkIMG, the first open-source, comprehensive end-to-end framework for tool-augmented LVLMs. We propose a novel reinforcement learning framework, V-ToolRL, to train LVLMs to learn adaptive policies for invoking external vision tools. V-ToolRL enables LVLMs to autonomously discover optimal tool-usage strategies.
arXiv Detail & Related papers (2025-05-13T14:35:51Z) - Adaptive Tool Use in Large Language Models with Meta-Cognition Trigger [49.81945268343162]
We propose MeCo, an adaptive decision-making strategy for external tool use. MeCo quantifies metacognitive scores by capturing high-level cognitive signals in the representation space. MeCo is fine-tuning-free and incurs minimal cost.
arXiv Detail & Related papers (2025-02-18T15:45:01Z) - CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update [69.59482029810198]
CLOVA is a Closed-Loop Visual Assistant that operates within a framework encompassing inference, reflection, and learning phases.
Results demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing.
arXiv Detail & Related papers (2023-12-18T03:34:07Z) - Towards A Unified Agent with Foundation Models [18.558328028366816]
We investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents.
We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges.
We demonstrate substantial performance improvements over baselines in exploration efficiency and ability to reuse data from offline datasets.
arXiv Detail & Related papers (2023-07-18T22:37:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.