AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
- URL: http://arxiv.org/abs/2602.23166v2
- Date: Mon, 02 Mar 2026 03:29:34 GMT
- Title: AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios
- Authors: Zhaochen Su, Jincheng Gao, Hangyu Guo, Zhenhua Liu, Lueyang Zhang, Xinyu Geng, Shijue Huang, Peng Xia, Guanyu Jiang, Cheng Wang, Yue Zhang, Yi R. Fung, Junxian He
- Abstract summary: Real-world multimodal agents solve multi-step workflows grounded in visual evidence. Existing benchmarks mainly evaluate single-turn visual reasoning or specific tool skills. We introduce AgentVista, a benchmark for generalist multimodal agents.
- Score: 32.58358574768901
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-world multimodal agents solve multi-step workflows grounded in visual evidence. For example, an agent can troubleshoot a device by linking a wiring photo to a schematic and validating the fix with online documentation, or plan a trip by interpreting a transit map and checking schedules under routing constraints. However, existing multimodal benchmarks mainly evaluate single-turn visual reasoning or specific tool skills, and they do not fully capture the realism, visual subtlety, and long-horizon tool use that practical agents require. We introduce AgentVista, a benchmark for generalist multimodal agents that spans 25 sub-domains across 7 categories, pairing realistic and detail-rich visual scenarios with natural hybrid tool use. Tasks require long-horizon tool interactions across modalities, including web search, image search, page navigation, and code-based operations for both image processing and general programming. Comprehensive evaluation of state-of-the-art models exposes significant gaps in their ability to carry out long-horizon multimodal tool use. Even the best model in our evaluation, Gemini-3-Pro with tools, achieves only 27.3% overall accuracy, and hard instances can require more than 25 tool-calling turns. We expect AgentVista to accelerate the development of more capable and reliable multimodal agents for realistic and ultra-challenging problem solving.
Related papers
- BrowseComp-$V^3$: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents [30.849897676091327]
Multimodal large language models (MLLMs) are evolving into autonomous agents capable of performing multimodal web browsing and deep search in open-world environments. We introduce BrowseComp-$V^3$, a novel benchmark consisting of 300 carefully curated and challenging questions spanning diverse domains. Our results highlight a fundamental gap between current model capabilities and robust multimodal deep search in real-world settings.
arXiv Detail & Related papers (2026-02-13T12:25:13Z)
- InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search [48.79494320593913]
We introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. We propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher).
arXiv Detail & Related papers (2025-12-21T14:23:07Z)
- Training Multi-Image Vision Agents via End2End Reinforcement Learning [51.81337984526068]
We propose IMAgent, an open-source vision agent trained via end-to-end reinforcement learning. By leveraging a multi-agent system, we generate challenging and visually rich multi-image QA pairs. We develop two specialized tools for visual reflection and confirmation, allowing the model to proactively reallocate its attention to image content.
arXiv Detail & Related papers (2025-12-05T10:02:38Z)
- DeepEyesV2: Toward Agentic Multimodal Model [3.775371242454792]
Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. We introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks.
arXiv Detail & Related papers (2025-11-07T14:31:20Z)
- The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution [86.4588675093384]
Toolathlon is a benchmark for language agents offering diverse apps and tools, realistic environment setup, and reliable execution-based evaluation. The benchmark includes 108 manually sourced or crafted tasks, each requiring interaction with multiple apps over around 20 turns on average to complete. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
arXiv Detail & Related papers (2025-10-29T17:32:49Z)
- MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents [78.3863007028688]
MM-BrowseComp is a novel benchmark comprising 224 challenging, hand-crafted questions. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy.
arXiv Detail & Related papers (2025-08-14T13:46:47Z)
- Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings. Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text. Our results reveal that even the best-performing models, including the GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z)
- VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks [93.85005277463802]
VisualWebArena is a benchmark designed to assess the performance of multimodal web agents on realistic tasks.
To perform well on this benchmark, agents need to accurately process image-text inputs, interpret natural language instructions, and execute actions on websites to accomplish user-defined objectives.
arXiv Detail & Related papers (2024-01-24T18:35:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.