VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
- URL: http://arxiv.org/abs/2509.26490v2
- Date: Fri, 17 Oct 2025 08:04:19 GMT
- Title: VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications
- Authors: Wei He, Yueqing Sun, Hongyan Hao, Xueyuan Hao, Zhikang Xia, Qi Gu, Chengcheng Han, Dengchang Zhao, Hui Su, Kefeng Zhang, Man Gao, Xi Su, Xiaodong Cai, Xunliang Cai, Yu Yang, Yunke Zhao,
- Abstract summary: We introduce VitaBench, a benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks.
- Score: 20.065087936770215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As LLM-based agents are increasingly deployed in real-life scenarios, existing benchmarks fail to capture the inherent complexity of handling extensive information, leveraging diverse resources, and managing dynamic user interactions. To address this gap, we introduce VitaBench, a challenging benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. Drawing from daily applications in food delivery, in-store consumption, and online travel services, VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Through a framework that eliminates domain-specific policies, we enable flexible composition of these scenarios and tools, yielding 100 cross-scenario tasks (main results) and 300 single-scenario tasks. Each task is derived from multiple real user requests and requires agents to reason across temporal and spatial dimensions, utilize complex tool sets, proactively clarify ambiguous instructions, and track shifting user intent throughout multi-turn conversations. Moreover, we propose a rubric-based sliding window evaluator, enabling robust assessment of diverse solution pathways in complex environments and stochastic interactions. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks, and less than a 50% success rate on the others. Overall, we believe VitaBench will serve as a valuable resource for advancing the development of AI agents in practical real-world applications. The code, dataset, and leaderboard are available at https://vitabench.github.io/
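The abstract names a rubric-based sliding window evaluator but does not spell out its mechanics. Below is a minimal Python sketch of one plausible reading, in which each rubric item is checked against a sliding window of recent conversation turns, so that evidence for a criterion can appear anywhere in a long, stochastic dialogue. All names are hypothetical, and the string-matching checks stand in for what is presumably an LLM judge in VitaBench itself.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class RubricItem:
    """One checkable criterion, e.g. 'agent confirmed the delivery address'."""
    name: str
    check: Callable[[str], bool]  # judges the text of one window

def evaluate_trajectory(turns: List[str], rubric: List[RubricItem],
                        window_size: int = 5) -> float:
    """Slide a fixed-size window over the conversation; a rubric item counts
    as satisfied if its check passes in any window. Returns the fraction of
    rubric items satisfied (a task passes when this reaches 1.0)."""
    if not rubric:
        return 1.0
    satisfied = set()
    n_windows = max(1, len(turns) - window_size + 1)
    for start in range(n_windows):
        window_text = "\n".join(turns[start:start + window_size])
        for item in rubric:
            if item.name not in satisfied and item.check(window_text):
                satisfied.add(item.name)
    return len(satisfied) / len(rubric)

# Usage: two rubric items, satisfied in different parts of the dialogue.
turns = ["User: deliver lunch to my office",
         "Agent: which office address should I use?",
         "User: 1 Main St",
         "Agent: placed the order for delivery to 1 Main St"]
rubric = [RubricItem("clarified_address", lambda w: "which office" in w),
          RubricItem("used_final_address", lambda w: "1 Main St" in w)]
print(evaluate_trajectory(turns, rubric))  # 1.0
```

Windowed scoring, rather than checking fixed turn indices, is what tolerates the "diverse solution pathways" the abstract mentions: the same rubric item can be satisfied early or late, in any ordering of sub-goals.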
Related papers
- Agent World Model: Infinity Synthetic Environments for Agentic Reinforcement Learning [62.499592503950026]
Large language models (LLMs) have empowered autonomous agents to perform complex tasks that require multi-turn interactions with tools and environments. We propose Agent World Model (AWM), a fully synthetic environment generation pipeline. We scale to 1,000 environments covering everyday scenarios, in which agents can interact with rich toolsets.
arXiv Detail & Related papers (2026-02-10T18:55:41Z)
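As a rough illustration of the scaling claim in the Agent World Model entry above, one way a synthetic pipeline can reach 1,000 environments is to compose environment specs from seed inventories under a deterministic seed loop. The sketch below is purely hypothetical: the paper's generator is presumably LLM-driven, and none of these domain or tool names come from it.

```python
import random
from typing import Dict, List

# Hypothetical seed inventory; a real pipeline would be far richer.
DOMAINS = ["grocery", "fitness", "banking", "ticketing"]
TOOL_TEMPLATES = ["search_{d}", "create_{d}_order",
                  "cancel_{d}_order", "get_{d}_status"]

def synthesize_environment(seed: int) -> Dict[str, object]:
    """Compose one environment spec by sampling a domain and a toolset.
    Scaling to 1,000 environments is then just a loop over seeds."""
    rng = random.Random(seed)
    domain = rng.choice(DOMAINS)
    tools: List[str] = [t.format(d=domain)
                        for t in rng.sample(TOOL_TEMPLATES, k=3)]
    return {"domain": domain, "tools": tools}

envs = [synthesize_environment(i) for i in range(1000)]
print(len(envs), envs[0])
```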
- TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios [12.553634759736601]
TRIP-Bench is a long-horizon benchmark grounded in realistic travel-planning scenarios. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50% success on the easy split, with performance dropping below 10% on hard subsets.
arXiv Detail & Related papers (2026-02-02T05:43:08Z)
- Trajectory2Task: Training Robust Tool-Calling Agents with Synthesized Yet Verifiable Data for Complex User Intents [52.30603055218294]
Trajectory2Task is a verifiable data generation pipeline for studying tool use at scale under three realistic user scenarios. It converts valid tool-call trajectories into user-facing tasks with controlled intent adaptations. We benchmark seven state-of-the-art LLMs on the generated complex user scenario tasks and observe frequent failures.
arXiv Detail & Related papers (2026-01-28T00:36:13Z)
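The central move in the Trajectory2Task entry above, inverting a verified tool-call trace into a user-facing task whose reference answer is the trace itself, can be sketched in a few lines. Everything here is hypothetical: a real pipeline would paraphrase the instruction into natural language and apply the controlled intent adaptations the abstract mentions.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ToolCall:
    tool: str
    args: Dict[str, str]

def trajectory_to_task(trajectory: List[ToolCall]) -> Dict[str, object]:
    """Invert a verified tool-call trace into a task: the trace becomes the
    reference answer, and an instruction is rendered from the calls."""
    steps = "; then ".join(
        f"{c.tool}({', '.join(f'{k}={v}' for k, v in c.args.items())})"
        for c in trajectory)
    return {"instruction": f"Accomplish the following: {steps}",
            "reference_calls": [(c.tool, c.args) for c in trajectory]}

task = trajectory_to_task([ToolCall("search_flights", {"dest": "Tokyo"}),
                           ToolCall("book_flight", {"flight_id": "NH102"})])
print(task["instruction"])
# Accomplish the following: search_flights(dest=Tokyo); then book_flight(flight_id=NH102)
```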
- TongSIM: A General Platform for Simulating Intelligent Machines [59.27575233453533]
Embodied intelligence focuses on training agents within realistic simulated environments. TongSIM is a high-fidelity, general-purpose platform for training and evaluating embodied agents.
arXiv Detail & Related papers (2025-12-23T10:00:43Z)
- OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows [10.318744035680398]
Large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon reasoning. OdysseyBench is a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.
arXiv Detail & Related papers (2025-08-12T17:53:03Z)
- What Limits Virtual Agent Application? OmniBench: A Scalable Multi-Dimensional Benchmark for Essential Virtual Agent Capabilities [56.646832992178105]
We introduce OmniBench, a cross-platform, graph-based benchmark with an automated pipeline for synthesizing tasks of controllable complexity. We present OmniEval, a multidimensional evaluation framework that includes subtask-level evaluation, graph-based metrics, and comprehensive tests across 10 capabilities. Our dataset contains 36k graph-structured tasks across 20 scenarios, achieving a 91% human acceptance rate.
arXiv Detail & Related papers (2025-06-10T15:59:38Z)
- LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.78866929908871]
Large Action Models (LAMs) for AI Agents offer incredible potential but face challenges due to the need for high-quality training data. We present LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. Our framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment where Large Language Model (LLM) Agents can call tools and receive real-time feedback.
arXiv Detail & Related papers (2025-06-02T22:36:02Z)
- Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions [12.218102495632937]
Large language models (LLMs) demonstrate strong potential as agents for tool invocation due to their advanced comprehension and planning capabilities. We propose the Multi-Mission Tool Bench, in which each test case comprises multiple interrelated missions. We also propose a novel method to evaluate the accuracy and efficiency of agent decisions with dynamic decision trees.
arXiv Detail & Related papers (2025-04-03T14:21:33Z)
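To make the decision-tree evaluation in the Multi-Mission Tool Bench entry concrete: if each node is a mission state and edges are the valid next actions, a run can be judged by whether its action sequence traces a root-to-terminal path, and branching naturally admits several equally correct orderings. The static sketch below is hypothetical; the paper's trees are dynamic and also score efficiency, which could be added by comparing path length against the shortest accepting path.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DecisionNode:
    """A mission state; edges map each valid next action to the next state."""
    edges: Dict[str, "DecisionNode"] = field(default_factory=dict)
    terminal: bool = False

def correct(root: DecisionNode, actions: List[str]) -> bool:
    """A run is correct if its action sequence follows allowed edges from the
    root and ends at a terminal node; branches encode equally valid orderings."""
    node = root
    for action in actions:
        if action not in node.edges:
            return False  # invalid decision at this state
        node = node.edges[action]
    return node.terminal

# Two valid orderings: check the weather before or after searching trains.
done = DecisionNode(terminal=True)
book_a = DecisionNode(edges={"book": done})
book_b = DecisionNode(edges={"book": done})
root = DecisionNode(edges={"search": DecisionNode(edges={"weather": book_a}),
                           "weather": DecisionNode(edges={"search": book_b})})
print(correct(root, ["weather", "search", "book"]))  # True
print(correct(root, ["book"]))                       # False
```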
- SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation [89.24729958546168]
Smartphone agents are increasingly important for helping users control devices efficiently. We present SPA-Bench, a comprehensive SmartPhone Agent Benchmark designed to evaluate (M)LLM-based agents.
arXiv Detail & Related papers (2024-10-19T17:28:48Z)
- CRAB: Cross-environment Agent Benchmark for Multimodal Language Model Agents [49.68117560675367]
Crab is the first benchmark framework designed to support cross-environment tasks. Our framework supports multiple devices and can be easily extended to any environment with a Python interface. The experimental results demonstrate that the single agent with GPT-4o achieves the best completion ratio of 38.01%.
arXiv Detail & Related papers (2024-07-01T17:55:04Z)
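CRAB's claim that it can be easily extended to any environment with a Python interface suggests a small abstract base class as the extension contract. The sketch below is a guess at the minimal shape of such an interface, not CRAB's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

class Environment(ABC):
    """Contract a new device or platform implements so one agent loop can
    drive it uniformly; hypothetical, not CRAB's actual API."""

    @abstractmethod
    def observe(self) -> Dict[str, Any]:
        """Return the current observation (e.g. screenshot, DOM, app state)."""

    @abstractmethod
    def step(self, action: str, **kwargs: Any) -> Dict[str, Any]:
        """Execute one action and return the next observation."""

class MockDesktop(Environment):
    """Toy implementation used only to show the plug-in shape."""
    def __init__(self) -> None:
        self.history: List[str] = []
    def observe(self) -> Dict[str, Any]:
        return {"screen": "desktop", "history": list(self.history)}
    def step(self, action: str, **kwargs: Any) -> Dict[str, Any]:
        self.history.append(f"{action}{kwargs or ''}")
        return self.observe()

env = MockDesktop()
print(env.step("click", x=10, y=20)["history"])
```

Under a contract like this, adding a new device means implementing two methods, which is what would make a cross-environment benchmark extensible in practice.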