LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
- URL: http://arxiv.org/abs/2603.02586v1
- Date: Tue, 03 Mar 2026 04:03:05 GMT
- Title: LiveAgentBench: Comprehensive Benchmarking of Agentic Systems Across 104 Real-World Challenges
- Authors: Hao Li, Huan Wang, Jinjie Gu, Wenjie Wang, Chenyi Zhuang, Sikang Bian,
- Abstract summary: We present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real user requirements. It is constructed from publicly sourced questions on social media and real-world products. This release includes 374 tasks, with 125 for validation and 249 for testing.
- Score: 34.17635007594549
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: As large language models grow more capable, general AI agents have become increasingly prevalent in practical applications. However, existing benchmarks face significant limitations, failing to represent real-world user tasks accurately. To address this gap, we present LiveAgentBench, a comprehensive benchmark with 104 scenarios that reflect real user requirements. It is constructed from publicly sourced questions on social media and real-world products. Central to our approach is the Social Perception-Driven Data Generation (SPDG) method, a novel process we developed to ensure each question's real-world relevance, task complexity, and result verifiability. We evaluate various models, frameworks, and commercial products using LiveAgentBench, revealing their practical performance and identifying areas for improvement. This release includes 374 tasks, with 125 for validation and 249 for testing. The SPDG process enables continuous updates with fresh queries from real-world interactions.
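The release's fixed split (125 validation / 249 test out of 374 tasks) suggests a straightforward consumption pattern. As a minimal, hypothetical sketch of how one might load such a split and score an agent on it — the `tasks.jsonl` layout, field names, and `run_agent` callable are assumptions for illustration, not the paper's actual API:

```python
import json

def load_tasks(path="tasks.jsonl"):
    """Load benchmark tasks; each line is assumed to be a JSON object with
    'id', 'split' ('validation' or 'test'), 'query', and 'expected' fields."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def evaluate(run_agent, tasks, split="validation"):
    """Score a user-supplied agent callable on one split.
    Exact-match checking is a placeholder for the paper's result verifiers."""
    subset = [t for t in tasks if t["split"] == split]
    passed = sum(run_agent(t["query"]) == t["expected"] for t in subset)
    return passed / len(subset) if subset else 0.0

# Hypothetical usage: tasks = load_tasks(); print(evaluate(my_agent, tasks, "test"))
```

In practice the paper's SPDG process emphasizes result verifiability, so per-task checkers would likely replace the exact-match comparison above.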
Related papers
- Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces [126.23612941699565]
Terminal-Bench 2.0 is a benchmark composed of 89 tasks in computer terminal environments, inspired by real-world problems. We show that frontier models and agents score less than 65% on the benchmark. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.
arXiv Detail & Related papers (2026-01-17T01:29:30Z)
- ABC-Bench: Benchmarking Agentic Backend Coding in Real-World Development [72.4729759618632]
We introduce ABC-Bench, a benchmark to evaluate agentic backend coding within a realistic, executable workflow. We curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Our evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks.
arXiv Detail & Related papers (2026-01-16T08:23:52Z)
- VitaBench: Benchmarking LLM Agents with Versatile Interactive Tasks in Real-world Applications [20.065087936770215]
We introduce VitaBench, a benchmark that evaluates agents on versatile interactive tasks grounded in real-world settings. VitaBench presents agents with the most complex life-serving simulation environment to date, comprising 66 tools. Our comprehensive evaluation reveals that even the most advanced models achieve only a 30% success rate on cross-scenario tasks.
arXiv Detail & Related papers (2025-09-30T16:33:49Z)
- IWR-Bench: Can LVLMs reconstruct interactive webpage from a user interaction video? [56.33950760097989]
IWR-Bench is a novel benchmark for evaluating the capabilities of Large Vision-Language Models (LVLMs) in interactive webpage reconstruction from video. IWR-Bench comprises 113 meticulously curated tasks from 100 real-world websites, with 1,001 actions. This benchmark evaluates models on two fundamental challenges: comprehensive multi-modal reasoning to infer interaction logic from video and assets, and advanced code generation to translate this logic into functional code.
arXiv Detail & Related papers (2025-09-29T12:38:06Z) - MVISU-Bench: Benchmarking Mobile Agents for Real-World Tasks by Multi-App, Vague, Interactive, Single-App and Unethical Instructions [11.021990614727702]
We present MVISU-Bench, a benchmark that includes 404 tasks across 137 mobile applications. We also propose Aider, a plug-and-play module that acts as a dynamic prompter to mitigate risks and clarify user intent for mobile agents.
arXiv Detail & Related papers (2025-08-12T16:18:30Z) - OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks [52.87238755666243]
We present OmniEAR, a framework for evaluating how language models reason about physical interactions, tool usage, and multi-agent coordination in embodied tasks. We model continuous physical properties and complex spatial relationships across 1,500 scenarios spanning household and industrial domains. Our systematic evaluation reveals severe performance degradation when models must reason from constraints.
arXiv Detail & Related papers (2025-08-07T17:54:15Z) - REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites [9.58858258192147]
We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation.
arXiv Detail & Related papers (2025-04-15T18:22:55Z) - CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments [90.29937153770835]
We introduce CRMArena, a benchmark designed to evaluate AI agents on realistic tasks grounded in professional work environments. We show that state-of-the-art LLM agents succeed in less than 40% of the tasks with ReAct prompting (see the ReAct sketch after this list), and less than 55% even with function-calling abilities. Our findings highlight the need for enhanced agent capabilities in function-calling and rule-following before deployment in real-world work environments.
arXiv Detail & Related papers (2024-11-04T17:30:51Z) - AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks? [50.36826943689364]
We study whether language agents can perform realistic and time-consuming tasks on the web.
We introduce AssistantBench, a new benchmark consisting of 214 realistic tasks that can be automatically evaluated.
We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models.
arXiv Detail & Related papers (2024-07-22T15:18:45Z)
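Several entries above (CRMArena in particular) report results under ReAct prompting, which interleaves model-generated reasoning with tool calls. The sketch below is a minimal, library-agnostic version of that loop, assuming a generic `llm` callable (string in, string out) and a `tools` dict; none of these names come from the benchmarks themselves:

```python
import re

def react_loop(llm, tools, question, max_steps=8):
    """Minimal ReAct loop: the model alternates Thought/Action lines until
    it emits a Final Answer or exhausts its step budget."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)              # model emits its next Thought/Action
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match:                           # execute the requested tool call
            name, arg = match.groups()
            observation = tools.get(name, lambda a: f"unknown tool {name}")(arg)
            transcript += f"Observation: {observation}\n"
    return None  # no answer within budget: counts as a failed task
```

Function-calling variants replace the regex parse with the provider's structured tool-call output, which plausibly accounts for the higher success rates CRMArena reports for that setting.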
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.