Related papers: UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios

URL: http://arxiv.org/abs/2509.21766v1
Date: Fri, 26 Sep 2025 02:04:00 GMT
Title: UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios
Authors: Haotian Luo, Huaisong Zhang, Xuelin Zhang, Haoyu Wang, Zeyu Qin, Wenjie Lu, Guozheng Ma, Haiying He, Yingsha Xie, Qiyang Zhou, Zixuan Hu, Hongze Mi, Yibo Wang, Naiqiang Tan, Hong Chen, Yi R. Fung, Chun Yuan, Li Shen,
Abstract summary: We introduce textbfUltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges.<n>Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules.<n>Our experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores.
Score: 63.67884284105684
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Autonomous agents have recently achieved remarkable progress across diverse domains, yet most evaluations focus on short-horizon, fully observable tasks. In contrast, many critical real-world tasks, such as large-scale software development, commercial investment, and scientific discovery, unfold in long-horizon and partially observable scenarios where success hinges on sustained reasoning, planning, memory management, and tool use. Existing benchmarks rarely capture these long-horizon challenges, leaving a gap in systematic evaluation. To bridge this gap, we introduce \textbf{UltraHorizon} a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. We use exploration as a unifying task across three distinct environments to validate these core competencies. Agents are designed in long-horizon discovery tasks where they must iteratively uncover hidden rules through sustained reasoning, planning, memory and tools management, and interaction with environments. Under the heaviest scale setting, trajectories average \textbf{200k+} tokens and \textbf{400+} tool calls, whereas in standard configurations they still exceed \textbf{35k} tokens and involve more than \textbf{60} tool calls on average. Our extensive experiments reveal that LLM-agents consistently underperform in these settings, whereas human participants achieve higher scores, underscoring a persistent gap in agents' long-horizon abilities. We also observe that simple scaling fails in our task. To better illustrate the failure of agents, we conduct an in-depth analysis of collected trajectories. We identify eight types of errors and attribute them to two primary causes: in-context locking and functional fundamental capability gaps. \href{https://github.com/StarDewXXX/UltraHorizon}{Our code will be available here.}

Related papers

BeyondSWE: Can Current Code Agent Survive Beyond Single-Repo Bug Fixing? [61.247730037229815]
We introduce BeyondSWE, a comprehensive benchmark that broadens existing evaluations along two axes - resolution scope and knowledge scope.<n>To investigate the role of external knowledge, we develop SearchSWE, a framework that integrates deep search with coding abilities.<n>This work offers both a realistic, challenging evaluation benchmark and a flexible framework to advance research toward more capable code agents.
arXiv Detail & Related papers (2026-03-03T17:52:01Z)
LongCLI-Bench: A Preliminary Benchmark and Study for Long-horizon Agentic Programming in Command-Line Interfaces [65.11019654023978]
LongCLI-Bench is a benchmark designed to evaluate agentic capabilities across long-horizon, realistic tasks.<n>We curated 20 high-quality, long-horizon tasks from over 1,000 computer science assignments and real-world tasks.<n>Experiments reveal that even state-of-the-art agents achieve pass rates below 20% in LongCLI-Bench.
arXiv Detail & Related papers (2026-02-15T23:12:57Z)
Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers [27.817039954088315]
We introduce textbfGeoEvolver, a self-evolving multi-agent system for learning tool-level expertise.<n>We show that GeoEvolver consistently improves end-to-end task success, with an average gain of 12% across multiple backbones.
arXiv Detail & Related papers (2026-01-30T15:11:07Z)
What Do LLM Agents Know About Their World? Task2Quiz: A Paradigm for Studying Environment Understanding [50.35012849818872]
Large language model (LLM) agents have demonstrated remarkable capabilities in complex decision-making and tool-use tasks.<n>We propose Task-to-Quiz (T2Q), a deterministic and automated evaluation paradigm designed to decouple task execution from world-state understanding.<n>Our experiments reveal that task success is often a poor proxy for environment understanding, and that current memory machanism can not effectively help agents acquire a grounded model of the environment.
arXiv Detail & Related papers (2026-01-14T14:09:11Z)
Multi-Agent Craftax: Benchmarking Open-Ended Multi-Agent Reinforcement Learning at the Hyperscale [53.08403177911567]
textitCraftax-MA is an extension of the popular open-ended RL environment, Craftax.<n>textitCraftax-Coop introduces heterogeneous agents, trading and more mechanics that require complex cooperation among agents for success.
arXiv Detail & Related papers (2025-11-07T01:09:36Z)
BrowseMaster: Towards Scalable Web Browsing via Tool-Augmented Programmatic Agent Pair [28.052062258597225]
Current large language model (M)-based agents struggle to achieve balance due to limitations in search breadth and reasoning depth.<n>We propose BrowseMaster framework built around augmented planner-executor agent pair.<n>Tests on English and Chinese show that BrowseMaster consistently outperforms open benchmarks and proprietary baselines, achieving scores of 3 on BrowseComp-en and 46.5 on BrowseComp-zh, which demonstrates its strong capability in complex, reasoning-heavy information-seeking tasks at scale.
arXiv Detail & Related papers (2025-08-12T17:56:25Z)
Visual Document Understanding and Question Answering: A Multi-Agent Collaboration Framework with Test-Time Scaling [83.78874399606379]
We propose MACT, a Multi-Agent Collaboration framework with Test-Time scaling.<n>It comprises four distinct small-scale agents, with clearly defined roles and effective collaboration.<n>It shows superior performance with a smaller parameter scale without sacrificing the ability of general and mathematical tasks.
arXiv Detail & Related papers (2025-08-05T12:52:09Z)
Breakpoint: Scalable evaluation of system-level reasoning in LLM code agents [40.37993572657772]
We introduce Breakpoint, a benchmarking methodology that automatically generates code-repair tasks by adversarially corrupting functions.<n>We demonstrate that our methodology can scale to arbitrary difficulty, with state-of-the-art models' success rates ranging from 55% on the easiest tasks down to 0% on the hardest.
arXiv Detail & Related papers (2025-05-30T19:23:51Z)
Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks [94.19506319646376]
We introduce Agent-X, a benchmark for evaluating vision-centric agents in real-world, multimodal settings.<n>Agent-X features 828 agentic tasks with authentic visual contexts, including images, multi-image comparisons, videos, and instructional text.<n>Our results reveal that even the best-performing models, including GPT, Gemini, and Qwen families, struggle to solve multi-step vision tasks.
arXiv Detail & Related papers (2025-05-30T17:59:53Z)
Spatial Reasoning and Planning for Deep Embodied Agents [2.7195102129095003]
This thesis explores the development of data-driven techniques for spatial reasoning and planning tasks. It focuses on enhancing learning efficiency, interpretability, and transferability across novel scenarios.
arXiv Detail & Related papers (2024-09-28T23:05:56Z)
MMAU: A Holistic Benchmark of Agent Capabilities Across Diverse Domains [54.117238759317004]
Massive Multitask Agent Understanding (MMAU) benchmark features comprehensive offline tasks that eliminate the need for complex environment setups. It evaluates models across five domains, including Tool-use, Directed Acyclic Graph (DAG) QA, Data Science and Machine Learning coding, Contest-level programming and Mathematics. With a total of 20 meticulously designed tasks encompassing over 3K distinct prompts, MMAU provides a comprehensive framework for evaluating the strengths and limitations of LLM agents.
arXiv Detail & Related papers (2024-07-18T00:58:41Z)
LoHoRavens: A Long-Horizon Language-Conditioned Benchmark for Robotic Tabletop Manipulation [38.66406497318709]
This work focuses on the tabletop manipulation task and releases a simulation benchmark, textitLoHoRavens, which covers various long-horizon reasoning aspects spanning color, size, space, arithmetics and reference. We investigate two methods of bridging the modality gap: caption generation and learnable interface for incorporating explicit and implicit observation feedback to the LLM.
arXiv Detail & Related papers (2023-10-18T14:53:14Z)
Wish you were here: Hindsight Goal Selection for long-horizon dexterous manipulation [14.901636098553848]
Solving tasks with a sparse reward in a sample-efficient manner poses a challenge to modern reinforcement learning. Existing strategies explore based on task-agnostic goal distributions, which can render the solution of long-horizon tasks impractical. We extend hindsight relabelling mechanisms to guide exploration along task-specific distributions implied by a small set of successful demonstrations.
arXiv Detail & Related papers (2021-12-01T16:12:32Z)
Batch Exploration with Examples for Scalable Robotic Reinforcement Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state-space guided by a modest number of human provided images of important states. BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
arXiv Detail & Related papers (2020-10-22T17:49:25Z)

This list is automatically generated from the titles and abstracts of the papers in this site.