OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
- URL: http://arxiv.org/abs/2602.05843v1
- Date: Thu, 05 Feb 2026 16:31:43 GMT
- Title: OdysseyArena: Benchmarking Large Language Models For Long-Horizon, Active and Inductive Interactions
- Authors: Fangzhi Xu, Hang Yan, Qiushi Sun, Jinyang Wu, Zixian Huang, Muye Huang, Jingyang Gong, Zichen Ding, Kanzhi Cheng, Yian Wang, Xinyu Che, Zeyi Sun, Jian Zhang, Zhangyue Yin, Haoran Luo, Xuanjing Huang, Ben Kao, Jun Liu, Qika Lin
- Abstract summary: We introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We provide a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. We also introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons.
- Score: 66.84396313837765
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of Large Language Models (LLMs) has catalyzed the development of autonomous agents capable of navigating complex environments. However, existing evaluations primarily adopt a deductive paradigm, where agents execute tasks based on explicitly provided rules and static goals, often within limited planning horizons. Crucially, this neglects the inductive necessity for agents to autonomously discover latent transition laws from experience, which is the cornerstone for enabling agentic foresight and sustaining strategic coherence. To bridge this gap, we introduce OdysseyArena, which re-centers agent evaluation on long-horizon, active, and inductive interactions. We formalize and instantiate four primitives, translating abstract transition dynamics into concrete interactive environments. Building upon this, we establish OdysseyArena-Lite for standardized benchmarking, providing a set of 120 tasks to measure an agent's inductive efficiency and long-horizon discovery. Pushing further, we introduce OdysseyArena-Challenge to stress-test agent stability across extreme interaction horizons (e.g., > 200 steps). Extensive experiments on 15+ leading LLMs reveal that even frontier models exhibit clear deficiencies in inductive scenarios, exposing a critical bottleneck on the path to autonomous discovery in complex environments. Our code and data are available at https://github.com/xufangzhi/Odyssey-Arena
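The abstract describes the evaluation protocol only at a high level. As a minimal sketch, assuming a hypothetical Gym-style interface (the Env/Agent protocols, run_episode, and every name below are illustrative, not the benchmark's actual API), a long-horizon inductive rollout could look like this:

```python
# Minimal sketch of a long-horizon inductive evaluation loop.
# NOTE: the Env/Agent interfaces and all names here are hypothetical
# illustrations, not the benchmark's actual API.
from typing import Any, Protocol

class Env(Protocol):
    def reset(self) -> Any: ...
    def step(self, action: Any) -> tuple[Any, bool]: ...  # (next_obs, done)

class Agent(Protocol):
    def act(self, obs: Any, history: list) -> Any: ...

def run_episode(env: Env, agent: Agent, max_steps: int = 200) -> tuple[int, list]:
    """Roll out one episode. The agent receives no explicit rules: it must
    actively probe the environment and induce the latent transition law
    from its own interaction history."""
    history: list = []           # (obs, action, next_obs) triples
    obs = env.reset()
    steps = 0
    for steps in range(1, max_steps + 1):
        # Condition on the full history, so "inductive efficiency" becomes
        # simply how few steps the agent needs to infer the hidden law.
        action = agent.act(obs, history)
        next_obs, done = env.step(action)
        history.append((obs, action, next_obs))
        obs = next_obs
        if done:                 # hidden law identified / goal reached
            break
    return steps, history
```

The returned step count is one plausible proxy for the paper's notion of inductive efficiency: an agent that infers the hidden transition law sooner terminates in fewer steps.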
Related papers
- WaterVideoQA: ASV-Centric Perception and Rule-Compliant Reasoning via Multi-Modal Agents [23.828845891763617]
We present WaterVideoQA, the first large-scale, comprehensive Video Question Answering benchmark specifically engineered for all-waterway environments. We also introduce NaviMind, a pioneering multi-agent neuro-symbolic system designed for open-ended maritime reasoning.
arXiv Detail & Related papers (2026-02-26T12:12:40Z)
- AgentLongBench: A Controllable Long Benchmark For Long-Contexts Agents via Environment Rollouts [78.33143446024485]
We introduce AgentLongBench, which evaluates agents through simulated environment rollouts based on Lateral Thinking Puzzles. This framework generates rigorous interaction trajectories across knowledge-intensive and knowledge-free scenarios.
arXiv Detail & Related papers (2026-01-28T16:05:44Z)
- The Why Behind the Action: Unveiling Internal Drivers via Agentic Attribution [63.61358761489141]
Large Language Model (LLM)-based agents are widely used in real-world applications such as customer service, web navigation, and software engineering. We propose a novel framework for general agentic attribution, designed to identify the internal factors driving agent actions regardless of the task outcome. We validate our framework across a diverse suite of agentic scenarios, including standard tool use and subtle reliability risks like memory-induced bias.
arXiv Detail & Related papers (2026-01-21T15:22:21Z)
- Toward Ultra-Long-Horizon Agentic Science: Cognitive Accumulation for Machine Learning Engineering [59.18634614089481]
We present ML-Master 2.0, an autonomous agent that masters ultra-long-horizon machine learning engineering (MLE). By reframing context management as a process of cognitive accumulation, our approach introduces Hierarchical Cognitive Caching (HCC), which allows agents to decouple immediate execution from long-term experimental strategy. In evaluations on OpenAI's MLE-Bench under 24-hour budgets, ML-Master 2.0 achieves a state-of-the-art medal rate of 56.44%.
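The summary names Hierarchical Cognitive Caching but gives no implementation details. The tiered-context idea it describes (fine-grained execution traces periodically distilled into a compact, durable strategy layer) can be sketched roughly as follows; every class and method name here is hypothetical, not the paper's actual design:

```python
# Rough sketch of tiered context management in the spirit of
# "cognitive accumulation". All names are hypothetical; the paper's
# actual HCC design is not specified in this summary.
class HierarchicalCache:
    def __init__(self, summarize, working_limit=20):
        self.working = []           # fine-grained recent execution steps
        self.strategy = []          # compact long-term lessons/plans
        self.summarize = summarize  # e.g. an LLM call that distills steps
        self.working_limit = working_limit

    def record(self, step_trace):
        """Add one execution step; spill to the strategy tier when full."""
        self.working.append(step_trace)
        if len(self.working) > self.working_limit:
            # Distill overflowing detail into a durable strategy note,
            # decoupling immediate execution from long-term planning.
            self.strategy.append(self.summarize(list(self.working)))
            self.working.clear()

    def context(self):
        """Prompt context = durable strategy + recent raw detail."""
        return {"strategy": self.strategy, "recent": self.working}

# Example usage with a trivial stand-in summarizer:
# cache = HierarchicalCache(lambda steps: f"{len(steps)} steps distilled")
```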
arXiv Detail & Related papers (2026-01-15T13:52:04Z)
- UltraHorizon: Benchmarking Agent Capabilities in Ultra Long-Horizon Scenarios [63.67884284105684]
We introduce UltraHorizon, a novel benchmark that measures the foundational capabilities essential for complex real-world challenges. Agents are placed in long-horizon discovery tasks where they must iteratively uncover hidden rules. Our experiments reveal that LLM agents consistently underperform in these settings, whereas human participants achieve higher scores.
arXiv Detail & Related papers (2025-09-26T02:04:00Z)
- STARec: An Efficient Agent Framework for Recommender Systems via Autonomous Deliberate Reasoning [54.28691219536054]
We introduce STARec, a slow-thinking augmented agent framework that endows recommender systems with autonomous deliberative reasoning capabilities. We develop anchored reinforcement training, a two-stage paradigm combining structured knowledge distillation from advanced reasoning models with preference-aligned reward shaping. Experiments on the MovieLens 1M and Amazon CDs benchmarks demonstrate that STARec achieves substantial performance gains compared with state-of-the-art baselines.
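The summary only names the two stages of anchored reinforcement training. A rough, hypothetical rendering of such a pipeline (stage 1: supervised fine-tuning on reasoning traces distilled from a stronger model; stage 2: RL with a preference-shaped reward) might look like the sketch below; the interfaces are illustrative, not STARec's actual code:

```python
# Hypothetical sketch of a two-stage training pipeline in the spirit of
# "anchored reinforcement training". All interfaces are illustrative;
# the paper's actual method is not specified in this summary.
from typing import Any, Protocol

class TrainableModel(Protocol):
    def nll(self, prompt: str, target: str) -> Any: ...     # supervised loss
    def update(self, loss: Any) -> None: ...                # gradient step
    def reinforce(self, prompt: str, response: str, reward: float) -> None: ...

def stage1_distill(student: TrainableModel, teacher_traces):
    """Stage 1: supervised fine-tuning on reasoning traces distilled from
    a stronger model, anchoring the agent's deliberate-reasoning style."""
    for prompt, trace in teacher_traces:
        student.update(student.nll(prompt, trace))

def stage2_shape(policy: TrainableModel, rollouts, pref_score, beta=0.1):
    """Stage 2: RL where the base reward is shaped by a preference model
    so updates stay aligned with user preferences, not just clicks."""
    for prompt, response, base_reward in rollouts:
        shaped = base_reward + beta * pref_score(prompt, response)
        policy.reinforce(prompt, response, shaped)
```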
arXiv Detail & Related papers (2025-08-26T08:47:58Z)
- OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows [10.318744035680398]
Large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon reasoning. OdysseyBench is a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.
arXiv Detail & Related papers (2025-08-12T17:53:03Z)