Fugu-MT Paper Translation (Abstract): OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

Paper summary: OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

arxiv url: http://arxiv.org/abs/2508.09124v1
Date: Tue, 12 Aug 2025 17:53:03 GMT
Status: Translation complete
System update date: 2025-08-13 21:07:34.534643
Title: OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows
Title (reference translation): OdysseyBench: Evaluation of LLM Agents on Long-Horizon Composite Office Application Workflows
Authors: Weixuan Wang, Dongge Han, Daniel Madrigal Diaz, Jin Xu, Victor Rühle, Saravan Rajmohan
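Abstract summary: Large language models (LLMs) are increasingly deployed in real-world applications that require complex, long-horizon reasoning. OdysseyBench is a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications. To enable scalable benchmark creation, the authors propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.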
Reference score (independently computed attention measure): 10.318744035680398
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications requiring complex, long-horizon workflows. However, existing benchmarks predominantly focus on atomic tasks that are self-contained and independent, failing to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. Our benchmark comprises two complementary splits: OdysseyBench+ with 300 tasks derived from real-world use cases, and OdysseyBench-Neo with 302 newly synthesized complex tasks. Each task requires the agent to identify essential information from long-horizon interaction histories and perform multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks through systematic environment exploration, task generation, and dialogue synthesis. Our extensive evaluation demonstrates that OdysseyBench effectively challenges state-of-the-art LLM agents, providing more accurate assessment of their capabilities in complex, real-world contexts compared to existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.
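Abstract (reference translation): Autonomous agents powered by large language models (LLMs) are increasingly deployed in real-world applications that require complex, long-horizon workflows. However, existing benchmarks focus mainly on self-contained, independent atomic tasks and fail to capture the long-term contextual dependencies and multi-interaction coordination required in realistic scenarios. To address this gap, we introduce OdysseyBench, a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across diverse office applications including Word, Excel, PDF, Email, and Calendar. OdysseyBench+ contains 300 tasks derived from real-world use cases, and OdysseyBench-Neo contains 302 newly synthesized complex tasks. Each task requires identifying essential information from long-horizon interaction histories and performing multi-step reasoning across various applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework. We demonstrate that OdysseyBench effectively challenges state-of-the-art LLM agents and assesses their capabilities in complex, real-world contexts more accurately than existing atomic task benchmarks. We believe that OdysseyBench will serve as a valuable resource for advancing the development and evaluation of LLM agents in real-world productivity scenarios. In addition, we release OdysseyBench and HomerAgents to foster research along this line.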

Related papers