Fugu-MT Paper Translation (Summary): Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

Paper summary: Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks

arxiv url: http://arxiv.org/abs/2604.24964v1
Date: Mon, 27 Apr 2026 20:05:41 GMT
Status: Translation complete
Last updated in system: 2026-04-29 16:49:17.582775
Title: Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
Title (reference translation): Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
Authors: Lawrence Keunho Jang, Jing Yu Koh, Daniel Fried, Ruslan Salakhutdinov
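Abstract summary: We introduce Odysseys, a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. The strongest models reach a success rate of 44.5%, leaving substantial room for future improvement.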
Reference score (independently computed attention score): 67.44219836008348
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing web agent benchmarks have largely converged on short, single-site tasks that frontier models are approaching saturation on. However, real-world web use consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We demonstrate that this yields higher agreement with humans and provides a more fine-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We test several leading frontier models and find that the strongest models achieve a success rate of 44.5%, which leaves substantial room for future improvements. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, marking an evident need for agents that can succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark to measure progress towards computer-use agents that can operate productively for hours. We release our tasks, evaluation scripts, and other results at https://odysseys-website.pages.dev
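Abstract (reference translation): Existing web agent benchmarks have largely converged on short, single-site tasks on which frontier models are approaching saturation. Real-world web use, however, consists of long-horizon, multi-site workflows. Common web navigation tasks, such as comparing products across different domains, planning trips across multiple services, or summarizing information from multiple search queries, require sustained context and cross-site reasoning over potentially hours of browsing. To capture and evaluate such behaviors, we introduce Odysseys: a benchmark of 200 long-horizon web tasks derived from real-world browsing sessions and evaluated on the live Internet. We find that binary pass/fail evaluation is inadequate for long-horizon settings and introduce a rubric-based evaluation, annotating each Odysseys task with an average of 6.1 graded rubrics. We show that this approach yields higher agreement with humans and provides a finer-grained signal than commonly used trajectory-level LLM-as-a-judge evaluation metrics. We test several leading frontier models and find that the strongest models achieve a success rate of 44.5%, leaving substantial room for future improvement. Beyond task success, we argue that efficiency is a first-class concern for long-horizon agents. We introduce a Trajectory Efficiency metric (rubric score per step) and find that even frontier agents achieve only 1.15%, showing a clear need for agents that succeed efficiently and not simply eventually. Odysseys isolates the critical evaluation of long-horizon proficiency in open-web environments, providing a realistic benchmark for measuring progress towards computer-use agents that can operate productively for hours. Tasks, evaluation scripts, and other results are released at https://odysseys-website.pages.dev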
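The abstract describes Trajectory Efficiency only as "rubric score per step", so the following is a minimal sketch under stated assumptions, not the authors' evaluation script. It assumes each rubric is judged as simply satisfied or not and that the rubric score is the fraction satisfied; the paper's rubrics are graded and may carry partial credit or weights. The names `RubricItem`, `rubric_score`, and `trajectory_efficiency` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RubricItem:
    """One rubric for a task (hypothetical structure, not from the paper)."""
    description: str
    satisfied: bool


def rubric_score(items: list[RubricItem]) -> float:
    """Fraction of rubrics satisfied (assumed scoring rule)."""
    if not items:
        raise ValueError("a task needs at least one rubric")
    return sum(item.satisfied for item in items) / len(items)


def trajectory_efficiency(score: float, num_steps: int) -> float:
    """Rubric score divided by the number of agent steps taken."""
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return score / num_steps


# Illustrative values only: 3 of 6 rubrics met over a 40-step trajectory.
items = [RubricItem(f"rubric {i}", satisfied=(i < 3)) for i in range(6)]
score = rubric_score(items)
efficiency = trajectory_efficiency(score, num_steps=40)
print(f"rubric score = {score:.2f}, efficiency = {efficiency:.2%} per step")
```

Under this reading, an agent that needs many steps to satisfy the same rubrics scores lower, which is the behavior the metric is meant to penalize.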


Related papers