Fugu-MT paper translation (abstract): TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

Paper abstract: TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks

arxiv url: http://arxiv.org/abs/2605.22535v1
Date: Thu, 21 May 2026 14:24:43 GMT
Status: Translation complete
System update date: 2026-05-22 16:35:42.294853
Title: TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
Title (reference translation): TerminalWorld: Benchmarking Agents on Real-World Terminal Tasks
Authors: Zhaoyang Chu, Jiarui Hu, Xingyu Jiang, Pengyu Zou, Han Li, Chao Peng, Peter O'Hearn, Earl T. Barr, Mark Harman, Federica Sarro, He Ye
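Abstract summary: TerminalWorld is a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. The engine yields a full benchmark of 1,530 validated tasks spanning 18 real-world categories. TerminalWorld captures real-world terminal capabilities distinct from those measured by existing expert-curated benchmarks.
Reference score (independently computed attention score): 23.863417507169697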
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from "in-the-wild" terminal recordings. Processing 80,870 terminal recordings, the engine yields a full benchmark of 1,530 validated tasks, spanning 18 real-world categories, ranging from short everyday operations to workflows exceeding 50 steps, and covering 1,280 unique commands. From these, we curate a Verified subset of 200 representative, manually reviewed tasks. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents reveals that current systems still struggle with authentic terminal workflows, achieving a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), with only a weak correlation to their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, enabling it to evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.
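Abstract (reference translation): We introduce TerminalWorld, a scalable data engine that automatically reverse-engineers high-fidelity evaluation tasks from terminal recordings. Processing 80,870 terminal recordings, it yields a full benchmark of 1,530 validated tasks spanning 18 real-world categories. From these, a Verified subset of 200 representative tasks is curated by manual review. Comprehensive benchmarking on TerminalWorld-Verified across eight frontier models and six agents shows that current systems still struggle with authentic terminal workflows, with a maximum pass rate of only 62.5%. Moreover, TerminalWorld captures real-world terminal capabilities distinct from existing expert-curated benchmarks (e.g., Terminal-Bench), correlating only weakly with their scores (Pearson r=0.20). The automated engine makes TerminalWorld authentic and scalable by construction, so it can evaluate agents in real-world terminal environments as developer practices evolve. Data and code are available at https://github.com/EuniAI/TerminalWorld.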


Related papers