Fugu-MT Paper Translation (Abstract): What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

Paper overview: What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design

arxiv url: http://arxiv.org/abs/2604.28093v1
Date: Thu, 30 Apr 2026 16:37:37 GMT
Status: Translation complete
Last updated in system: 2026-05-01 16:31:54.202233
Title: What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Title (reference translation): What Makes a Good Terminal-Agent Benchmark Task: A Guideline for Adversarial, Difficult, and Legible Evaluation Design
Authors: Ivan Bercovich
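Abstract summary: This paper is a guideline for writing good benchmark tasks for Terminal Bench. We argue that good tasks are adversarial, difficult, and legible. We discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable.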
Reference score (independently computed attention score): 1.8420149175440346
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments grows, so does the pressure to ship tasks quickly, often without thorough adversarial review of the verification logic. This paper is a guideline for writing good benchmark tasks, drawn from over a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the way they write prompts. They shouldn't. A prompt is designed to help the agent succeed; a benchmark is designed to find out if it can. We argue that good tasks are adversarial, difficult, and legible, and that a large class of common failure modes -- AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments -- are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this serves as a useful reference for benchmark maintainers, task contributors, and researchers using benchmark scores as evidence.
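Abstract (reference translation): Terminal-agent benchmarks have become a primary signal for measuring the coding and system-administration capabilities of large language models. As the market for evaluation environments expands, so does the pressure to ship tasks quickly without thoroughly vetting the verification logic. This paper is a guideline for writing good benchmark tasks, based on more than a year of contributing to and reviewing tasks for Terminal Bench. Most people write benchmark tasks the same way they write prompts. They should not. A prompt is designed to help the agent succeed; a benchmark is designed to find out whether it can. We argue that good tasks are adversarial, difficult, and legible, and that many common failure modes (AI-generated instructions, over-prescriptive specifications, clerical difficulty, oracle solutions that assume hidden knowledge, tests that validate the wrong things, and reward-hackable environments) are predictable consequences of treating task authoring as prompt authoring. We catalog these failure modes, argue that real difficulty is conceptual rather than environmental, and discuss recent empirical evidence that over 15% of tasks in popular terminal-agent benchmarks are reward-hackable. We hope this will serve as a useful reference for benchmark maintainers, task contributors, and researchers who use benchmark scores as evidence.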

