Fugu-MT Paper Translation (Abstract): Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

Paper abstract: Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows

arxiv url: http://arxiv.org/abs/2605.27922v1
Date: Wed, 27 May 2026 03:47:35 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.732667
Title: Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
Title (reference translation): Harness-Bench: Measuring Harness Effects across Models in Realistic Agent Workflows
Authors: Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li, Zhengyang Wang, Wenhan Yu, Zhewen Tan, Yuxuan Tian, Guangxiang Zhao, Lin Sun, Xiangzheng Zhang, Tong Yang
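Abstract summary: This paper introduces Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns. Across 5,194 execution trajectories, the authors observe substantial variation in completion, process quality, efficiency, and failure behavior.
Reference score (independently computed attention measure): 18.6534256358905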
License: http://creativecommons.org/licenses/by/4.0/
Abstract: LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model, but also on the harness: the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, making execution-layer variation difficult to study. We introduce Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, we observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. Our analysis further identifies recurring execution-alignment failures, where plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.
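Abstract (reference translation): LLM agents are increasingly deployed as executable systems that use tools, modify workspaces, and produce concrete artifacts. In such workflows, performance depends not only on the base model but also on the harness, the system layer that manages context, tools, state, constraints, permissions, tracing, and recovery. However, existing benchmarks typically abstract away execution, compare complete agent systems, or hold the harness fixed, which makes differences at the execution layer hard to study. This paper introduces Harness-Bench, a diagnostic benchmark for evaluating configuration-level harness effects in realistic agent workflows. Harness-Bench evaluates representative harness configurations across multiple model backends under shared task environments, budgets, and evaluation protocols, while preserving each harness's native execution behavior. The benchmark contains 106 sandboxed offline tasks constructed from practical agent-use patterns and manually reviewed for realism, solvability, oracle-checkability, and integrity. Each run records final artifacts, execution traces, usage statistics, and validator outputs, enabling analysis beyond final completion. Across 5,194 execution trajectories, the authors observe substantial variation in completion, process quality, efficiency, and failure behavior across model-harness pairings. These results suggest that agent capability should be reported at the model-harness configuration level rather than attributed to the base model alone. The analysis also identifies recurring execution-alignment failures, in which plausible reasoning becomes decoupled from tool feedback, workspace state, evidence, or verifiable output contracts. Harness-Bench provides a reproducible foundation for diagnosing and improving reliable, efficient, and auditable agent execution stacks.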
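The abstract's recommendation to report capability per model-harness configuration can be illustrated with a small aggregation sketch. This is a minimal example under stated assumptions, not the benchmark's actual schema or tooling: the `RunRecord` fields, the `summarize_by_configuration` helper, and the model and harness names are all hypothetical, and the records are placeholders rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    """One execution trajectory. All field names are hypothetical."""
    model: str
    harness: str
    task_id: str
    completed: bool   # assumed to be the validator's verdict on the final artifacts
    tool_calls: int   # a simple stand-in for usage statistics


def summarize_by_configuration(runs):
    """Group runs by (model, harness) and compute per-configuration statistics."""
    grouped = defaultdict(list)
    for run in runs:
        grouped[(run.model, run.harness)].append(run)

    summary = {}
    for config, group in grouped.items():
        n = len(group)
        summary[config] = {
            "runs": n,
            "completion_rate": sum(r.completed for r in group) / n,
            "mean_tool_calls": sum(r.tool_calls for r in group) / n,
        }
    return summary


# Placeholder records for illustration only; these are not data from the paper.
toy_runs = [
    RunRecord("model-a", "harness-x", "t1", True, 4),
    RunRecord("model-a", "harness-x", "t2", False, 10),
    RunRecord("model-a", "harness-y", "t1", True, 6),
    RunRecord("model-a", "harness-y", "t2", True, 8),
]

for config, stats in sorted(summarize_by_configuration(toy_runs).items()):
    print(config, stats)
```

Keying the summary on the (model, harness) pair rather than on the model alone keeps the same base model under two harnesses as two separate rows, which is the reporting granularity the abstract argues for.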

Related papers