Fugu-MT Paper Translation (Abstract): OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

Paper overview: OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

arxiv url: http://arxiv.org/abs/2605.29253v1
Date: Thu, 28 May 2026 02:15:52 GMT
Status: Translation complete
Last updated in system: 2026-05-30 05:02:24.55606
Title: OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories
Title (reference translation): OpenClawBench: A Benchmark of Process-side Anomalies in Real-world Agent Execution Trajectories
Authors: Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han,
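Abstract summary: We introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy.
Reference score (attention score computed independently by the site): 24.616751291282046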
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
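Abstract (reference translation): Task success can hide process anomalies in real-world agent executions. An agent can pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, the Outcome-Process Gap becomes measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. OpenClawBench turns real agent execution logs into auditable, reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.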
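The Outcome-Process Gap reported in the abstract can be read as a simple ratio: among executions that pass the task oracle, the share still labeled process-anomalous. The sketch below is only an illustration under stated assumptions. The record type `TrajectoryAnnotation`, its field names, and the `AnomalyClass` members are hypothetical and are not taken from the dataset's actual schema; in particular, mapping the five failure modes named in the abstract onto the 5-class taxonomy is a guess. Only the two counts (2,904 and 31,135) come from the abstract.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, Optional, Tuple


class AnomalyClass(Enum):
    # Assumption: these mirror the five failure modes listed in the abstract.
    # The paper's actual taxonomy labels may differ.
    UNRESOLVED_AMBIGUITY = "unresolved_ambiguity"
    UNSAFE_EXTERNAL_WRITE = "unsafe_external_write"
    IGNORED_ERROR = "ignored_error"
    WEAKLY_GROUNDED_COMMITMENT = "weakly_grounded_commitment"
    CAPABILITY_OVERCOMMITMENT = "capability_overcommitment"


@dataclass(frozen=True)
class TrajectoryAnnotation:
    """Hypothetical record holding the supervision fields the abstract names."""
    oracle_passed: bool                      # task-oracle outcome
    is_anomalous: bool                       # binary process-anomaly label
    anomaly_class: Optional[AnomalyClass] = None
    evidence: str = ""                       # supporting evidence text
    span: Optional[Tuple[int, int]] = None   # (onset step, end step)
    severity: Optional[str] = None
    recoverable: Optional[bool] = None


def gap_from_counts(anomalous_passing: int, oracle_passing: int) -> float:
    """Share of oracle-passing executions labeled process-anomalous."""
    if oracle_passing <= 0:
        raise ValueError("need at least one oracle-passing execution")
    return anomalous_passing / oracle_passing


def outcome_process_gap(annotations: Iterable[TrajectoryAnnotation]) -> float:
    """Compute the same ratio from individual annotation records."""
    passing = [a for a in annotations if a.oracle_passed]
    anomalous = sum(1 for a in passing if a.is_anomalous)
    return gap_from_counts(anomalous, len(passing))


# Counts stated in the abstract: 2,904 anomalous among 31,135 oracle-passing.
reported_gap = gap_from_counts(2_904, 31_135)
print(f"Outcome-Process Gap: {reported_gap:.2%}")
```

With the abstract's counts, the ratio comes to roughly 9.3% of oracle-passing executions. Conditioning on oracle-passing runs, rather than dividing by all 31,264 trajectories, matches the way the abstract phrases the result.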

Related papers