Fugu-MT Paper Translation (Abstract): WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

Paper overview: WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

arxiv url: http://arxiv.org/abs/2605.10912v1
Date: Mon, 11 May 2026 17:49:43 GMT
Status: Translation complete
Last updated in system: 2026-05-12 23:28:51.05061
Title: WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Title (reference translation): WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Authors: Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Yang JingYi, Penghui Yang, Zhixiong Zhang, Xilin Wei, Xinyu Fang, Yubo Ma, Haodong Duan, Jing Shao, Jiaqi Wang, Dahua Lin, Kai Chen, Yuhang Zang
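Abstract summary: This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification.
Reference score (attention score computed independently by this site): 88.10947115397971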
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large language and vision-language models increasingly power agents that act on a user's behalf through command-line interface (CLI) harnesses. However, most agent benchmarks still rely on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, leaving open whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and over 20 tool calls, and runs inside a reproducible Docker container hosting an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, while every other model stays below 60%, and switching harness alone shifts a single model by up to 18 points. These results show that long-horizon, native-runtime agent evaluation remains a far-from-resolved task for current frontier models. We release the tasks, code, and containerized tooling to support reproducible evaluation.
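Abstract (reference translation): Large language models and vision-language models increasingly power agents that act on behalf of users through command-line interface (CLI) harnesses. However, most agent benchmarks still depend on synthetic sandboxes, short-horizon tasks, mock-service APIs, and final-answer checks, which leaves open the question of whether agents can complete realistic long-horizon work in the runtimes where they are deployed. This work presents WildClawBench, a native-runtime benchmark of 60 human-authored, bilingual, multimodal tasks spanning six thematic categories. Each task averages roughly 8 minutes of wall-clock time and more than 20 tool calls, and runs inside a reproducible Docker container that hosts an actual CLI agent harness (OpenClaw, Claude Code, Codex, or Hermes Agent) with access to real tools rather than mock services. Grading is hybrid, combining deterministic rule-based checks, environment-state auditing of side effects, and an LLM/VLM judge for semantic verification. Across 19 frontier models, the best, Claude Opus 4.7, reaches only 62.2% overall under OpenClaw, every other model stays below 60%, and switching the harness alone shifts a single model's score by up to 18 points. These results indicate that long-horizon, native-runtime agent evaluation remains an unresolved challenge for current frontier models. The tasks, code, and containerized tooling are released to support reproducible evaluation.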

Related papers