Fugu-MT Paper Translation (Abstract): ClawBench: Can AI Agents Complete Everyday Online Tasks?

Paper overview: ClawBench: Can AI Agents Complete Everyday Online Tasks?

arxiv url: http://arxiv.org/abs/2604.08523v1
Date: Thu, 09 Apr 2026 17:57:13 GMT
Status: Translation complete
Last updated in system: 2026-04-10 18:34:06.063816
Title: ClawBench: Can AI Agents Complete Everyday Online Tasks?
Title (reference translation): ClawBench: Can AI Agents Complete Everyday Online Tasks?
Authors: Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen
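Abstract summary: ClawBench is an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work. ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects.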
Reference score (independently computed attention level): 50.958690494341106
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.
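Abstract (reference translation): AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final request, ensuring safe evaluation without real-world side effects. Evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.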


Related papers