Fugu-MT Paper Translation (Abstract): ClawArena: Benchmarking AI Agents in Evolving Information Environments

Paper summary: ClawArena: Benchmarking AI Agents in Evolving Information Environments

arxiv url: http://arxiv.org/abs/2604.04202v1
Date: Sun, 05 Apr 2026 17:55:23 GMT
Status: Translation complete
Last updated in system: 2026-04-07 15:49:18.979757
Title: ClawArena: Benchmarking AI Agents in Evolving Information Environments
Authors: Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
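Abstract summary: ClawArena is a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization.
Reference score (attention score computed independently by this site): 61.664633997138004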
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.
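
To make the set-selection question format concrete, the sketch below scores a single multi-choice round whose answer is a set of option labels. The abstract does not state the scoring rule, so both the exact-set match and the Jaccard partial-credit variant are assumptions, and the function names `score_exact` and `score_jaccard` are hypothetical rather than taken from the ClawArena code.

```python
def score_exact(selected, correct):
    """Return 1.0 only if the selected option set equals the correct set."""
    return 1.0 if set(selected) == set(correct) else 0.0


def score_jaccard(selected, correct):
    """Partial credit: size of the intersection over size of the union."""
    sel, cor = set(selected), set(correct)
    if not sel and not cor:
        return 1.0  # both sets empty: treated here as full agreement
    return len(sel & cor) / len(sel | cor)


if __name__ == "__main__":
    # Hypothetical round: options A and C are correct.
    print(score_exact(["C", "A"], ["A", "C"]))    # order does not matter
    print(score_jaccard(["A", "B"], ["A", "C"]))  # one shared option out of three
```

Exact match penalizes any missing or extra option, which suits questions where a partially revised belief should count as wrong; the Jaccard variant is shown only as a contrast.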


Related papers