Fugu-MT Paper Translation (Summary): ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

Paper Summary: ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

arxiv url: http://arxiv.org/abs/2604.18543v2
Date: Tue, 28 Apr 2026 07:46:28 GMT
Status: Translation complete
Last updated in system: 2026-04-29 14:06:43.777058
Title: ClawEnvKit: Automatic Environment Generation for Claw-Like Agents
Authors: Xirui Li, Ming Li, Derry Xu, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou
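Abstract summary: We introduce ClawEnvKit, an autonomous generation pipeline that produces verified environments on demand. ClawEnvKit consists of (1) a parser that extracts structured generation parameters from natural language input, (2) a generator that produces the task specification, tool interface, and scoring configuration, and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency. Using it, we construct Auto-ClawEval, a large-scale benchmark for claw-like agents covering 1,040 environments across 24 categories.
Reference score (attention score computed independently by this site): 80.4926318403362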
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.
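The three-module structure named in the abstract (parser, generator, validator) can be pictured with a minimal sketch. Everything below is an illustrative assumption, not ClawEnvKit's actual code or API: the class names, fields, the toy parsing convention, and the specific checks are hypothetical. The validator here covers only structural validity, internal consistency, and diversity; feasibility checking is left out.

```python
from dataclasses import dataclass


@dataclass
class GenerationParams:
    """Structured parameters extracted from a request (hypothetical fields)."""
    category: str
    num_environments: int


@dataclass
class Environment:
    """One generated environment: task spec, tool interface, scoring config."""
    task_spec: str
    tools: list[str]
    scoring: dict[str, float]


def parse_request(text: str) -> GenerationParams:
    """Toy parser. Assumed convention: '<N> <category> environments'."""
    words = text.lower().split()
    # First integer token is the count; default to 1 if none is given.
    count = next((int(w) for w in words if w.isdigit()), 1)
    # The category is the word right before "environments" (else the last word).
    idx = words.index("environments") if "environments" in words else len(words)
    category = words[idx - 1] if idx > 0 else "general"
    return GenerationParams(category, count)


def generate(params: GenerationParams) -> list[Environment]:
    """Toy generator: emits placeholder task specs, tools, and scoring weights."""
    return [
        Environment(
            task_spec=f"{params.category} task #{i}",
            tools=[f"{params.category}_tool"],
            scoring={"completion": 1.0},
        )
        for i in range(params.num_environments)
    ]


def validate(envs: list[Environment]) -> list[Environment]:
    """Toy validator: keeps environments that pass three simple checks."""
    seen: set[str] = set()
    kept: list[Environment] = []
    for env in envs:
        # Structural validity: every component is present.
        structurally_valid = bool(env.task_spec) and bool(env.tools) and bool(env.scoring)
        # Internal consistency: scoring weights sum to 1.
        consistent = abs(sum(env.scoring.values()) - 1.0) < 1e-9
        # Diversity: no duplicate task specifications.
        diverse = env.task_spec not in seen
        if structurally_valid and consistent and diverse:
            seen.add(env.task_spec)
            kept.append(env)
    return kept


def run_pipeline(text: str) -> list[Environment]:
    """Chain the three stages: parse, generate, validate."""
    return validate(generate(parse_request(text)))
```

Calling `run_pipeline("generate 3 calendar environments")` under these assumptions returns three environments that all pass the toy checks. In a real system, the parser and generator would presumably be model-driven rather than rule-based.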
