Fugu-MT Paper Translation (Abstract): Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

Paper overview: Agentick: A Unified Benchmark for General Sequential Decision-Making Agents

arxiv url: http://arxiv.org/abs/2605.06869v2
Date: Tue, 12 May 2026 18:33:33 GMT
Status: Translation complete
Last updated in system: 2026-05-14 17:13:58.783644
Title: Agentick: A Unified Benchmark for General Sequential Decision-Making Agents
Authors: Roger Creus Castanyer, Pablo Samuel Castro, Glen Berseth
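Abstract summary: Agentick is a benchmark for sequential decision-making agents. It provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities. An evaluation spanning 27 configurations and over 90,000 episodes shows that no single approach dominates.
Reference score (independently computed attention score): 30.028388632526745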
License: http://creativecommons.org/licenses/by/4.0/
Abstract: AI agent research spans a wide spectrum: from RL agents that learn from scratch to foundation model agents that leverage pre-trained knowledge, yet no unified benchmark enables fair comparison across these approaches. We present Agentick, a benchmark for sequential decision-making agents designed to evaluate RL, LLM, VLM, hybrid, and human agents on common ground and to power research on the fundamental challenges of sequential decision-making. Agentick provides 37 procedurally generated tasks across six capability categories, four difficulty levels, and five observation modalities, all exposed through a single Gymnasium-compatible interface. The benchmark ships with a Coding API, oracle reference policies for all tasks, pre-built SFT datasets, a composable agent harness, and a live leaderboard. An evaluation spanning 27 configurations and over 90,000 episodes reveals that no single approach dominates: GPT-5 mini leads overall at 0.309 oracle-normalized score while PPO dominates planning and multi-agent tasks; the reasoning harness multiplies LLM performance by 3-10x; and ASCII observations consistently outperform natural language. These findings highlight the substantial room for improvement that remains across all agent paradigms. Agentick's capability-decomposed, multi-modal design provides the empirical infrastructure needed to drive progress toward general autonomous agents, both as an evaluation framework and as a training ground for RL post-training of foundation models in truly sequential environments.
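The abstract reports results as an "oracle-normalized score" but does not define it here. The sketch below shows one plausible reading, offered as an assumption and not as the paper's definition: the agent's mean episode return is rescaled so that a baseline sits at 0 and the oracle reference policy at 1. The function name `oracle_normalized_score`, the optional `baseline_returns` argument, and the example numbers are all hypothetical.

```python
from statistics import mean


def oracle_normalized_score(agent_returns, oracle_returns, baseline_returns=None):
    """Hypothetical normalization: 0 = baseline level, 1 = oracle level.

    ASSUMPTION: this formula is illustrative; the paper's exact
    definition is not given in the abstract.
    """
    if not agent_returns or not oracle_returns:
        raise ValueError("need at least one episode return per policy")
    # With no baseline supplied, normalize against a return of zero.
    baseline = mean(baseline_returns) if baseline_returns else 0.0
    span = mean(oracle_returns) - baseline
    if span == 0:
        raise ValueError("oracle and baseline returns coincide; score undefined")
    return (mean(agent_returns) - baseline) / span


# Made-up episode returns, purely for illustration (not results from the paper).
score = oracle_normalized_score([2.0, 4.0], [10.0, 10.0])
print(round(score, 3))
```

Under this reading, a score such as 0.309 would mean the agent recovers roughly 31% of the gap between the baseline and the oracle policy, averaged over tasks.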


Related papers