Fugu-MT Paper Translation (Abstract): The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

Paper abstract: The PokeAgent Challenge: Competitive and Long-Context Learning at Scale

arxiv url: http://arxiv.org/abs/2603.15563v1
Date: Mon, 16 Mar 2026 17:25:42 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.690929
Title: The PokeAgent Challenge: Competitive and Long-Context Learning at Scale
Title (reference translation): The PokeAgent Challenge: Competitiveness and Long-Term Learning
Authors: Seth Karten, Jake Grigsby, Tersoo Upaa, Junik Bae, Seonghun Hong, Hyunyoung Jeong, Jaeyoon Jung, Kun Kerdthaisong, Gyungbo Kim, Hyeokgi Kim, Yujin Kim, Eunju Kwon, Dongyu Liu, Patrick Mariglia, Sangyeon Park, Benedikt Schink, Xianwei Shi, Anthony Sistilli, Joseph Twin, Arian Urdu, Matin Urdu, Qiao Wang, Ling Wu, Wenli Zhang, Kunsheng Zhou, Stephanie Milani, Kiran Vodrahalli, Amy Zhang, Fei Fang, Yuke Zhu, Chi Jin
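Abstract summary: The PokeAgent Challenge is a large-scale benchmark for decision-making research, built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon.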
Reference score (independently computed attention measure): 45.224407977351824
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: We present the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and expansive role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: our Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and our Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. Our Battling Track supplies a dataset of 20M+ battle trajectories alongside a suite of heuristic, RL, and LLM-based baselines capable of high-level competitive play. Our Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, including an open-source multi-agent orchestration system for modular, reproducible comparisons of harness-based LLM approaches. Our NeurIPS 2025 competition validates both the quality of our resources and the research community's interest in Pokemon, with over 100 teams competing across both tracks and winning solutions detailed in our paper. Participant submissions and our baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. We transition to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.
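Abstract (reference translation): This paper presents the PokeAgent Challenge, a large-scale benchmark for decision-making research built on Pokemon's multi-agent battle system and role-playing game (RPG) environment. Partial observability, game-theoretic reasoning, and long-horizon planning remain open problems for frontier AI, yet few benchmarks stress all three simultaneously under realistic conditions. PokeAgent targets these limitations at scale through two complementary tracks: the Battling Track, which calls for strategic reasoning and generalization under partial observability in competitive Pokemon battles, and the Speedrunning Track, which requires long-horizon planning and sequential decision-making in the Pokemon RPG. The Battling Track supplies a dataset of 20M+ battle trajectories alongside heuristic, RL, and LLM-based baselines capable of high-level competitive play. The Speedrunning Track provides the first standardized evaluation framework for RPG speedrunning, with an open-source multi-agent orchestration system for reproducible comparisons of harness-based LLM approaches. The NeurIPS 2025 competition validates both the quality of the resources and the research community's interest in Pokemon. Participant submissions and the baselines reveal considerable gaps between generalist (LLM), specialist (RL), and elite human performance. Analysis against the BenchPress evaluation matrix shows that Pokemon battling is nearly orthogonal to standard LLM benchmarks, measuring capabilities not captured by existing suites and positioning Pokemon as an unsolved benchmark that can drive RL and LLM research forward. The benchmark transitions to a living benchmark with a live leaderboard for Battling and self-contained evaluation for Speedrunning at https://pokeagentchallenge.com.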

Related papers