Fugu-MT Paper Translation (Summary): PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training

Paper summary: PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training

arxiv url: http://arxiv.org/abs/2509.06053v1
Date: Sun, 07 Sep 2025 13:33:31 GMT
Status: Translation complete
Last updated in system: 2025-09-09 14:07:03.836799
Title: PolicyEvolve: Evolving Programmatic Policies by LLMs for multi-player games via Population-Based Training
Authors: Mingrui Lv, Hangzhi Liu, Zhi Luo, Hongjie Zhang, Jie Ou
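Abstract summary: PolicyEvolve is a framework for generating programmatic policies in multi-player games. It reduces reliance on manually crafted policy code and achieves high-performance policies with minimal environmental interaction. Its Policy Planner samples the top three policies from the Global Pool, generates an initial policy for the current iteration based on environmental information, and refines that policy using feedback from the Trajectory Critic.
Reference score (independently computed attention score): 4.5232365105005155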
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Multi-agent reinforcement learning (MARL) has achieved significant progress in solving complex multi-player games through self-play. However, training effective adversarial policies requires millions of experience samples and substantial computational resources. Moreover, these policies lack interpretability, hindering their practical deployment. Recently, researchers have successfully leveraged Large Language Models (LLMs) to generate programmatic policies for single-agent tasks, transforming neural network-based policies into interpretable rule-based code with high execution efficiency. Inspired by this, we propose PolicyEvolve, a general framework for generating programmatic policies in multi-player games. PolicyEvolve significantly reduces reliance on manually crafted policy code, achieving high-performance policies with minimal environmental interactions. The framework comprises four modules: Global Pool, Local Pool, Policy Planner, and Trajectory Critic. The Global Pool preserves elite policies accumulated during iterative training. The Local Pool stores temporary policies for the current iteration; only sufficiently high-performing policies from this pool are promoted to the Global Pool. The Policy Planner serves as the core policy generation module. It samples the top three policies from the Global Pool, generates an initial policy for the current iteration based on environmental information, and refines this policy using feedback from the Trajectory Critic. Refined policies are then deposited into the Local Pool. This iterative process continues until the policy achieves a sufficiently high average win rate against the Global Pool, at which point it is integrated into the Global Pool. The Trajectory Critic analyzes interaction data from the current policy, identifies vulnerabilities, and proposes directional improvements to guide the Policy Planner.
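
The loop described in the abstract can be sketched roughly as follows. This is a toy illustration under heavy assumptions, not the authors' implementation: the LLM-driven Policy Planner and Trajectory Critic are replaced by simple numeric stubs, a policy is reduced to a hypothetical `skill` number rather than program code, and the names `evolve`, `planner_initial`, `planner_refine`, `critic_feedback`, `win_rate_vs_pool`, the promotion `threshold`, and the ranking of the Global Pool by stored win rate are all invented for this example.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Policy:
    name: str
    skill: float            # hypothetical scalar standing in for generated policy code
    win_rate: float = 0.0   # average win rate measured when the policy was promoted


def win_rate_vs_pool(policy: Policy, pool: List[Policy], rng: random.Random,
                     games: int = 40) -> float:
    """Toy evaluation: win probability is a logistic function of the skill gap."""
    wins = 0
    for opponent in pool:
        p_win = 1.0 / (1.0 + 10 ** (opponent.skill - policy.skill))
        wins += sum(rng.random() < p_win for _ in range(games))
    return wins / (games * len(pool))


def planner_initial(top: List[Policy], iteration: int) -> Policy:
    """Stub for the Policy Planner drafting an initial policy from the top policies."""
    base = sum(p.skill for p in top) / len(top)
    return Policy(f"iter{iteration}-v0", base)


def critic_feedback(win_rate: float, threshold: float) -> float:
    """Stub for the Trajectory Critic: returns a step size instead of a text critique."""
    return 0.1 + max(0.0, threshold - win_rate)


def planner_refine(policy: Policy, feedback: float, version: int) -> Policy:
    """Stub for the Policy Planner refining a policy using critic feedback."""
    prefix = policy.name.rsplit("-v", 1)[0]
    return Policy(f"{prefix}-v{version}", policy.skill + feedback)


def evolve(iterations: int = 3, threshold: float = 0.6,
           max_refinements: int = 20, seed: int = 0) -> List[Policy]:
    rng = random.Random(seed)
    global_pool = [Policy("seed", 0.0)]      # elite policies kept across iterations
    for it in range(iterations):
        local_pool: List[Policy] = []        # temporary policies for this iteration
        # Assumption: "top three" means highest stored win rate.
        top = sorted(global_pool, key=lambda p: p.win_rate, reverse=True)[:3]
        candidate = planner_initial(top, it)
        for version in range(1, max_refinements + 1):
            local_pool.append(candidate)
            rate = win_rate_vs_pool(candidate, global_pool, rng)
            if rate >= threshold:
                # Good enough against the Global Pool: promote and end the iteration.
                candidate.win_rate = rate
                global_pool.append(candidate)
                break
            feedback = critic_feedback(rate, threshold)
            candidate = planner_refine(candidate, feedback, version)
        # If the refinement budget runs out, nothing is promoted this iteration.
    return global_pool
```

Calling `evolve()` returns the Global Pool, starting with the seed policy and followed by any policy that cleared the threshold. In a real system the stubs would be replaced by LLM prompts and actual game rollouts, and the promotion criterion would follow whatever the paper specifies.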

Related papers