Fugu-MT paper translation (abstract): Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

Paper overview: Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning

arxiv url: http://arxiv.org/abs/2604.02353v1
Date: Wed, 04 Mar 2026 22:51:22 GMT
Status: translation complete
Last updated in system: 2026-04-19 19:09:11.331496
Title: Prism: Policy Reuse via Interpretable Strategy Mapping in Reinforcement Learning
Authors: Thomas Pravetz
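Abstract summary: PRISM is a framework that grounds reinforcement learning agents' decisions in discrete, causally validated concepts. PRISM clusters each agent's encoder features into $K$ concepts via K-means. Because the concepts causally encode strategy, aligning them via optimal bipartite matching transfers strategic knowledge zero-shot.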
Reference score (attention score computed by this site): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present PRISM (Policy Reuse via Interpretable Strategy Mapping), a framework that grounds reinforcement learning agents' decisions in discrete, causally validated concepts and uses those concepts as a zero-shot transfer interface between agents trained with different algorithms. PRISM clusters each agent's encoder features into $K$ concepts via K-means. Causal intervention establishes that these concepts directly drive, rather than merely correlate with, agent behavior: overriding concept assignments changes the selected action in 69.4% of interventions ($p = 8.6 \times 10^{-86}$, 2500 interventions). Concept importance and usage frequency are dissociated: the most-used concept (C47, 33.0% frequency) causes only a 9.4% win-rate drop when ablated, while ablating C16 (15.4% frequency) collapses win rate from 100% to 51.8%. Because concepts causally encode strategy, aligning them via optimal bipartite matching transfers strategic knowledge zero-shot. On $7 \times 7$ Go with three independently trained agents, concept transfer achieves 69.5% $\pm$ 3.2% and 76.4% $\pm$ 3.4% win rate against a standard engine across the two successful transfer pairs (10 seeds), compared to 3.5% for a random agent and 9.2% without alignment. Transfer succeeds when the source policy is strong; geometric alignment quality predicts nothing ($R^2 \approx 0$). The framework is scoped to domains where strategic state is naturally discrete: the identical pipeline on Atari Breakout yields bottleneck policies at random-agent performance, confirming that the Go results reflect a structural property of the domain.
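The two mechanical steps named in the abstract, K-means clustering of encoder features and optimal bipartite matching of the resulting concepts, can be illustrated with a minimal sketch. It is not the paper's implementation, and several pieces are assumptions made for illustration: the synthetic 2-D "encoder features", the helper names `kmeans` and `align_concepts`, the farthest-point initialisation, and above all the matching cost, which here is the co-occurrence count of two concepts on the same states (the abstract does not say how PRISM scores a candidate match). The matching is solved by exhaustive search over permutations, which is only feasible for a very small $K$; concept labels such as C47 suggest a much larger $K$, where a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import itertools
import math
import random


def kmeans(points, k, iters=10):
    """Lloyd's algorithm with deterministic farthest-point initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest from every centroid chosen so far.
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )

    def nearest(p):
        return min(range(k), key=lambda c: math.dist(p, centroids[c]))

    for _ in range(iters):
        assign = [nearest(p) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return [nearest(p) for p in points]


def align_concepts(src_assign, tgt_assign, k):
    """Return mapping[i] = target concept matched to source concept i.

    Assumed objective: maximise how often matched concepts fire on the same
    state. Brute force over permutations stands in for an assignment solver.
    """
    overlap = [[0] * k for _ in range(k)]
    for s, t in zip(src_assign, tgt_assign):
        overlap[s][t] += 1
    best = max(
        itertools.permutations(range(k)),
        key=lambda perm: sum(overlap[i][perm[i]] for i in range(k)),
    )
    return list(best)


# Synthetic stand-in data: 60 states, each with a hidden strategy in {0, 1, 2}.
# Two hypothetical agents embed the same states in unrelated feature spaces.
rng = random.Random(0)
latent = [i % 3 for i in range(60)]
centers_a = [(0, 0), (10, 0), (0, 10)]
centers_b = [(5, 5), (-5, 5), (0, -8)]


def features(centers):
    return [
        (centers[z][0] + rng.uniform(-0.5, 0.5), centers[z][1] + rng.uniform(-0.5, 0.5))
        for z in latent
    ]


src = kmeans(features(centers_a), 3)   # concept ids under agent A
tgt = kmeans(features(centers_b), 3)   # concept ids under agent B
mapping = align_concepts(src, tgt, 3)

# Fraction of states where the mapped source concept equals the target concept.
agreement = sum(mapping[s] == t for s, t in zip(src, tgt)) / len(latent)
print("concept mapping A -> B:", mapping)
print("agreement on shared states:", agreement)
```

Because the cost is built from concept assignments on shared states rather than from centroid geometry, the sketch does not require the two agents to share a feature space. Whether PRISM uses a cost of this kind is not stated in the abstract.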
