Fugu-MT Paper Translation (Summary): Achieve Performatively Optimal Policy for Performative Reinforcement Learning

Paper summary: Achieve Performatively Optimal Policy for Performative Reinforcement Learning

arxiv url: http://arxiv.org/abs/2510.04430v1
Date: Mon, 06 Oct 2025 01:56:31 GMT
Status: Translation complete
Last updated in system: 2025-10-07 16:52:59.645383
Title: Achieve Performatively Optimal Policy for Performative Reinforcement Learning
Title (reference translation): Achieving a Performatively Optimal Policy for Performative Reinforcement Learning
Authors: Ziyi Chen, Heng Huang
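Abstract summary: This work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm. Experimental results suggest that 0-FW is more effective than existing algorithms at finding the desired PO policy.
Reference score (independently computed attention score): 55.983627302691424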
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Performative reinforcement learning is an emerging dynamical decision making framework, which extends reinforcement learning to the common applications where the agent's policy can change the environmental dynamics. Existing works on performative reinforcement learning only aim at a performatively stable (PS) policy that maximizes an approximate value function. However, there is a provably positive constant gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm, which uses a zeroth-order approximation of the performative policy gradient in the Frank-Wolfe framework, and obtains the first polynomial-time convergence to the desired PO policy under the standard regularizer dominance condition. For the convergence analysis, we prove two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point (not PS) of the value function is a desired PO policy. Second, though the value function has an unbounded gradient, we prove that all the sufficiently stationary points lie in a convex and compact policy subspace $\Pi_{\Delta}$, where the policy value has a constant lower bound $\Delta>0$ and thus the gradient becomes bounded and Lipschitz continuous. Experimental results also demonstrate that our 0-FW algorithm is more effective than the existing algorithms in finding the desired PO policy.
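Abstract (reference translation): Performative reinforcement learning is a dynamic decision-making framework that extends reinforcement learning to common applications in which the agent's policy can change the environmental dynamics. Existing work on performative reinforcement learning targets only a performatively stable (PS) policy that maximizes an approximate value function. However, there is a provably positive gap between the PS policy and the desired performatively optimal (PO) policy that maximizes the original value function. In contrast, this work proposes a zeroth-order Frank-Wolfe (0-FW) algorithm that uses a zeroth-order approximation of the performative policy gradient within the Frank-Wolfe framework, and obtains the first polynomial-time convergence to the desired PO policy under the standard regularizer dominance condition. The convergence analysis proves two important properties of the nonconvex value function. First, when the policy regularizer dominates the environmental shift, the value function satisfies a certain gradient dominance property, so that any stationary point (not PS) of the value function is a desired PO policy. Second, although the value function has an unbounded gradient, all sufficiently stationary points are shown to lie in a convex and compact policy subspace $\Pi_{\Delta}$. The proposed algorithm is also shown to be more effective than existing algorithms in finding the desired PO policy.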
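
The general idea of a zeroth-order Frank-Wolfe iteration named in the abstract can be illustrated with a minimal sketch. This is not the paper's 0-FW algorithm: the objective below is a toy entropy-regularized reward with no policy-dependent dynamics, the gradient estimator is plain coordinate-wise central differences (the paper's estimator may differ), and every name and parameter (`objective`, `zeroth_order_gradient`, `linear_maximizer`, `lam`, `delta`, `h`) is hypothetical. The feasible set, a simplex with entries bounded below by `delta`, is only loosely modeled on the subspace $\Pi_{\Delta}$, where the gradient stays bounded.

```python
import numpy as np

def objective(x, reward, lam):
    """Toy entropy-regularized value: reward . x - lam * sum(x log x).

    Assumption: a stand-in for a value function whose gradient is
    unbounded near the simplex boundary (log x -> -inf as x -> 0).
    """
    return float(reward @ x - lam * np.sum(x * np.log(x)))

def zeroth_order_gradient(f, x, h):
    """Central finite differences: uses only function evaluations of f.

    The perturbed points leave the simplex (their sum is not 1), so f
    must be defined on the positive orthant; this requires h < min(x).
    """
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2.0 * h)
    return grad

def linear_maximizer(grad, delta):
    """Vertex of {x : sum(x) = 1, x_i >= delta} that maximizes grad . x."""
    n = grad.size
    vertex = np.full(n, delta)
    vertex[np.argmax(grad)] += 1.0 - n * delta
    return vertex

def zeroth_order_frank_wolfe(f, n, delta, num_iters, h):
    """Frank-Wolfe ascent with the classic 2 / (t + 2) step size."""
    x = np.full(n, 1.0 / n)  # uniform start, feasible when delta <= 1 / n
    for t in range(num_iters):
        grad = zeroth_order_gradient(f, x, h)
        target = linear_maximizer(grad, delta)
        gamma = 2.0 / (t + 2.0)
        x = x + gamma * (target - x)  # convex combination stays feasible
    return x

reward = np.array([1.0, 0.5, 0.2, 0.0])
lam, delta = 0.5, 0.01
f = lambda x: objective(x, reward, lam)
x_hat = zeroth_order_frank_wolfe(f, n=4, delta=delta, num_iters=5000, h=1e-4)

# For this toy objective the maximizer over the simplex is softmax(reward / lam).
x_star = np.exp(reward / lam) / np.sum(np.exp(reward / lam))
print("estimate:", np.round(x_hat, 3))
print("softmax :", np.round(x_star, 3))
```

Because each update is a convex combination of feasible points, the iterate never leaves the restricted set, so no projection step is needed. In this toy problem the known closed-form maximizer gives a direct check on the estimate.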

Related papers