Fugu-MT 論文翻訳(概要): Policy Testing in Markov Decision Processes

論文の概要: Policy Testing in Markov Decision Processes

arxiv url: http://arxiv.org/abs/2505.15342v1
Date: Wed, 21 May 2025 10:13:54 GMT
ステータス: 翻訳完了
システム内更新日: 2025-05-22 15:42:59.501634
Title: Policy Testing in Markov Decision Processes
Title（参考訳）: マルコフ決定過程における政策試験
Authors: Kaito Ariu, Po-An Wang, Alexandre Proutiere, Kenshi Abe,
Abstract要約: 本研究では,不確実性条件下での割引決定プロセス(MDP)におけるポリシーテスト問題について検討する。目的は、与えられたポリシーの値が数値しきい値を超えるかどうかを決定することである。
参考スコア（独自算出の注目度）: 48.642181362172906
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study the policy testing problem in discounted Markov decision processes (MDPs) under the fixed-confidence setting. The goal is to determine whether the value of a given policy exceeds a specified threshold while minimizing the number of observations. We begin by deriving an instance-specific lower bound that any algorithm must satisfy. This lower bound is characterized as the solution to an optimization problem with non-convex constraints. We propose a policy testing algorithm inspired by this optimization problem--a common approach in pure exploration problems such as best-arm identification, where asymptotically optimal algorithms often stem from such optimization-based characterizations. As for other pure exploration tasks in MDPs, however, the non-convex constraints in the lower-bound problem present significant challenges, raising doubts about whether statistically optimal and computationally tractable algorithms can be designed. To address this, we reformulate the lower-bound problem by interchanging the roles of the objective and the constraints, yielding an alternative problem with a non-convex objective but convex constraints. Strikingly, this reformulated problem admits an interpretation as a policy optimization task in a newly constructed reversed MDP. Leveraging recent advances in policy gradient methods, we efficiently solve this problem and use it to design a policy testing algorithm that is statistically optimal--matching the instance-specific lower bound on sample complexity--while remaining computationally tractable. We validate our approach with numerical experiments.
Abstract（参考訳）: 本研究では,不確実性条件下でのマルコフ決定過程(MDP)における政策テスト問題について検討する。目的は、所定のポリシーの値が特定のしきい値を超えるかどうかを判断し、観測回数を最小限に抑えることである。まず、任意のアルゴリズムが満たさなければならないインスタンス固有の下界を導出する。この下界は非凸制約による最適化問題の解として特徴づけられる。本稿では,この最適化問題にインスパイアされたポリシテストアルゴリズムを提案する。しかし、MDPにおける他の純粋な探索作業に関して、下界問題における非凸制約は重大な課題を示し、統計的に最適か計算的に抽出可能なアルゴリズムを設計できるかという疑念を提起する。この問題に対処するために、目的と制約の役割を交換し、非凸目的と凸制約との代替問題を導出することにより、下界問題を再構築する。興味深いことに、この改革された問題は、新たに構築された逆MDPにおいて、ポリシー最適化タスクとして解釈される。近年の政策勾配法の進歩を生かして、この問題を効率的に解き、統計学的に最適である政策試験アルゴリズムを設計する。数値実験によるアプローチを検証する。

論文の概要: Policy Testing in Markov Decision Processes

関連論文リスト