Paper summary: Finding good policies in average-reward Markov Decision Processes without prior knowledge
- arxiv url: http://arxiv.org/abs/2405.17108v1
- Date: Mon, 27 May 2024 12:24:14 GMT
- Title: Finding good policies in average-reward Markov Decision Processes without prior knowledge
- Title (reference translation): Finding good policies in average-reward Markov decision processes without prior knowledge
- Authors: Adrienne Tuynman, Rémy Degenne, Emilie Kaufmann
- Abstract summary: We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP).
- Reference score (independently computed attention score): 19.89784209009327
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP). In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$. Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available. In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces. Recently, an algorithm with a sample complexity of order $SAH/\varepsilon^2$ has been proposed, but it requires knowledge of $H$. We first show that the sample complexity required to estimate $H$ is not bounded by any function of $S$, $A$ and $H$, which rules out easily making the previous algorithm agnostic to $H$. By relying instead on a diameter estimation procedure, we propose the first algorithm for $(\varepsilon,\delta)$-PAC policy identification that does not need any form of prior knowledge of the MDP. Its sample complexity scales as $SAD/\varepsilon^2$ in the regime of small $\varepsilon$, which is near-optimal. In the online setting, our first contribution is a lower bound which implies that a sample complexity polynomial in $H$ cannot be achieved in this setting. Then, we propose an online algorithm with a sample complexity of order $SAD^2/\varepsilon^2$, as well as a novel approach based on a data-dependent stopping rule that we believe is promising to further reduce this bound.
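- Abstract (reference translation): We revisit the identification of an $\varepsilon$-optimal policy in average-reward Markov Decision Processes (MDP).
In such MDPs, two measures of complexity have appeared in the literature: the diameter, $D$, and the optimal bias span, $H$, which satisfy $H\leq D$.
Prior work has studied the complexity of $\varepsilon$-optimal policy identification only when a generative model is available.
In this case, it is known that there exists an MDP with $D \simeq H$ for which the sample complexity to output an $\varepsilon$-optimal policy is $\Omega(SAD/\varepsilon^2)$, where $S$ and $A$ are the sizes of the state and action spaces.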
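
To make the two complexity measures in the abstract concrete, the following is a minimal sketch that computes the diameter $D$ and the optimal bias span $H$ for a small MDP whose model is fully known. It is not the paper's algorithm, which has to estimate these quantities from samples. The two-state MDP, the function names `diameter` and `optimal_bias_span`, and the tolerances are all assumptions made for illustration. The sketch uses the standard definitions: $D$ is the largest, over ordered pairs of distinct states, of the smallest expected hitting time achievable by any policy, and $H$ is the span (max minus min) of a solution of the average-reward Bellman optimality equation. Rewards are assumed to lie in $[0,1]$, and every action keeps a self-loop so that relative value iteration converges.

```python
import numpy as np

def diameter(P, tol=1e-10, max_iter=100_000):
    """Largest minimal expected hitting time over ordered state pairs.

    P has shape (S, A, S); P[s, a, s2] is the probability of moving s -> s2.
    Assumes the MDP is communicating, so every hitting time is finite.
    """
    S = P.shape[0]
    best = 0.0
    for target in range(S):
        # Probability mass that reaches the target contributes no further time.
        P_avoid = P.copy()
        P_avoid[:, :, target] = 0.0
        T = np.zeros(S)
        for _ in range(max_iter):
            # One step is always paid, then the fastest action is chosen.
            T_new = 1.0 + (P_avoid @ T).min(axis=1)
            T_new[target] = 0.0
            converged = np.max(np.abs(T_new - T)) < tol
            T = T_new
            if converged:
                break
        best = max(best, float(T.max()))
    return best

def optimal_bias_span(P, R, tol=1e-10, max_iter=100_000):
    """Span of the optimal bias, computed by relative value iteration.

    R has shape (S, A). State 0 is used as the reference state.
    Assumes aperiodicity (here guaranteed by the self-loops).
    """
    h = np.zeros(P.shape[0])
    for _ in range(max_iter):
        v = (R + P @ h).max(axis=1)   # Bellman optimality backup
        h_new = v - v[0]              # re-centre so the iterates stay bounded
        converged = np.max(np.abs(h_new - h)) < tol
        h = h_new
        if converged:
            break
    return float(h.max() - h.min())

# Hypothetical two-state, two-action MDP (illustration only).
P = np.array([
    [[1.0, 0.0], [0.8, 0.2]],   # state 0: "stay", "try to move to state 1"
    [[0.1, 0.9], [0.5, 0.5]],   # state 1: "stay (mostly)", "try to move back"
])
R = np.array([
    [0.2, 0.0],
    [1.0, 0.5],
])

D = diameter(P)
H = optimal_bias_span(P, R)
print(f"D = {D:.3f}, H = {H:.3f}")
```

For this toy MDP a hand calculation gives $D = 5$ (reaching state 1 from state 0 succeeds with probability 0.2 per step) and $H = 10/3$, consistent with $H \leq D$. The inequality relies on rewards being bounded in $[0,1]$, and exact computation of this kind is only possible because the transition model is given.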
