Fugu-MT Paper Translation (Abstract): Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

Paper overview: Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation

arxiv url: http://arxiv.org/abs/2405.20165v2
Date: Thu, 31 Oct 2024 05:14:55 GMT
Status: Translation complete
System update date: 2024-11-28 17:07:33.057184
Title: Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation
Authors: Wooseong Cho, Taehyun Hwang, Joongkyu Lee, Min-hwan Oh
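Abstract summary: We study reinforcement learning with multinomial logistic (MNL) function approximation. We propose provably efficient algorithms with randomized exploration that have frequentist regret guarantees. Numerical experiments demonstrate the superior performance of the proposed algorithms.
Reference score (independently computed attention score): 8.274693573069442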
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: We study reinforcement learning with multinomial logistic (MNL) function approximation where the underlying transition probability kernel of the Markov decision processes (MDPs) is parametrized by an unknown transition core with features of state and action. For the finite horizon episodic setting with inhomogeneous state transitions, we propose provably efficient algorithms with randomized exploration having frequentist regret guarantees. For our first algorithm, $\texttt{RRL-MNL}$, we adapt optimistic sampling to ensure the optimism of the estimated value function with sufficient frequency. We establish that $\texttt{RRL-MNL}$ achieves a $\tilde{O}(\kappa^{-1} d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T})$ frequentist regret bound with constant-time computational cost per episode. Here, $d$ is the dimension of the transition core, $H$ is the horizon length, $T$ is the total number of steps, and $\kappa$ is a problem-dependent constant. Despite the simplicity and practicality of $\texttt{RRL-MNL}$, its regret bound scales with $\kappa^{-1}$, which is potentially large in the worst case. To improve the dependence on $\kappa^{-1}$, we propose $\texttt{ORRL-MNL}$, which estimates the value function using the local gradient information of the MNL transition model. We show that its frequentist regret bound is $\tilde{O}(d^{\frac{3}{2}} H^{\frac{3}{2}} \sqrt{T} + \kappa^{-1} d^2 H^2)$. To the best of our knowledge, these are the first randomized RL algorithms for the MNL transition model that achieve statistical guarantees with constant-time computational cost per episode. Numerical experiments demonstrate the superior performance of the proposed algorithms.
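To make the phrase "transition probability kernel parametrized by an unknown transition core with features of state and action" concrete, here is a minimal sketch of an MNL transition model. It assumes the usual softmax form, in which the probability of each reachable next state is proportional to the exponential of the inner product between a feature vector and the transition core. The function name `mnl_transition_probs`, the feature vectors, and the numbers are illustrative assumptions, not taken from the paper, and the sketch does not implement RRL-MNL or ORRL-MNL.

```python
import math
from typing import List, Sequence


def mnl_transition_probs(
    features: Sequence[Sequence[float]], theta: Sequence[float]
) -> List[float]:
    """Return softmax probabilities over candidate next states.

    features[i] is a hypothetical feature vector for the i-th reachable
    next state given a fixed (state, action) pair; theta plays the role
    of the transition core (a d-dimensional parameter).
    """
    # Linear score for each candidate next state.
    logits = [sum(f * t for f, t in zip(phi, theta)) for phi in features]
    # Subtract the maximum logit before exponentiating for numerical stability.
    max_logit = max(logits)
    weights = [math.exp(z - max_logit) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


# Illustrative numbers only: d = 2 and three reachable next states.
theta = [0.5, -1.0]
features = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
probs = mnl_transition_probs(features, theta)
```

In a learning setting `theta` would be unknown and estimated from observed transitions; here it is fixed only to show how the features and the transition core combine into a probability distribution.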

