Fugu-MT 論文翻訳(概要): Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

論文の概要: Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation

arxiv url: http://arxiv.org/abs/2405.17061v1
Date: Mon, 27 May 2024 11:31:54 GMT
ステータス: 翻訳完了
システム内更新日: 2024-05-28 15:42:27.339944
Title: Provably Efficient Reinforcement Learning with Multinomial Logit Function Approximation
Title（参考訳）: 多項ロジット関数近似を用いた高能率強化学習
Authors: Long-Fei Li, Yu-Jie Zhang, Peng Zhao, Zhi-Hua Zhou,
Abstract要約: 本稿では,MNL関数近似を用いたMDPの新しいクラスについて検討し,状態空間上の確率分布の正当性を保証する。その利点にもかかわらず、非線形関数近似を導入することは、計算効率と統計効率の両方において大きな課題を提起する。
参考スコア（独自算出の注目度）: 67.8414514524356
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We study a new class of MDPs that employs multinomial logit (MNL) function approximation to ensure valid probability distributions over the state space. Despite its benefits, introducing non-linear function approximation raises significant challenges in both computational and statistical efficiency. The best-known method of Hwang and Oh [2023] has achieved an $\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent quantity, $d$ is the feature space dimension, $H$ is the episode length, and $K$ is the number of episodes. While this result attains the same rate in $K$ as the linear cases, the method requires storing all historical data and suffers from an $\mathcal{O}(K)$ computation cost per episode. Moreover, the quantity $\kappa$ can be exponentially small, leading to a significant gap for the regret compared to the linear cases. In this work, we first address the computational concerns by proposing an online algorithm that achieves the same regret with only $\mathcal{O}(1)$ computation cost. Then, we design two algorithms that leverage local information to enhance statistical efficiency. They not only maintain an $\mathcal{O}(1)$ computation cost per episode but achieve improved regrets of $\widetilde{\mathcal{O}}(\kappa^{-1/2}dH^2\sqrt{K})$ and $\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$ respectively. Finally, we establish a lower bound, justifying the optimality of our results in $d$ and $K$. To the best of our knowledge, this is the first work that achieves almost the same computational and statistical efficiency as linear function approximation while employing non-linear function approximation for reinforcement learning.
Abstract（参考訳）: 本稿では,MNL関数近似を用いたMDPの新しいクラスについて検討し,状態空間上の確率分布の正当性を保証する。その利点にもかかわらず、非線形関数近似を導入することは、計算効率と統計効率の両方において大きな課題を提起する。 Hwang と Oh [2023] の最もよく知られている方法は、$\widetilde{\mathcal{O}}(\kappa^{-1}dH^2\sqrt{K})$ regret, where $\kappa$ is a problem-dependent amount, $d$ is the feature space dimension, $H$ is the episode length, $K$ is the number of episodes。この結果は、線形の場合と同じ$Kで達成されるが、この方法はすべての履歴データを保存し、エピソード毎に$\mathcal{O}(K)$の計算コストに悩まされる。さらに、$\kappa$ の量は指数関数的に小さくなり、線形の場合と比較して後悔に対する大きなギャップが生じる。本研究は, オンラインアルゴリズムを用いて, 計算コストを$\mathcal{O}(1)$$に抑えることで, 計算上の問題に対処するものである。そこで我々は,統計的効率を高めるために,局所的な情報を活用する2つのアルゴリズムを設計する。さらに、$\widetilde{\mathcal{O}}(\kappa^{-1/2}dH^2\sqrt{K})$と$\widetilde{\mathcal{O}}(dH^2\sqrt{K} + \kappa^{-1}d^2H^2)$をそれぞれ改善した後悔を実現する。最後に、より低い境界を確立し、結果の最適性を$d$と$K$で正当化する。我々の知る限りでは、強化学習に非線形関数近似を用いながら線形関数近似とほぼ同じ計算効率と統計的効率を達成する最初の研究である。

関連論文リスト

Enjoying Non-linearity in Multinomial Logistic Bandits [66.36004256710824]
我々は,学習者が環境と対話する一般化線形帯域幅の変種である多項ロジスティック帯域幅問題を考察する。我々は、$kappa_*$の定義を多項集合に拡張し、問題の非線形性を活用する効率的なアルゴリズムを提案する。我々のメソッドは、代入$ smashwidetildemathcalO(Kd sqrtT/kappa_*) $, $K$はアクションの数で、$kappa_*
論文参考訳（メタデータ） (2025-07-07T08:18:25Z)
Improved Regret Bound for Safe Reinforcement Learning via Tighter Cost Pessimism and Reward Optimism [1.4999444543328293]
本稿では,新しいコストと報酬関数推定器に基づくモデルベースアルゴリズムを提案する。我々のアルゴリズムは、$widetildemathcalO((bar C - bar C_b)-1H2.5 SsqrtAK)$の残念な上限を達成する。
論文参考訳（メタデータ） (2024-10-14T04:51:06Z)
Sharper Bounds for Chebyshev Moment Matching, with Applications [26.339024618084476]
本研究では,チェビシェフモーメントの雑音測定により,確率分布をほぼ復元する問題について検討する。任意のリプシッツ函数のチェビシェフ展開の係数に束縛された大域的崩壊を利用することで、ワッサーシュタイン距離の正確な回復が以前よりも多くのノイズで可能であることを証明できる。
論文参考訳（メタデータ） (2024-08-22T13:26:41Z)
Randomized Exploration for Reinforcement Learning with Multinomial Logistic Function Approximation [8.274693573069442]
多項ロジスティック(MNL)関数近似を用いた強化学習について検討した。頻繁な後悔の保証を有するランダムな探索を伴う確率的効率のアルゴリズムを提案する。数値実験により提案アルゴリズムの優れた性能を示す。
論文参考訳（メタデータ） (2024-05-30T15:39:19Z)
Tackling Heavy-Tailed Rewards in Reinforcement Learning with Function Approximation: Minimax Optimal and Instance-Dependent Regret Bounds [26.277745106128197]
本研究では,線形関数近似を用いた強化学習におけるそのような報奨の課題に対処する。我々はまず,重み付き線形包帯に対するtextscHeavy-OFUL というアルゴリズムを設計し,インセンス依存の$T$-round regret of $tildeObig を実現した。我々の結果は、オンライン回帰問題全般において、重くノイズを扱うことに独立した関心を持つような、新しい自己正規化集中不等式によって達成される。
論文参考訳（メタデータ） (2023-06-12T02:56:09Z)
Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
本稿では,一般値関数近似を用いた効率の良い強化学習アルゴリズムを提案する。本アルゴリズムは, 線形設定と疎高次元線形設定の両方に適用した場合に, 合理的な後悔境界を達成できる。
論文参考訳（メタデータ） (2023-02-22T20:21:25Z)
Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
我々は,損失関数が約1,300ドル以上のエピソードに対して任意に変化するような,敵対的決定過程(MDP)の学習を検討する。本稿では,同じ設定で$tildemathcal O(K2/3)$に対する後悔を改善する2つのアルゴリズムを提案する。
論文参考訳（メタデータ） (2023-01-30T14:37:21Z)
Nearly Minimax Optimal Reinforcement Learning with Linear Function Approximation [25.60689712525918]
本稿では,遷移確率と報酬関数が線形な線形関数近似を用いた強化学習について検討する。本稿では,新たなアルゴリズムLSVI-UCB$+$を提案し,$H$がエピソード長,$d$が特徴次元,$T$がステップ数である場合に,$widetildeO(HdsqrtT)$ regretboundを実現する。
論文参考訳（メタデータ） (2022-06-23T06:04:21Z)
TURF: A Two-factor, Universal, Robust, Fast Distribution Learning Algorithm [64.13217062232874]
最も強力で成功したモダリティの1つは、全ての分布を$ell$距離に近似し、基本的に最も近い$t$-piece次数-$d_$の少なくとも1倍大きい。本稿では,この数値をほぼ最適に推定する手法を提案する。
論文参考訳（メタデータ） (2022-02-15T03:49:28Z)
Improved No-Regret Algorithms for Stochastic Shortest Path with Linear MDP [31.62899359543925]
線形MDPを用いた最短経路問題(SSP)に対する2つの新しい非回帰アルゴリズムを提案する。我々の最初のアルゴリズムは計算効率が高く、後悔すべき$widetildeOleft(sqrtd3B_star2T_star Kright)$を達成している。第2のアルゴリズムは計算的に非効率であるが、$T_starに依存しない$widetildeO(d3.5B_starsqrtK)$の最初の「水平な」後悔を実現する。
論文参考訳（メタデータ） (2021-12-18T06:47:31Z)
Provably Breaking the Quadratic Error Compounding Barrier in Imitation Learning, Optimally [58.463668865380946]
状態空間 $mathcalS$ を用いたエピソードマルコフ決定過程 (MDPs) における模擬学習の統計的限界について検討する。 rajaraman et al (2020) におけるmdアルゴリズムを用いた準最適性に対する上限 $o(|mathcals|h3/2/n)$ を定式化する。 Omega(H3/2/N)$ $mathcalS|geq 3$ であるのに対して、未知の遷移条件はよりシャープレートに悩まされる。
論文参考訳（メタデータ） (2021-02-25T15:50:19Z)
Nearly Optimal Regret for Learning Adversarial MDPs with Linear Function Approximation [92.3161051419884]
我々は、敵対的な報酬と完全な情報フィードバックで有限正方体エピソディックマルコフ決定プロセスのための強化学習を研究します。我々は、$tildeO(dHsqrtT)$ regretを達成できることを示し、$H$はエピソードの長さである。また、対数因子までの$tildeOmega(dHsqrtT)$の値が一致することを証明する。
論文参考訳（メタデータ） (2021-02-17T18:54:08Z)
Nearly Minimax Optimal Reinforcement Learning for Linear Mixture Markov Decision Processes [91.38793800392108]
本稿では,マルコフ決定過程(MDP)の遷移確率核が線形混合モデルである線形関数近似による強化学習について検討する。上記の線形混合 MDP に対して$textUCRL-VTR+$ という線形関数近似を用いた計算効率の良い新しいアルゴリズムを提案する。我々の知る限り、これらは線形関数近似を持つRLのための計算効率が良く、ほぼ最小のアルゴリズムである。
論文参考訳（メタデータ） (2020-12-15T18:56:46Z)
Convergence of Sparse Variational Inference in Gaussian Processes Regression [29.636483122130027]
計算コストが$mathcalO(log N)2D(log N)2)$の手法を推論に利用できることを示す。
論文参考訳（メタデータ） (2020-08-01T19:23:34Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。