Fugu-MT Paper Translation (Abstract): Reward Biased Maximum Likelihood Estimation for Reinforcement Learning

Paper summary: Reward Biased Maximum Likelihood Estimation for Reinforcement Learning

arxiv url: http://arxiv.org/abs/2011.07738v3
Date: Sat, 15 May 2021 20:47:58 GMT
Status: Translation complete
Last updated in system: 2022-09-25 01:27:26.638791
Title: Reward Biased Maximum Likelihood Estimation for Reinforcement Learning
Authors: Akshay Mete, Rahul Singh, Xi Liu and P. R. Kumar
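Abstract summary: The Reward-Biased Maximum Likelihood Estimate (RBMLE) was proposed for adaptive control of Markov chains. We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms.
Reference score (independently computed attention score): 13.820705458648233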
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Reward-Biased Maximum Likelihood Estimate (RBMLE) for adaptive control of Markov chains was proposed to overcome the central obstacle of what is variously called the fundamental "closed-loop identifiability problem" of adaptive control, the "dual control problem", or, contemporaneously, the "exploration vs. exploitation problem". It exploited the key observation that since the maximum likelihood parameter estimator can asymptotically identify the closed-loop transition probabilities under a certainty equivalent approach, the limiting parameter estimates must necessarily have an optimal reward that is less than the optimal reward attainable for the true but unknown system. Hence it proposed a counteracting reverse bias in favor of parameters with larger optimal rewards, providing a solution to the fundamental problem alluded to above. It thereby proposed an optimistic approach of favoring parameters with larger optimal rewards, now known as "optimism in the face of uncertainty". The RBMLE approach has been proved to be long-term average reward optimal in a variety of contexts. However, modern attention is focused on the much finer notion of "regret", or finite-time performance. Recent analysis of RBMLE for multi-armed stochastic bandits and linear contextual bandits has shown that it not only has state-of-the-art regret, but it also exhibits empirical performance comparable to or better than the best current contenders, and leads to strikingly simple index policies. Motivated by this, we examine the finite-time performance of RBMLE for reinforcement learning tasks that involve the general problem of optimal control of unknown Markov Decision Processes. We show that it has a regret of $\mathcal{O}(\log T)$ over a time horizon of $T$ steps, similar to state-of-the-art algorithms. Simulation studies show that RBMLE outperforms other algorithms such as UCRL2 and Thompson Sampling.
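The reward-biasing idea in the abstract can be illustrated on a toy problem. The following is a minimal sketch, not the paper's algorithm: it assumes a Bernoulli multi-armed bandit rather than a general MDP, a small finite grid of candidate parameters, and a bias weight `alpha = log(t + 1)`. The abstract specifies none of these, and the function names `log_likelihood`, `rbmle_estimate`, and `run_rbmle` are hypothetical. At each step the estimate maximizes the log-likelihood plus `alpha` times the candidate's optimal reward, and the arm that is optimal for that estimate is played.

```python
import math
import random
from itertools import product


def log_likelihood(theta, successes, failures):
    """Bernoulli log-likelihood of the observed counts under arm means theta."""
    return sum(
        s * math.log(p) + f * math.log(1.0 - p)
        for p, s, f in zip(theta, successes, failures)
    )


def rbmle_estimate(candidates, successes, failures, alpha):
    """Return the candidate maximizing log-likelihood + alpha * optimal reward.

    For a bandit, the optimal reward of a candidate is its largest arm mean.
    """
    return max(
        candidates,
        key=lambda th: log_likelihood(th, successes, failures) + alpha * max(th),
    )


def run_rbmle(true_means, horizon, grid=(0.1, 0.3, 0.5, 0.7, 0.9), seed=0):
    """Simulate the reward-biased rule; return (regret, pulls per arm)."""
    rng = random.Random(seed)
    n_arms = len(true_means)
    # Assumption: the unknown parameter lies in a finite candidate set.
    candidates = list(product(grid, repeat=n_arms))
    successes = [0] * n_arms
    failures = [0] * n_arms
    best_mean = max(true_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        alpha = math.log(t + 1)  # assumed slowly growing bias weight
        theta_hat = rbmle_estimate(candidates, successes, failures, alpha)
        # Play the arm that is optimal under the biased estimate.
        arm = max(range(n_arms), key=lambda a: theta_hat[a])
        reward = 1 if rng.random() < true_means[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        regret += best_mean - true_means[arm]
    pulls = [s + f for s, f in zip(successes, failures)]
    return regret, pulls


if __name__ == "__main__":
    regret, pulls = run_rbmle(true_means=(0.3, 0.7), horizon=2000)
    print("cumulative regret:", regret)
    print("pulls per arm:", pulls)
```

With `alpha = 0` the rule reduces to plain maximum likelihood with certainty-equivalent control. The bias term is what pushes the estimate toward candidates that promise a larger optimal reward.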

Related papers

Asymptotically Optimal Linear Best Feasible Arm Identification with Fixed Budget [55.938644481736446]
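This paper proposes a new algorithm for best-arm identification that guarantees exponential decay of the error probability. We validate the algorithm's effectiveness through a comprehensive empirical evaluation on problem instances of varying complexity.
Paper reference translation (metadata) (2025-06-03T02:56:26Z)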
Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
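A novel theoretical connection is established between supervised fine-tuning and offline reinforcement learning. We show that the widely used beam search method suffers from unacceptable over-optimism. We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimation.
Paper reference translation (metadata) (2025-04-10T07:50:03Z)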
Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
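We consider the problem of quantifying uncertainty over cumulative rewards in model-based reinforcement learning. We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values. We introduce Q-Uncertainty Soft Actor-Critic (QU-SAC), a general-purpose policy optimization algorithm that can be applied to either risk-seeking or risk-averse policy optimization.
Paper reference translation (metadata) (2023-12-07T15:55:58Z)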
Autoregressive Bandits [58.46584210388307]
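This paper proposes Autoregressive Bandits, an online learning setting. Under mild assumptions on the reward process, the optimal policy is shown to be conveniently computable. We then devise a new optimistic regret minimization algorithm, AutoRegressive Upper Confidence Bound (AR-UCB), which suffers sublinear regret of order $\widetilde{\mathcal{O}}\left( \frac{(k+1)^{3/2}\sqrt{nT}}{(1-G)\dots} \right)$.
Paper reference translation (metadata) (2022-12-12T21:37:36Z)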
Learning to Optimize with Stochastic Dominance Constraints [103.26714928625582]
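We develop a simple and efficient approach to the problem of comparing uncertain quantities. We recast the inner optimization in the Lagrangian as a learning problem for surrogate approximation. The proposed light-SD demonstrates superior performance on several representative problems ranging from finance to supply chain management.
Paper reference translation (metadata) (2022-11-14T21:54:31Z)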
Neural Contextual Bandits via Reward-Biased Maximum Likelihood Estimation [9.69596041242667]
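Reward-biased maximum likelihood estimation (RBMLE) is a classic principle in the adaptive control literature for tackling the exploration-exploitation trade-off. This paper studies the contextual bandit problem with general bounded reward functions and proposes NeuralRBMLE, which applies the RBMLE principle. Both algorithms achieve comparable or better empirical regret than state-of-the-art methods on real-world datasets with nonlinear reward functions.
Paper reference translation (metadata) (2022-03-08T16:33:36Z)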
Policy Gradient Bayesian Robust Optimization for Imitation Learning [49.881386773269746]
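We derive PG-BROIL, a novel policy-gradient-style robust optimization approach, to balance expected performance and risk. The results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse.
Paper reference translation (metadata) (2021-06-11T16:49:15Z)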
Navigating to the Best Policy in Markov Decision Processes [68.8204255655161]
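We study the pure exploration problem in Markov decision processes. The agent selects actions sequentially and aims to identify the best policy as quickly as possible from the resulting system trajectory.
Paper reference translation (metadata) (2021-06-05T09:16:28Z)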
Optimal Algorithms for Stochastic Multi-Armed Bandits with Heavy Tailed Rewards [24.983866845065926]
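We consider multi-armed bandits with heavy-tailed rewards, whose $p$-th moment is bounded by a constant $\nu_p$ for $1 < p \leq 2$. We propose a novel robust estimator that does not require $\nu_p$ as prior information. We show that the error probability of the proposed estimator decays exponentially fast.
Paper reference translation (metadata) (2020-10-24T10:44:02Z)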
Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss [145.54544979467872]
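We consider online learning for constrained Markov decision processes (CMDPs). We propose a new upper confidence primal-dual algorithm that requires only trajectories sampled from the transition model. Our analysis incorporates a new high-probability drift analysis of the Lagrange multiplier process into the celebrated regret analysis of upper confidence reinforcement learning.
Paper reference translation (metadata) (2020-03-02T05:02:23Z)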
A General Theory of the Stochastic Linear Bandit and Its Applications [8.071506311915398]
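This paper introduces a general analysis framework and a family of algorithms for the linear bandit problem. A new notion of optimism in expectation gives rise to a new algorithm, called sieved greedy (SG), that reduces the over-exploration problem in OFUL. In addition to showing that SG is theoretically optimal, empirical simulations show that SG outperforms existing benchmarks such as greedy, OFUL, and TS.
Paper reference translation (metadata) (2020-02-12T18:54:41Z)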

The related papers list is generated automatically from the titles and abstracts of the papers on this site.
