Fugu-MT 論文翻訳(概要): Online Sub-Sampling for Reinforcement Learning with General Function Approximation

論文の概要: Online Sub-Sampling for Reinforcement Learning with General Function Approximation

arxiv url: http://arxiv.org/abs/2106.07203v1
Date: Mon, 14 Jun 2021 07:36:25 GMT
ステータス: 翻訳完了
システム内更新日: 2021-06-15 16:37:44.695617
Title: Online Sub-Sampling for Reinforcement Learning with General Function Approximation
Title（参考訳）: 一般関数近似を用いた強化学習のためのオンラインサブサンプリング
Authors: Dingwen Kong, Ruslan Salakhutdinov, Ruosong Wang, Lin F. Yang
Abstract要約: 我々は,平均1ラウンドあたりの計算時間を$widetildeO(mathrmpoly(dH))$$$で計算し,ほぼ同じ後悔境界を満足するアルゴリズムを開発した。このアルゴリズムは低いスイッチングコスト、すなわち、実行中に$widetildeO(mathrmpoly(dH))$時間だけポリシーを変更する。
参考スコア（独自算出の注目度）: 102.28103481747165
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Designing provably efficient algorithms with general function approximation is an important open problem in reinforcement learning. Recently, Wang et al.~[2020c] establish a value-based algorithm with general function approximation that enjoys $\widetilde{O}(\mathrm{poly}(dH)\sqrt{K})$\footnote{Throughout the paper, we use $\widetilde{O}(\cdot)$ to suppress logarithm factors. } regret bound, where $d$ depends on the complexity of the function class, $H$ is the planning horizon, and $K$ is the total number of episodes. However, their algorithm requires $\Omega(K)$ computation time per round, rendering the algorithm inefficient for practical use. In this paper, by applying online sub-sampling techniques, we develop an algorithm that takes $\widetilde{O}(\mathrm{poly}(dH))$ computation time per round on average, and enjoys nearly the same regret bound. Furthermore, the algorithm achieves low switching cost, i.e., it changes the policy only $\widetilde{O}(\mathrm{poly}(dH))$ times during its execution, making it appealing to be implemented in real-life scenarios. Moreover, by using an upper-confidence based exploration-driven reward function, the algorithm provably explores the environment in the reward-free setting. In particular, after $\widetilde{O}(\mathrm{poly}(dH))/\epsilon^2$ rounds of exploration, the algorithm outputs an $\epsilon$-optimal policy for any given reward function.
Abstract（参考訳）: 一般関数近似を用いた有効効率なアルゴリズムの設計は強化学習において重要なオープン問題である。最近、Wang et al.~[2020c]は、対数係数を抑えるために$\widetilde{O}(\mathrm{poly}(dH)\sqrt{K})$\footnote{Throughout the paper, we use $\widetilde{O}(\cdot)$. } を楽しむ一般関数近似を用いた値ベースのアルゴリズムを確立している。残念なことに、$d$ は関数クラスの複雑さに依存し、$h$ は計画の地平線であり、$k$ はエピソードの総数である。しかし、それらのアルゴリズムは1ラウンドあたり$\Omega(K)$計算時間を必要とし、このアルゴリズムは実用上非効率である。本稿では,オンラインサブサンプリング手法を適用して,平均1ラウンドあたりの計算時間を$\widetilde{O}(\mathrm{poly}(dH))$$$\widetilde{O}(\mathrm{poly}(dH))とすることで,ほぼ同じ後悔点を持つアルゴリズムを開発した。さらに、アルゴリズムは低スイッチングコスト、すなわち、実行中に$\widetilde{O}(\mathrm{poly}(dH))$時間だけポリシーを変更し、実際のシナリオで実装することをアピールする。さらに, 高信頼度に基づく探索駆動報酬関数を用いて, 報奨条件下での環境を良好に探索する。特に、$\widetilde{o}(\mathrm{poly}(dh))/\epsilon^2$ rounds of explorationの後、アルゴリズムは与えられた報酬関数に対して$\epsilon$-optimalポリシーを出力する。

関連論文リスト

Accelerating RL for LLM Reasoning with Optimal Advantage Regression [52.0792918455501]
本稿では,最適優位関数を直接近似する新しい2段階ポリシー最適化フレームワークを提案する。 A$*-POは、幅広い数学的推論ベンチマークで競合性能を達成する。 PPO、GRPO、REBELと比較して、トレーニング時間を最大2$times$、ピークメモリ使用率を30%以上削減する。
論文参考訳（メタデータ） (2025-05-27T03:58:50Z)
Near-Optimal Regret in Linear MDPs with Aggregate Bandit Feedback [38.61232011566285]
本稿では,最近提案されたRLモデルとアグリゲート帯域フィードバック(RL-ABF)について検討する。本稿では,ABFを線形関数近似に拡張し,ほぼ最適後悔保証を伴う2つの効率的なアルゴリズムを開発する。
論文参考訳（メタデータ） (2024-05-13T10:51:01Z)
A Nearly Optimal and Low-Switching Algorithm for Reinforcement Learning with General Function Approximation [66.26739783789387]
我々は、強化学習のための新しいアルゴリズム、MQL-UCBを用いたモノトニックQ-Learningを提案する。 MQL-UCBは、$tildeO(dsqrtHK)$の最小限の後悔を実現する。本研究は,非線形関数近似を用いたサンプル効率およびデプロイメント効率のよいQ-ラーニングの設計に重点を置いている。
論文参考訳（メタデータ） (2023-11-26T08:31:57Z)
Low-Switching Policy Gradient with Exploration via Online Sensitivity Sampling [23.989009116398208]
一般非線形関数近似を用いた低スイッチングサンプリング効率ポリシ最適化アルゴリズム LPO を設計する。提案アルゴリズムは,$widetildeO(fractextpoly(d)varepsilon3)$サンプルのみを用いて,$varepsilon$-optimal Policyを得る。
論文参考訳（メタデータ） (2023-06-15T23:51:46Z)
Provably Efficient Reinforcement Learning via Surprise Bound [66.15308700413814]
本稿では,一般値関数近似を用いた効率の良い強化学習アルゴリズムを提案する。本アルゴリズムは, 線形設定と疎高次元線形設定の両方に適用した場合に, 合理的な後悔境界を達成できる。
論文参考訳（メタデータ） (2023-02-22T20:21:25Z)
Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation [16.871660060209674]
本研究では, 線形関数近似を用いた展開効率向上強化学習(RL)の課題を, 遠近自由探索条件下で検討する。我々は,最大$widetildeO(fracd2H5epsilon2)$ trajectoriesを$H$デプロイメント内で収集し,$epsilon$-Optimal Policyを任意の(おそらくはデータに依存した)報酬関数の選択に対して識別するアルゴリズムを提案する。
論文参考訳（メタデータ） (2022-10-03T03:48:26Z)
Human-in-the-loop: Provably Efficient Preference-based Reinforcement Learning with General Function Approximation [107.54516740713969]
本研究は,RL(Human-in-the-loop reinforcement learning)を軌道的嗜好で検討する。各ステップで数値的な報酬を受ける代わりに、エージェントは人間の監督者から軌道上のペアよりも優先される。一般関数近似を用いたPbRLの楽観的モデルベースアルゴリズムを提案する。
論文参考訳（メタデータ） (2022-05-23T09:03:24Z)
Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) は、エージェントが探索中に報酬関数にアクセスできないような環境を考える。この分離は線形MDPの設定には存在しないことを示す。我々は$d$次元線形 MDP における報酬のない RL に対する計算効率の良いアルゴリズムを開発した。
論文参考訳（メタデータ） (2022-01-26T22:09:59Z)
Reinforcement Learning with General Value Function Approximation: Provably Efficient Approach via Bounded Eluder Dimension [124.7752517531109]
一般値関数近似を用いた効率の良い強化学習アルゴリズムを確立する。我々のアルゴリズムは、$d$が複雑性測度である場合、$widetildeO(mathrmpoly(dH)sqrtT)$の後悔の限界を達成することを示す。我々の理論は線形値関数近似によるRLの最近の進歩を一般化し、環境モデルに対する明示的な仮定をしない。
論文参考訳（メタデータ） (2020-05-21T17:36:09Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。