Fugu-MT Paper Translation (Summary): On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

Paper summary: On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game

arxiv url: http://arxiv.org/abs/2110.09771v1
Date: Tue, 19 Oct 2021 07:26:33 GMT
Status: Translation complete
Last updated in system: 2021-10-20 13:55:51.438868
Title: On Reward-Free RL with Kernel and Neural Function Approximations: Single-Agent MDP and Markov Game
Authors: Shuang Qiu, Jieping Ye, Zhaoran Wang, Zhuoran Yang
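Abstract summary: We study the reward-free RL problem, in which an agent aims to thoroughly explore the environment without any pre-specified reward function. We tackle this problem in the context of function approximation, leveraging powerful function approximators. We establish a provably efficient reward-free RL algorithm with kernel and neural function approximators.
Reference score (independently computed attention score): 140.19656665344917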
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: To achieve sample efficiency in reinforcement learning (RL), it necessitates efficiently exploring the underlying environment. Under the offline setting, addressing the exploration challenge lies in collecting an offline dataset with sufficient coverage. Motivated by such a challenge, we study the reward-free RL problem, where an agent aims to thoroughly explore the environment without any pre-specified reward function. Then, given any extrinsic reward, the agent computes the policy via a planning algorithm with offline data collected in the exploration phase. Moreover, we tackle this problem under the context of function approximation, leveraging powerful function approximators. Specifically, we propose to explore via an optimistic variant of the value-iteration algorithm incorporating kernel and neural function approximations, where we adopt the associated exploration bonus as the exploration reward. Moreover, we design exploration and planning algorithms for both single-agent MDPs and zero-sum Markov games and prove that our methods can achieve $\widetilde{\mathcal{O}}(1 /\varepsilon^2)$ sample complexity for generating a $\varepsilon$-suboptimal policy or $\varepsilon$-approximate Nash equilibrium when given an arbitrary extrinsic reward. To the best of our knowledge, we establish the first provably efficient reward-free RL algorithm with kernel and neural function approximators.
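The abstract says that exploration uses "the associated exploration bonus as the exploration reward" but this page does not give the bonus formula. As a rough illustration only, the sketch below computes the standard kernel ridge regression uncertainty width that kernelized UCB-style methods typically use as a bonus. The RBF kernel, the function names `rbf_kernel` and `exploration_bonus`, and the parameters `lam`, `beta`, and `lengthscale` are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def rbf_kernel(X, Y, lengthscale=1.0):
    """Squared-exponential kernel matrix between rows of X and rows of Y."""
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    # Clip tiny negative values caused by floating-point cancellation.
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * lengthscale**2))


def exploration_bonus(Z_seen, Z_query, lam=1.0, beta=1.0, lengthscale=1.0):
    """Kernel-ridge uncertainty width at each query point, scaled by beta.

    Uses w(z)^2 = (k(z, z) - k_t(z)^T (K_t + lam I)^{-1} k_t(z)) / lam,
    where K_t is the kernel matrix of the state-action pairs seen so far.
    This is an illustrative choice, not the paper's stated formula.
    """
    k_qq = np.ones(len(Z_query))  # for the RBF kernel, k(z, z) = 1
    if len(Z_seen) == 0:
        return beta * np.sqrt(k_qq / lam)
    K = rbf_kernel(Z_seen, Z_seen, lengthscale)
    k_q = rbf_kernel(Z_seen, Z_query, lengthscale)  # shape (t, m)
    solved = np.linalg.solve(K + lam * np.eye(len(Z_seen)), k_q)
    var = (k_qq - np.sum(k_q * solved, axis=0)) / lam
    return beta * np.sqrt(np.maximum(var, 0.0))


# Hypothetical data: 20 visited state-action feature vectors in 3 dimensions.
rng = np.random.default_rng(0)
Z_seen = rng.normal(size=(20, 3))
Z_query = np.vstack([Z_seen[:1], np.full((1, 3), 10.0)])  # visited vs. far away
bonus_visited, bonus_far = exploration_bonus(Z_seen, Z_query)
print(bonus_visited, bonus_far)
```

Under these assumptions the bonus is small near visited state-action pairs and approaches `beta / sqrt(lam)` far from them, so a policy that maximizes the bonus as its reward is pushed toward poorly covered regions.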

Related papers

Minimax-Optimal Reward-Agnostic Exploration in Reinforcement Learning [17.239062061431646]
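This paper studies reward-agnostic exploration in reinforcement learning (RL). We consider a finite-horizon inhomogeneous Markov decision process with $S$ states, $A$ actions, and horizon length $H$. Our algorithm can achieve $\varepsilon$ accuracy for an arbitrary number of reward functions.
Paper reference translation (metadata) (2023-04-14T17:46:49Z)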
Optimal Horizon-Free Reward-Free Exploration for Linear Mixture MDPs [60.40452803295326]
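We propose a new reward-free algorithm for learning linear mixture Markov decision processes (MDPs). At the core of our algorithm is uncertainty-weighted value-targeted regression with an exploration-driven pseudo-reward. Our algorithm only needs to explore $\tilde{O}(d^2\varepsilon^{-2})$ episodes to find an $\varepsilon$-optimal policy.
Paper reference translation (metadata) (2023-03-17T17:53:28Z)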
Near-Optimal Deployment Efficiency in Reward-Free Reinforcement Learning with Linear Function Approximation [16.871660060209674]
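We study the problem of deployment-efficient reinforcement learning (RL) with linear function approximation under the reward-free exploration setting. We propose an algorithm that collects at most $\widetilde{O}(\frac{d^2H^5}{\epsilon^2})$ trajectories within $H$ deployments to identify an $\epsilon$-optimal policy for any (possibly data-dependent) choice of reward functions.
Paper reference translation (metadata) (2022-10-03T03:48:26Z)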
Provably Efficient Offline Reinforcement Learning with Trajectory-Wise Reward [66.81579829897392]
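We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED). PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy rewards. To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDPs with trajectory-wise rewards.
Paper reference translation (metadata) (2022-06-13T19:11:22Z)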
Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov Decision Processes [61.11090361892306]
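Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration. We show that a separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs. We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
Paper reference translation (metadata) (2022-01-26T22:09:59Z)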
On Reward-Free Reinforcement Learning with Linear Function Approximation [144.4210285338698]
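Reward-free reinforcement learning (RL) is a framework suitable for both the batch RL setting and the setting where there are many reward functions. In this work, we give both positive and negative results for reward-free RL with linear function approximation.
Paper reference translation (metadata) (2020-06-19T17:59:36Z)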
Exploration by Maximizing Rényi Entropy for Reward-Free RL Framework [28.430845498323745]
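We consider a reward-free reinforcement learning framework that separates exploration from exploitation. In the exploration phase, the agent learns an exploratory policy by interacting with a reward-free environment. In the planning phase, the agent computes a good policy for a given reward function based on the collected dataset.
Paper reference translation (metadata) (2020-06-11T05:05:31Z)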
Reward-Free Exploration for Reinforcement Learning [82.3300753751066]
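We propose a "reward-free RL" framework that isolates the challenges of exploration. We give an efficient algorithm that conducts $\tilde{\mathcal{O}}(S^2A\,\mathrm{poly}(H)/\epsilon^2)$ episodes of exploration. We also give a nearly matching $\Omega(S^2AH^2/\epsilon^2)$ lower bound, demonstrating the near-optimality of our algorithm in this setting.
Paper reference translation (metadata) (2020-02-07T14:03:38Z)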

The related-paper list is generated automatically from the titles and abstracts of papers on this site.
