Near-Optimal Offline Reinforcement Learning via Double Variance
Reduction
- URL: http://arxiv.org/abs/2102.01748v1
- Date: Tue, 2 Feb 2021 20:47:35 GMT
- Title: Near-Optimal Offline Reinforcement Learning via Double Variance
Reduction
- Authors: Ming Yin, Yu Bai, Yu-Xiang Wang
- Abstract summary: Off-Policy Double Variance Reduction is a new variance reduction based algorithm for offline RL.
We show that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data.
We also show that OPDVR achieves rate-optimal sample complexity under alternative settings.
- Score: 36.027428493021716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We consider the problem of offline reinforcement learning (RL) -- a
well-motivated setting of RL that aims at policy optimization using only
historical data. Despite its wide applicability, theoretical understandings of
offline RL, such as its optimal sample complexity, remain largely open even in
basic settings such as \emph{tabular} Markov Decision Processes (MDPs).
In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new
variance reduction based algorithm for offline RL. Our main result shows that
OPDVR provably identifies an $\epsilon$-optimal policy with
$\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the
finite-horizon stationary transition setting, where $H$ is the horizon length
and $d_m$ is the minimal marginal state-action distribution induced by the
behavior policy. This improves over the best known upper bound by a factor of
$H$. Moreover, we establish an information-theoretic lower bound of
$\Omega(H^2/d_m\epsilon^2)$ which certifies that OPDVR is optimal up to
logarithmic factors. Lastly, we show that OPDVR also achieves rate-optimal
sample complexity under alternative settings such as the finite-horizon MDPs
with non-stationary transitions and the infinite horizon MDPs with discounted
rewards.
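As a rough illustration of the variance-reduction idea behind OPDVR (a minimal sketch under assumptions, not the paper's pseudocode: the toy MDP, the uniform behavior policy, the data-split sizes, and all names below are ours), the snippet performs tabular backward induction in which the Bellman backup is split into a reference term estimated on one half of the offline data and a small correction term $P[V - V_{\mathrm{ref}}]$ estimated on the other half; because the correction has small range, its estimate has low variance.

import numpy as np

rng = np.random.default_rng(0)

# Toy stationary finite-horizon MDP (assumed purely for illustration).
S, A, H = 5, 2, 10                               # states, actions, horizon
P_true = rng.dirichlet(np.ones(S), size=(S, A))  # true transition kernel P(s'|s,a)
R = rng.uniform(size=(S, A))                     # deterministic rewards in [0, 1]

def collect(n):
    # n transitions (s, a, s') generated by a uniform behavior policy.
    s = rng.integers(S, size=n)
    a = rng.integers(A, size=n)
    s_next = np.array([rng.choice(S, p=P_true[si, ai]) for si, ai in zip(s, a)])
    return s, a, s_next

def empirical_kernel(s, a, s_next):
    # Plug-in estimate of the transition kernel from counts (uniform fallback).
    counts = np.zeros((S, A, S))
    np.add.at(counts, (s, a, s_next), 1.0)
    totals = counts.sum(axis=-1, keepdims=True)
    return np.where(totals > 0, counts / np.maximum(totals, 1.0), 1.0 / S)

# Split the offline data into a reference batch and a correction batch.
P_ref = empirical_kernel(*collect(4000))
P_cor = empirical_kernel(*collect(4000))

# Reference value function: plain plug-in value iteration on the reference batch.
V_ref = np.zeros(S)
for _ in range(H):
    V_ref = (R + P_ref @ V_ref).max(axis=1)

# Variance-reduced backward induction: the correction batch only has to estimate
# P[V - V_ref], whose range (hence variance) shrinks as V approaches V_ref.
V = np.zeros(S)
for h in reversed(range(H)):
    Q = R + P_ref @ V_ref + P_cor @ (V - V_ref)
    V = Q.max(axis=1)

print("greedy policy from the variance-reduced Q-estimate:", Q.argmax(axis=1))

The full OPDVR algorithm is more involved (it combines this kind of reference/correction splitting with confidence bounds and a staged scheme); the sketch only conveys why splitting the data reduces the variance of the Bellman backup.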
Related papers
- Order-Optimal Instance-Dependent Bounds for Offline Reinforcement Learning with Preference Feedback [56.6950165117658]
We consider offline reinforcement learning with preference feedback in which the implicit reward is a linear function of an unknown parameter.
We propose an algorithm, RL with Locally Optimal Weights (RL-LOW), which yields an exponentially small simple regret.
We observe that the lower and upper bounds on the simple regret match order-wise in the exponent, demonstrating the order-wise optimality of RL-LOW.
arXiv Detail & Related papers (2024-06-18T02:03:12Z) - Offline Primal-Dual Reinforcement Learning for Linear MDPs [16.782625445546273]
Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
arXiv Detail & Related papers (2023-05-22T11:45:23Z) - On Instance-Dependent Bounds for Offline Reinforcement Learning with
Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI).
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(1/K)$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(1/K)$ bound and absolute zero sub-optimality bound, respectively, for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z) - Provably Efficient Offline Reinforcement Learning with Trajectory-Wise
Reward [66.81579829897392]
We propose a novel offline reinforcement learning algorithm called Pessimistic vAlue iteRaTion with rEward Decomposition (PARTED).
PARTED decomposes the trajectory return into per-step proxy rewards via least-squares-based reward redistribution, and then performs pessimistic value iteration based on the learned proxy reward.
To the best of our knowledge, PARTED is the first offline RL algorithm that is provably efficient in general MDP with trajectory-wise reward.
arXiv Detail & Related papers (2022-06-13T19:11:22Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the Proxy variable Pessimistic Policy Optimization (P3O) algorithm.
P3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - Settling the Sample Complexity of Model-Based Offline Reinforcement
Learning [50.5790774201146]
Offline reinforcement learning (RL) learns from pre-collected data without further exploration.
Prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality.
We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost.
arXiv Detail & Related papers (2022-04-11T17:26:19Z) - Reward-Free RL is No Harder Than Reward-Aware RL in Linear Markov
Decision Processes [61.11090361892306]
Reward-free reinforcement learning (RL) considers the setting where the agent does not have access to a reward function during exploration.
We show that the separation between reward-free and reward-aware RL does not exist in the setting of linear MDPs.
We develop a computationally efficient algorithm for reward-free RL in a $d$-dimensional linear MDP.
arXiv Detail & Related papers (2022-01-26T22:09:59Z) - Efficient Learning in Non-Stationary Linear Markov Decision Processes [17.296084954104415]
We study episodic reinforcement learning in non-stationary linear (a.k.a. low-rank) Markov Decision Processes (MDPs).
We propose OPT-WLSVI, an optimistic model-free algorithm based on weighted least-squares value iteration, which uses exponential weights to smoothly forget data that are far in the past.
We show that our algorithm, when competing against the best policy at each time, achieves a regret that is upper bounded by $\widetilde{\mathcal{O}}(d^{5/4} H^2 \Delta^{1/4} K^{3/4})$, where $d$ is the dimension of the feature space.
arXiv Detail & Related papers (2020-10-24T11:02:45Z)