Offline Primal-Dual Reinforcement Learning for Linear MDPs
- URL: http://arxiv.org/abs/2305.12944v1
- Date: Mon, 22 May 2023 11:45:23 GMT
- Title: Offline Primal-Dual Reinforcement Learning for Linear MDPs
- Authors: Germano Gabbianelli, Gergely Neu, Nneka Okolo, Matteo Papini
- Abstract summary: Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from a fixed dataset of transitions collected by another policy.
This paper proposes a primal-dual optimization method based on the linear programming formulation of RL.
- Score: 16.782625445546273
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline Reinforcement Learning (RL) aims to learn a near-optimal policy from
a fixed dataset of transitions collected by another policy. This problem has
attracted a lot of attention recently, but most existing methods with strong
theoretical guarantees are restricted to finite-horizon or tabular settings. In
constrast, few algorithms for infinite-horizon settings with function
approximation and minimal assumptions on the dataset are both sample and
computationally efficient. Another gap in the current literature is the lack of
theoretical analysis for the average-reward setting, which is more challenging
than the discounted setting. In this paper, we address both of these issues by
proposing a primal-dual optimization method based on the linear programming
formulation of RL. Our key contribution is a new reparametrization that allows
us to derive low-variance gradient estimators that can be used in a stochastic
optimization scheme using only samples from the behavior policy. Our method
finds an $\varepsilon$-optimal policy with $O(\varepsilon^{-4})$ samples,
improving on the previous $O(\varepsilon^{-5})$, while being computationally
efficient for infinite-horizon discounted and average-reward MDPs with
realizable linear function approximation and partial coverage. Moreover, to the
best of our knowledge, this is the first theoretical result for average-reward
offline RL.
Related papers
- A Primal-Dual Algorithm for Offline Constrained Reinforcement Learning with Linear MDPs [18.449996575976993]
We propose a primal dual algorithm for offline RL with linear MDPs in the infinite-horizon discounted setting.
Our algorithm is the first computationally efficient algorithm in this setting that achieves sample complexity of $O(epsilon-2)$ with partial data coverage assumption.
We extend our algorithm to work in the offline constrained RL setting that enforces constraints on additional reward signals.
arXiv Detail & Related papers (2024-02-07T00:33:11Z) - Online non-parametric likelihood-ratio estimation by Pearson-divergence
functional minimization [55.98760097296213]
We introduce a new framework for online non-parametric LRE (OLRE) for the setting where pairs of iid observations $(x_t sim p, x'_t sim q)$ are observed over time.
We provide theoretical guarantees for the performance of the OLRE method along with empirical validation in synthetic experiments.
arXiv Detail & Related papers (2023-11-03T13:20:11Z) - Low-Switching Policy Gradient with Exploration via Online Sensitivity
Sampling [23.989009116398208]
We design a low-switching sample-efficient policy optimization algorithm, LPO, with general non-linear function approximation.
We show that, our algorithm obtains an $varepsilon$-optimal policy with only $widetildeO(fractextpoly(d)varepsilon3)$ samples.
arXiv Detail & Related papers (2023-06-15T23:51:46Z) - A Theoretical Analysis of Optimistic Proximal Policy Optimization in
Linear Markov Decision Processes [13.466249082564213]
We propose an optimistic variant of PPO for episodic adversarial linear MDPs with full-information feedback.
Compared with existing policy-based algorithms, we achieve the state-of-the-art regret bound in both linear MDPs and adversarial linear MDPs with full information.
arXiv Detail & Related papers (2023-05-15T17:55:24Z) - Nearly Minimax Optimal Reinforcement Learning for Linear Markov Decision
Processes [80.89852729380425]
We propose the first computationally efficient algorithm that achieves the nearly minimax optimal regret $tilde O(dsqrtH3K)$.
Our work provides a complete answer to optimal RL with linear MDPs, and the developed algorithm and theoretical tools may be of independent interest.
arXiv Detail & Related papers (2022-12-12T18:58:59Z) - On Instance-Dependent Bounds for Offline Reinforcement Learning with
Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $tildemathcalO(frac1K)$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $tildemathcalO(frac1K)$ bound and absolute zero sub-optimality bound respectively for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the underlineProxy variable underlinePessimistic underlinePolicy underlineOptimization (textttP3O) algorithm.
textttP3O is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - Human-in-the-loop: Provably Efficient Preference-based Reinforcement
Learning with General Function Approximation [107.54516740713969]
We study human-in-the-loop reinforcement learning (RL) with trajectory preferences.
Instead of receiving a numeric reward at each step, the agent only receives preferences over trajectory pairs from a human overseer.
We propose the first optimistic model-based algorithm for PbRL with general function approximation.
arXiv Detail & Related papers (2022-05-23T09:03:24Z) - Near-Optimal Offline Reinforcement Learning via Double Variance
Reduction [36.027428493021716]
Off-Policy Double Variance Reduction is a new variance reduction based algorithm for offline RL.
We show that OPDVR provably identifies an $epsilon$-optimal policy with $widetildeO(H2/d_mepsilon2)$ episodes of offline data.
We also show that OPDVR also achieves rate-optimal sample complexity under alternative settings.
arXiv Detail & Related papers (2021-02-02T20:47:35Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.