Efficient Online Learning with Offline Datasets for Infinite Horizon
MDPs: A Bayesian Approach
- URL: http://arxiv.org/abs/2310.11531v2
- Date: Thu, 1 Feb 2024 22:58:28 GMT
- Title: Efficient Online Learning with Offline Datasets for Infinite Horizon
MDPs: A Bayesian Approach
- Authors: Dengwang Tang, Rahul Jain, Botao Hao, Zheng Wen
- Abstract summary: We show that if the learning agent models the behavioral policy used by the expert, it can do substantially better in terms of minimizing cumulative regret.
We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
- Score: 25.77911741149966
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: In this paper, we study the problem of efficient online reinforcement
learning in the infinite horizon setting when there is an offline dataset to
start with. We assume that the offline dataset is generated by an expert but
with unknown level of competence, i.e., it is not perfect and not necessarily
using the optimal policy. We show that if the learning agent models the
behavioral policy (parameterized by a competence parameter) used by the expert,
it can do substantially better at minimizing cumulative regret than if it
ignores this structure. We establish an upper bound on the regret of the exact
informed PSRL algorithm that scales as $\tilde{O}(\sqrt{T})$. This requires a
novel prior-dependent regret analysis of Bayesian online learning algorithms
for the infinite horizon setting. We then propose the Informed RLSVI algorithm
to efficiently approximate the iPSRL algorithm.
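To make the posterior-sampling idea concrete, here is a minimal tabular sketch of the generic PSRL loop, not the paper's infinite-horizon implementation: rewards are assumed known, the posterior over transitions is a Dirichlet, and the toy MDP sizes are illustrative. An "informed" variant in the spirit of iPSRL would seed the Dirichlet counts `alpha` with transitions from the offline dataset.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, H = 3, 2, 15          # states, actions, per-epoch horizon (toy sizes)

# True (unknown) MDP: transition tensor P_true[s, a] and rewards R_true[s, a]
P_true = rng.dirichlet(np.ones(S), size=(S, A))
R_true = rng.uniform(0.0, 1.0, size=(S, A))

# Dirichlet posterior over transitions; an informed prior would seed
# these counts with the offline dataset's observed transitions.
alpha = np.ones((S, A, S))

def solve(P, R, H):
    """Finite-horizon value iteration; returns the greedy policy per step."""
    V = np.zeros(S)
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        Q = R + P @ V                 # Q[s, a] = R[s, a] + sum_s' P[s, a, s'] V[s']
        pi[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return pi

for epoch in range(20):
    # 1. Sample an MDP from the current posterior
    P_hat = np.array([[rng.dirichlet(alpha[s, a]) for a in range(A)]
                      for s in range(S)])
    # 2. Act optimally in the sampled MDP for one epoch
    pi = solve(P_hat, R_true, H)      # rewards assumed known for brevity
    s = 0
    for h in range(H):
        a = pi[h, s]
        s_next = rng.choice(S, p=P_true[s, a])
        # 3. Update the posterior with the observed transition
        alpha[s, a, s_next] += 1.0
        s = s_next
```

The only difference between vanilla and informed PSRL in this sketch is the initialization of `alpha`: a prior concentrated on the expert's transitions shrinks the posterior faster, which is where the regret improvement comes from.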
Related papers
- Online Bandit Learning with Offline Preference Data [15.799929216215672]
We propose a posterior sampling algorithm for online learning that can be warm-started with an offline dataset with noisy preference feedback.
We show that by modeling the 'competence' of the expert that generated it, we are able to use such a dataset most effectively.
arXiv Detail & Related papers (2024-06-13T20:25:52Z)
- Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
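One common realization of such a non-uniform sampling strategy (a sketch, not necessarily the paper's exact scheme) is to sample trajectories with probability proportional to an exponentiated return, so that a dataset dominated by poor trajectories still yields batches rich in good ones:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy offline dataset summarized by per-trajectory returns;
# index 3 is the single high-return trajectory.
returns = np.array([1.0, 0.5, 0.2, 5.0, 0.1])

# Return-weighted sampling: upweight good trajectories instead of
# sampling uniformly. Temperature controls how aggressive the tilt is.
temperature = 1.0
weights = np.exp((returns - returns.max()) / temperature)  # subtract max for stability
probs = weights / weights.sum()

# Indices to draw minibatches from in a standard offline RL algorithm
batch = rng.choice(len(returns), size=1000, p=probs)
```

Because the sampler only changes which indices feed the replay buffer, it drops into standard offline RL algorithms as a plug-and-play module, as the abstract describes.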
arXiv Detail & Related papers (2023-10-06T17:58:14Z)
- Bridging Imitation and Online Reinforcement Learning: An Optimistic Tale [27.02990488317357]
Given an offline demonstration dataset from an imperfect expert, what is the best way to leverage it to bootstrap online learning performance in MDPs?
We first propose an Informed Posterior Sampling-based RL (iPSRL) algorithm that uses the offline dataset, and information about the expert's behavioral policy used to generate the offline dataset.
Since this algorithm is computationally impractical, we then propose the iRLSVI algorithm that can be seen as a combination of the RLSVI algorithm for online RL, and imitation learning.
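For intuition on the RLSVI half of that combination, here is a minimal tabular sketch (illustrative sizes and noise scale, not the iRLSVI algorithm itself): backward induction fits Q-values to Bellman targets perturbed with Gaussian noise, which plays the role of posterior sampling and drives exploration.

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, H, sigma = 3, 2, 5, 0.5   # toy sizes; sigma is the perturbation scale

# Toy replay buffer of transitions (s, a, r, s') gathered online
buf = [(rng.integers(S), rng.integers(A), rng.uniform(), rng.integers(S))
       for _ in range(400)]

# Tabular RLSVI: backward induction on noise-perturbed Bellman targets.
Q = np.zeros((H + 1, S, A))
for h in reversed(range(H)):
    targets = np.zeros((S, A))
    counts = np.zeros((S, A))
    for s, a, r, s2 in buf:
        targets[s, a] += r + Q[h + 1, s2].max()
        counts[s, a] += 1
    # Least-squares fit per (s, a) is just the empirical mean here,
    # plus Gaussian noise that randomizes the resulting greedy policy.
    Q[h] = targets / np.maximum(counts, 1) + sigma * rng.normal(size=(S, A))

pi0 = Q[0].argmax(axis=1)   # randomized greedy policy at the first step
```

An imitation-learning term, as in iRLSVI, would additionally regularize the fitted Q-values (or the induced policy) toward the expert's actions in the offline dataset.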
arXiv Detail & Related papers (2023-03-20T18:16:25Z)
- Adaptive Policy Learning for Offline-to-Online Reinforcement Learning [27.80266207283246]
We consider an offline-to-online setting where the agent is first learned from the offline dataset and then trained online.
We propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data.
arXiv Detail & Related papers (2023-03-14T08:13:21Z)
- Efficient Online Reinforcement Learning with Offline Data [78.92501185886569]
We show that we can simply apply existing off-policy methods to leverage offline data when learning online.
We extensively ablate these design choices, demonstrating the key factors that most affect performance.
We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches.
arXiv Detail & Related papers (2023-02-06T17:30:22Z)
- On Instance-Dependent Bounds for Offline Reinforcement Learning with Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI)
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound, respectively, for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z)
- Discriminator-Weighted Offline Imitation Learning from Suboptimal Demonstrations [5.760034336327491]
We study the problem of offline Imitation Learning (IL), where an agent aims to learn an optimal expert behavior policy without additional online environment interactions.
We introduce an additional discriminator to distinguish expert and non-expert data.
Our proposed algorithm achieves higher returns and faster training speed compared to baseline algorithms.
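The discriminator-weighting idea can be sketched as follows; this is a toy illustration with made-up 2-D features and a plain logistic-regression discriminator, not the paper's architecture. The discriminator's expert-likeness score becomes a per-sample weight on the behavioral-cloning loss, so suboptimal demonstrations contribute less:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy features for expert (label 1) and non-expert (label 0) transitions
X_exp = rng.normal(1.0, 1.0, size=(200, 2))
X_sub = rng.normal(-1.0, 1.0, size=(200, 2))
X = np.vstack([X_exp, X_sub])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Logistic-regression discriminator trained by gradient descent
w, b = np.zeros(2), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # predicted P(expert | x)
    g = p - y                                 # gradient of the logistic loss
    w -= 0.1 * (X.T @ g) / len(y)
    b -= 0.1 * g.mean()

# Expert-likeness scores, then per-sample weights for the imitation loss
scores = 1.0 / (1.0 + np.exp(-(X @ w + b)))
bc_weights = scores / scores.mean()   # normalized importance weights
```

A downstream BC objective would then minimize `bc_weights[i] * loss(policy(s_i), a_i)` instead of the unweighted loss.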
arXiv Detail & Related papers (2022-07-20T17:29:04Z)
- When Should We Prefer Offline Reinforcement Learning Over Behavioral Cloning? [86.43517734716606]
Offline reinforcement learning (RL) algorithms can acquire effective policies by utilizing previously collected experience, without any online interaction.
Behavioral cloning (BC) algorithms mimic a subset of the dataset via supervised learning.
We show that policies trained on sufficiently noisy suboptimal data can attain better performance than even BC algorithms with expert data.
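The BC baseline in that comparison is just supervised learning on state-action pairs. A minimal sketch, assuming a hypothetical expert whose action is the argmax of a linear score over 2-D states, with BC fit by multinomial logistic regression:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy dataset: states in R^2, expert actions = argmax of a linear score
S_data = rng.normal(size=(500, 2))
W_expert = np.array([[1.0, -1.0], [-1.0, 1.0]])   # hypothetical expert
A_data = (S_data @ W_expert).argmax(axis=1)

# Behavioral cloning = supervised learning (softmax regression) on (s, a)
W = np.zeros((2, 2))
for _ in range(300):
    logits = S_data @ W
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)
    onehot = np.eye(2)[A_data]
    W -= 0.5 * S_data.T @ (probs - onehot) / len(A_data)  # cross-entropy gradient

acc = ((S_data @ W).argmax(axis=1) == A_data).mean()      # imitation accuracy
```

BC can only be as good as the data it mimics; offline RL's advantage on noisy suboptimal data, per the abstract above, comes from stitching together the good parts of bad trajectories rather than copying them wholesale.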
arXiv Detail & Related papers (2022-04-12T08:25:34Z)
- OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation [59.469401906712555]
We present an offline reinforcement learning algorithm that prevents overestimation in a more principled way.
Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy.
We show that OptiDICE performs competitively with the state-of-the-art methods.
arXiv Detail & Related papers (2021-06-21T00:43:30Z)
- Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function.
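A minimal tabular sketch of the pessimism principle (illustrative count-based penalty and toy sizes, not PEVI's linear-function-approximation quantifier): value iteration on an empirical model, with a bonus subtracted wherever the offline data is scarce so the learned policy stays on the dataset's support.

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, gamma, beta = 4, 2, 0.9, 1.0

# Empirical model from a toy offline dataset: visit counts n[s, a],
# estimated transitions P_hat, and estimated rewards r_hat
n = rng.integers(1, 50, size=(S, A)).astype(float)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))
r_hat = rng.uniform(0.0, 1.0, size=(S, A))

# Count-based uncertainty penalty: rarely observed (s, a) pairs are
# penalized, so the greedy policy avoids poorly covered actions.
bonus = beta / np.sqrt(n)

V = np.zeros(S)
for _ in range(200):
    Q = r_hat - bonus + gamma * (P_hat @ V)   # pessimistic Bellman backup
    V = Q.max(axis=1)

pi = Q.argmax(axis=1)   # pessimistic greedy policy
```

By construction the pessimistic values lower-bound what unpenalized value iteration on the same empirical model would return, which is the mechanism behind the suboptimality guarantees.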
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
- POPO: Pessimistic Offline Policy Optimization [6.122342691982727]
We study why off-policy RL methods fail to learn in offline setting from the value function view.
We propose Pessimistic Offline Policy Optimization (POPO), which learns a pessimistic value function to get a strong policy.
We find that POPO performs surprisingly well and scales to tasks with high-dimensional state and action space.
arXiv Detail & Related papers (2020-12-26T06:24:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.