MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning
from Observations
- URL: http://arxiv.org/abs/2303.17156v2
- Date: Sun, 6 Aug 2023 18:41:26 GMT
- Title: MAHALO: Unifying Offline Reinforcement Learning and Imitation Learning
from Observations
- Authors: Anqi Li, Byron Boots, Ching-An Cheng
- Abstract summary: We study a new paradigm for sequential decision making, called offline policy learning from observations (PLfO).
We present a generic approach to offline PLfO, called $\textbf{M}$odality-agnostic $\textbf{A}$dversarial $\textbf{H}$ypothesis $\textbf{A}$daptation for $\textbf{L}$earning from $\textbf{O}$bservations (MAHALO).
- Score: 43.9636309593499
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We study a new paradigm for sequential decision making, called offline policy
learning from observations (PLfO). Offline PLfO aims to learn policies using
datasets with substandard qualities: 1) only a subset of trajectories is
labeled with rewards, 2) labeled trajectories may not contain actions, 3)
labeled trajectories may not be of high quality, and 4) the data may not have
full coverage. Such imperfection is common in real-world learning scenarios,
and offline PLfO encompasses many existing offline learning setups, including
offline imitation learning (IL), offline IL from observations (ILfO), and
offline reinforcement learning (RL). In this work, we present a generic
approach to offline PLfO, called $\textbf{M}$odality-agnostic
$\textbf{A}$dversarial $\textbf{H}$ypothesis $\textbf{A}$daptation for
$\textbf{L}$earning from $\textbf{O}$bservations (MAHALO). Built upon the
pessimism concept in offline RL, MAHALO optimizes the policy using a
performance lower bound that accounts for uncertainty due to the dataset's
insufficient coverage. We implement this idea by adversarially training
data-consistent critic and reward functions, which forces the learned policy to
be robust to data deficiency. We show that MAHALO consistently outperforms or
matches specialized algorithms across a variety of offline PLfO tasks in theory
and experiments. Our code is available at https://github.com/AnqiLi/mahalo.
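The abstract describes MAHALO as adversarially training data-consistent critic and reward functions so that the policy is optimized against a performance lower bound. A minimal schematic of that min-max template follows, in illustrative notation: the hypothesis classes $\mathcal{R}_{\mathcal{D}}$ and $\mathcal{F}_{\mathcal{D}}$ and the objective $\widehat{J}$ are placeholders, not the paper's exact definitions.
$$
\hat{\pi} \in \operatorname*{argmax}_{\pi \in \Pi} \; \min_{r \in \mathcal{R}_{\mathcal{D}},\; f \in \mathcal{F}_{\mathcal{D}}} \; \widehat{J}(\pi;\, r, f)
$$
Here $\mathcal{R}_{\mathcal{D}}$ and $\mathcal{F}_{\mathcal{D}}$ denote reward and critic hypotheses consistent with the (partially labeled, possibly action-free) dataset $\mathcal{D}$, and $\widehat{J}(\pi; r, f)$ estimates the performance of $\pi$ under the hypothesis pair $(r, f)$. Taking the inner minimum over all data-consistent hypotheses yields a pessimistic lower bound on performance, which is what forces the learned policy to be robust to the dataset's insufficient coverage.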
Related papers
- Augmenting Online RL with Offline Data is All You Need: A Unified Hybrid RL Algorithm Design and Analysis [18.323002218335215]
This paper investigates a hybrid learning framework for reinforcement learning (RL) in which the agent can leverage both an offline dataset and online interactions to learn the optimal policy.
We present a unified algorithm and analysis and show that augmenting confidence-based online RL algorithms with the offline dataset outperforms any pure online or offline algorithm alone.
arXiv Detail & Related papers (2025-05-19T22:58:54Z) - Efficient Online Learning with Offline Datasets for Infinite Horizon
MDPs: A Bayesian Approach [25.77911741149966]
We show that if the learning agent models the behavioral policy used by the expert, it can do substantially better in terms of minimizing cumulative regret.
We then propose the Informed RLSVI algorithm to efficiently approximate the iPSRL algorithm.
arXiv Detail & Related papers (2023-10-17T19:01:08Z) - Offline Retraining for Online RL: Decoupled Policy Learning to Mitigate
Exploration Bias [96.14064037614942]
Offline retraining, a policy extraction step at the end of online fine-tuning, is proposed.
An optimistic (exploration) policy is used to interact with the environment, and a separate pessimistic (exploitation) policy is trained on all the observed data for evaluation.
arXiv Detail & Related papers (2023-10-12T17:50:09Z) - Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced
Datasets [53.8218145723718]
Offline policy learning aims to learn decision-making policies from existing datasets of trajectories without collecting additional data.
We argue that when a dataset is dominated by suboptimal trajectories, state-of-the-art offline RL algorithms do not substantially improve over the average return of trajectories in the dataset.
We present a realization of the sampling strategy and an algorithm that can be used as a plug-and-play module in standard offline RL algorithms.
arXiv Detail & Related papers (2023-10-06T17:58:14Z) - Adaptive Policy Learning for Offline-to-Online Reinforcement Learning [27.80266207283246]
We consider an offline-to-online setting where the agent is first learned from the offline dataset and then trained online.
We propose a framework called Adaptive Policy Learning for effectively taking advantage of offline and online data.
arXiv Detail & Related papers (2023-03-14T08:13:21Z) - Benchmarks and Algorithms for Offline Preference-Based Reward Learning [41.676208473752425]
We propose an approach that uses an offline dataset to craft preference queries via pool-based active learning.
Our proposed approach does not require actual physical rollouts or an accurate simulator for either the reward learning or policy optimization steps.
arXiv Detail & Related papers (2023-01-03T23:52:16Z) - On Instance-Dependent Bounds for Offline Reinforcement Learning with
Linear Function Approximation [80.86358123230757]
We present an algorithm called Bootstrapped and Constrained Pessimistic Value Iteration (BCP-VI).
Under a partial data coverage assumption, BCP-VI yields a fast rate of $\tilde{\mathcal{O}}(\frac{1}{K})$ for offline RL when there is a positive gap in the optimal Q-value functions.
These are the first $\tilde{\mathcal{O}}(\frac{1}{K})$ bound and absolute zero sub-optimality bound, respectively, for offline RL with linear function approximation from adaptive data.
arXiv Detail & Related papers (2022-11-23T18:50:44Z) - Pessimism in the Face of Confounders: Provably Efficient Offline Reinforcement Learning in Partially Observable Markov Decision Processes [99.26864533035454]
We study offline reinforcement learning (RL) in partially observable Markov decision processes.
We propose the $\underline{P}$roxy variable $\underline{P}$essimistic $\underline{P}$olicy $\underline{O}$ptimization ($\texttt{P3O}$) algorithm.
$\texttt{P3O}$ is the first provably efficient offline RL algorithm for POMDPs with a confounded dataset.
arXiv Detail & Related papers (2022-05-26T19:13:55Z) - Combining Online Learning and Offline Learning for Contextual Bandits
with Deficient Support [53.11601029040302]
Current offline-policy learning algorithms are mostly based on inverse propensity score (IPS) weighting.
We propose a novel approach that uses a hybrid of offline learning with online exploration.
Our approach determines an optimal policy with theoretical guarantees using the minimal number of online explorations.
arXiv Detail & Related papers (2021-07-24T05:07:43Z) - Representation Matters: Offline Pretraining for Sequential Decision
Making [27.74988221252854]
In this paper, we consider a slightly different approach to incorporating offline data into sequential decision-making.
We find that the use of pretraining with unsupervised learning objectives can dramatically improve the performance of policy learning algorithms.
arXiv Detail & Related papers (2021-02-11T02:38:12Z) - Is Pessimism Provably Efficient for Offline RL? [104.00628430454479]
We study offline reinforcement learning (RL), which aims to learn an optimal policy based on a dataset collected a priori.
We propose a pessimistic variant of the value iteration algorithm (PEVI), which incorporates an uncertainty quantifier as the penalty function (a schematic of this penalized update is sketched after this list).
arXiv Detail & Related papers (2020-12-30T09:06:57Z)
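As a concrete illustration of the pessimism principle shared by MAHALO and PEVI, the following is a hedged sketch of a penalized Bellman backup in simplified, illustrative notation; the exact truncation and the construction of the uncertainty quantifier $\Gamma_h$ are defined in the original paper and are not reproduced here.
$$
\widehat{Q}_h(s,a) = \max\Big\{ \widehat{r}_h(s,a) + \big(\widehat{\mathbb{P}}_h \widehat{V}_{h+1}\big)(s,a) - \Gamma_h(s,a),\; 0 \Big\}, \qquad \widehat{V}_h(s) = \max_{a} \widehat{Q}_h(s,a)
$$
The penalty $\Gamma_h(s,a)$ is large for state-action pairs that the offline dataset covers poorly, so $\widehat{Q}_h$ lower-bounds the true value and the greedy policy is steered away from regions the data cannot certify.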