Offline RL With Realistic Datasets: Heteroskedasticity and Support
Constraints
- URL: http://arxiv.org/abs/2211.01052v1
- Date: Wed, 2 Nov 2022 11:36:06 GMT
- Title: Offline RL With Realistic Datasets: Heteroskedasticity and Support
Constraints
- Authors: Anikait Singh, Aviral Kumar, Quan Vuong, Yevgen Chebotar, Sergey
Levine
- Abstract summary: We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
- Score: 82.43359506154117
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) learns policies entirely from static
datasets, thereby avoiding the challenges associated with online data
collection. Practical applications of offline RL will inevitably require
learning from datasets where the variability of demonstrated behaviors changes
non-uniformly across the state space. For example, at a red light, nearly all
human drivers behave similarly by stopping, but when merging onto a highway,
some drivers merge quickly, efficiently, and safely, while many hesitate or
merge dangerously. Both theoretically and empirically, we show that typical
offline RL methods, which are based on distribution constraints, fail to learn
from data with such non-uniform variability, due to the requirement to stay
close to the behavior policy to the same extent across the state space.
Ideally, the learned policy should be free to choose per state how closely to
follow the behavior policy to maximize long-term return, as long as the learned
policy stays within the support of the behavior policy. To instantiate this
principle, we reweight the data distribution in conservative Q-learning (CQL)
to obtain an approximate support constraint formulation. The reweighted
distribution is a mixture of the current policy and an additional policy
trained to mine poor actions that are likely under the behavior policy. Our
method, CQL (ReDS), is simple, theoretically motivated, and improves
performance across a wide range of offline RL problems in Atari games,
navigation, and pixel-based manipulation.
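To make the reweighting concrete, below is a minimal sketch of what a ReDS-style CQL regularizer could look like. It is not the authors' implementation: the names `q_net`, `policy`, and `rho` (an auxiliary distribution assumed to be trained separately to propose poor actions that remain likely under the behavior policy), the `.sample(obs)` API, and the mixture weight are all illustrative assumptions.
```python
import torch


def cql_reds_regularizer(q_net, obs, data_actions, policy, rho, mix=0.5):
    """Illustrative sketch of a ReDS-style CQL regularizer.

    Q-values are pushed down under a reweighted distribution: a mixture of
    the current policy `policy` and an auxiliary distribution `rho`, which
    is assumed to be trained elsewhere to propose poor actions that are
    still likely under the behavior policy. Q-values of dataset actions are
    pushed up, as in standard CQL.
    """
    with torch.no_grad():
        pi_actions = policy.sample(obs)   # assumed sampler API
        rho_actions = rho.sample(obs)     # assumed sampler API

    # Q-values under the mixture distribution: these get pushed down.
    q_mix = mix * q_net(obs, pi_actions) + (1.0 - mix) * q_net(obs, rho_actions)

    # Q-values of actions actually in the dataset: these get pushed up.
    q_data = q_net(obs, data_actions)

    # Added to the usual Bellman error with a conservatism multiplier.
    return (q_mix - q_data).mean()
```
In the paper, the mixture and the training objective for the action-proposal distribution follow from the support-constraint derivation; the sketch only shows the overall shape of the reweighted term.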
Related papers
- Policy Regularization with Dataset Constraint for Offline Reinforcement
Learning [27.868687398300658]
We consider the problem of learning the best possible policy from a fixed dataset, known as offline reinforcement learning (RL).
In this paper, we find that regularizing the policy towards the nearest state-action pair can be more effective and thus propose Policy Regularization with Dataset Constraint (PRDC).
PRDC can guide the policy with proper behaviors from the dataset while still allowing it to choose actions that do not appear in the dataset for a given state.
arXiv Detail & Related papers (2023-06-11T03:02:10Z)
- Offline Imitation Learning with Suboptimal Demonstrations via Relaxed Distribution Matching [109.5084863685397]
Offline imitation learning (IL) promises the ability to learn performant policies from pre-collected demonstrations without interactions with the environment.
We present RelaxDICE, which employs an asymmetrically-relaxed f-divergence for explicit support regularization.
Our method significantly outperforms the best prior offline method in six standard continuous control environments.
arXiv Detail & Related papers (2023-03-05T03:35:11Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time (a minimal sketch of the resampling idea appears after this list).
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Regularizing a Model-based Policy Stationary Distribution to Stabilize Offline Reinforcement Learning [62.19209005400561]
Offline reinforcement learning (RL) extends the paradigm of classical RL algorithms to purely learning from static datasets.
A key challenge of offline RL is the instability of policy training, caused by the mismatch between the distribution of the offline data and the undiscounted stationary state-action distribution of the learned policy.
We regularize the undiscounted stationary distribution of the current policy towards the offline data during the policy optimization process.
arXiv Detail & Related papers (2022-06-14T20:56:16Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interactions with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience picking strategy for imitating from adaptive neighboring policies with a higher return.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning a mediocre behavior on mixed datasets but is even competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- Constraints Penalized Q-Learning for Safe Offline Reinforcement Learning [15.841609263723575]
We study the problem of safe offline reinforcement learning (RL).
The goal is to learn a policy that maximizes long-term reward while satisfying safety constraints given only offline data, without further interaction with the environment.
We show that naive approaches that combine techniques from safe RL and offline RL can only learn sub-optimal solutions.
arXiv Detail & Related papers (2021-07-19T16:30:14Z)
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by training them on rewards artificially penalized by the uncertainty of the dynamics model, as sketched below.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
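For the MOPO entry above, the key mechanism is to penalize model-generated rewards by an uncertainty estimate of the learned dynamics. A minimal sketch follows, under the assumption that uncertainty is measured as the prediction spread of a dynamics-model ensemble; the penalty weight `lam` and the spread measure are illustrative, not the paper's exact estimator.
```python
import numpy as np


def penalized_reward(reward, next_state_preds, lam=1.0):
    """MOPO-style penalty sketch: r_tilde = r - lam * u(s, a).

    `next_state_preds` is assumed to have shape (ensemble_size, state_dim),
    holding next-state predictions from an ensemble of learned dynamics
    models; the spread across members serves as the uncertainty u(s, a).
    """
    uncertainty = float(np.max(np.std(next_state_preds, axis=0)))
    return reward - lam * uncertainty
```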
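The "Boosting Offline Reinforcement Learning via Data Rebalancing" entry describes resampling the dataset with weights tied to trajectory return. A minimal sketch of that idea is below; the softmax-over-returns weighting, the `temperature` parameter, and the per-step `"reward"` field are illustrative assumptions rather than the paper's exact scheme.
```python
import numpy as np


def resample_by_return(trajectories, temperature=1.0, rng=None):
    """Return-based data rebalance sketch.

    Resamples trajectories so that higher-return ones appear more often
    while the support of the dataset is unchanged (every sampled trajectory
    already exists in the data). Each trajectory is assumed to be a list of
    step dicts containing a "reward" key.
    """
    rng = rng or np.random.default_rng()
    returns = np.array([sum(step["reward"] for step in traj) for traj in trajectories])
    # Shift by the max and scale by temperature for a numerically stable softmax.
    z = (returns - returns.max()) / max(temperature, 1e-8)
    probs = np.exp(z) / np.exp(z).sum()
    idx = rng.choice(len(trajectories), size=len(trajectories), replace=True, p=probs)
    return [trajectories[i] for i in idx]
```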