Conservative Offline Distributional Reinforcement Learning
- URL: http://arxiv.org/abs/2107.06106v1
- Date: Mon, 12 Jul 2021 15:38:06 GMT
- Title: Conservative Offline Distributional Reinforcement Learning
- Authors: Yecheng Jason Ma, Dinesh Jayaraman, Osbert Bastani
- Abstract summary: We propose Conservative Offline Distributional Actor Critic (CODAC) for both risk-neutral and risk-averse domains.
CODAC adapts distributional RL to the offline setting by penalizing the predicted quantiles of the return for out-of-distribution actions.
In experiments, CODAC successfully learns risk-averse policies using offline data collected purely from risk-neutral agents.
- Score: 34.95001490294207
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many reinforcement learning (RL) problems in practice are offline, learning
purely from observational data. A key challenge is how to ensure the learned
policy is safe, which requires quantifying the risk associated with different
actions. In the online setting, distributional RL algorithms do so by learning
the distribution over returns (i.e., cumulative rewards) instead of the
expected return; beyond quantifying risk, they have also been shown to learn
better representations for planning. We propose Conservative Offline
Distributional Actor Critic (CODAC), an offline RL algorithm suitable for both
risk-neutral and risk-averse domains. CODAC adapts distributional RL to the
offline setting by penalizing the predicted quantiles of the return for
out-of-distribution actions. We prove that CODAC learns a conservative return
distribution -- in particular, for finite MDPs, CODAC converges to a uniform
lower bound on the quantiles of the return distribution; our proof relies on a
novel analysis of the distributional Bellman operator. In our experiments, on
two challenging robot navigation tasks, CODAC successfully learns risk-averse
policies using offline data collected purely from risk-neutral agents.
Furthermore, CODAC is state-of-the-art on the D4RL MuJoCo benchmark in terms of
both expected and risk-sensitive performance.
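To make the penalization concrete, here is a minimal sketch of the idea in Python/PyTorch. It assumes a quantile critic with the hypothetical interface quantile_net(states, actions) -> [batch, num_quantiles]; it only illustrates the general mechanism of pushing predicted quantiles down for out-of-distribution actions and up for dataset actions, and is not the authors' released implementation.

```python
import torch

def conservative_quantile_penalty(quantile_net, states, dataset_actions,
                                  num_random=10, alpha=1.0):
    """CQL-style penalty on predicted return quantiles (illustrative sketch).

    quantile_net(states, actions) is assumed to return a tensor of shape
    [batch, num_quantiles] holding the predicted quantiles of the return.
    """
    batch_size, action_dim = dataset_actions.shape

    # Quantiles for actions that actually appear in the offline dataset.
    q_data = quantile_net(states, dataset_actions)                  # [B, N]

    # Quantiles for (likely) out-of-distribution actions, sampled uniformly.
    rand_actions = torch.rand(batch_size * num_random, action_dim) * 2.0 - 1.0
    rep_states = states.repeat_interleave(num_random, dim=0)
    q_ood = quantile_net(rep_states, rand_actions)                  # [B*R, N]
    q_ood = q_ood.view(batch_size, num_random, -1)                  # [B, R, N]

    # Soft maximum over sampled actions, pushed down relative to dataset
    # actions; adding this term to the quantile regression loss drives the
    # learned quantiles toward a conservative (lower) estimate.
    gap = torch.logsumexp(q_ood, dim=1).mean() - q_data.mean()
    return alpha * gap
```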
Related papers
- CDSA: Conservative Denoising Score-based Algorithm for Offline Reinforcement Learning [25.071018803326254]
Distribution shift is a major obstacle in offline reinforcement learning.
Previous conservative offline RL algorithms struggle to generalize to unseen actions.
We propose to use the gradient fields of the dataset density generated from a pre-trained offline RL algorithm to adjust the original actions.
arXiv Detail & Related papers (2024-06-11T17:59:29Z)
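The adjustment step in the CDSA entry above can be pictured as a few gradient-ascent steps on the log-density of dataset actions. The sketch below is hypothetical: score_model(state, action) is assumed to approximate grad_a log p(a | s), and the step size and clamping range are illustrative choices, not the paper's.

```python
import torch

def adjust_action(score_model, state, action, step_size=0.05, n_steps=5):
    """Nudge a candidate action toward the dataset's action manifold.

    score_model(state, action) is assumed to approximate the gradient of the
    log dataset density with respect to the action (the "score").
    """
    adjusted = action.clone()
    for _ in range(n_steps):
        # Gradient ascent on log p(a | s): move toward higher-density actions.
        adjusted = adjusted + step_size * score_model(state, adjusted)
        adjusted = adjusted.clamp(-1.0, 1.0)   # keep the action in a valid range
    return adjusted
```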
- Bridging Distributionally Robust Learning and Offline RL: An Approach to Mitigate Distribution Shift and Partial Data Coverage [32.578787778183546]
Offline reinforcement learning (RL) algorithms learn optimal policies using historical (offline) data.
One of the main challenges in offline RL is the distribution shift.
We propose two offline RL algorithms using the distributionally robust learning (DRL) framework.
arXiv Detail & Related papers (2023-10-27T19:19:30Z)
- Policy Evaluation in Distributional LQR [70.63903506291383]
We provide a closed-form expression of the distribution of the random return.
We show that this distribution can be approximated by a finite number of random variables.
Using the approximate return distribution, we propose a zeroth-order policy gradient algorithm for risk-averse LQR.
arXiv Detail & Related papers (2023-03-23T20:27:40Z)
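The zeroth-order policy gradient mentioned in the entry above can be illustrated generically: perturb the feedback gain in random directions, score each perturbed gain with a risk measure such as CVaR over sampled returns, and form a two-point finite-difference gradient. The sketch below assumes simple linear dynamics and is not the paper's algorithm; all names (cvar, episode_return, zeroth_order_gradient) are illustrative.

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Mean of the worst alpha-fraction of sampled returns (lower tail)."""
    k = max(1, int(alpha * len(returns)))
    return np.sort(returns)[:k].mean()

def episode_return(K, A, B, Q, R, horizon=50, rng=None):
    """Negative quadratic cost of one noisy rollout of x' = Ax + Bu + w, u = -Kx."""
    rng = rng or np.random.default_rng()
    x = rng.normal(size=A.shape[0])
    cost = 0.0
    for _ in range(horizon):
        u = -K @ x
        cost += x @ Q @ x + u @ R @ u
        x = A @ x + B @ u + 0.1 * rng.normal(size=A.shape[0])
    return -cost

def zeroth_order_gradient(objective, K, sigma=0.05, num_dirs=32):
    """Two-point zeroth-order estimate of d objective / d K."""
    grad = np.zeros_like(K)
    for _ in range(num_dirs):
        U = np.random.randn(*K.shape)
        grad += (objective(K + sigma * U) - objective(K - sigma * U)) / (2 * sigma) * U
    return grad / num_dirs

# Usage sketch: ascend the CVaR of sampled returns with the estimated gradient.
# objective = lambda K: cvar([episode_return(K, A, B, Q, R) for _ in range(64)])
# K = K + 1e-3 * zeroth_order_gradient(objective, K)
```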
- TD3 with Reverse KL Regularizer for Offline Reinforcement Learning from Mixed Datasets [118.22975463000928]
We consider an offline reinforcement learning (RL) setting where the agent needs to learn from a dataset collected by rolling out multiple behavior policies.
Two challenges arise in this setting; in particular, the optimal trade-off between optimizing the RL signal and the behavior cloning (BC) signal changes across states due to the variation in action coverage induced by the different behavior policies.
In this paper, we address both challenges by using an adaptively weighted reverse Kullback-Leibler (KL) divergence as the BC regularizer, based on the TD3 algorithm.
arXiv Detail & Related papers (2022-12-05T09:36:23Z)
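One way to picture the regularizer in the entry above is a TD3-style actor loss plus a per-state weighted reverse KL between the learned policy and an estimated behavior policy. The sketch below assumes diagonal Gaussian policies with a closed-form KL and a hypothetical weight_fn supplying the adaptive coefficient; it is not the paper's implementation.

```python
import torch

def gaussian_kl(mu_p, log_std_p, mu_b, log_std_b):
    """Reverse KL, KL(pi || behavior), for diagonal Gaussian policies, per state."""
    var_p, var_b = (2 * log_std_p).exp(), (2 * log_std_b).exp()
    kl = log_std_b - log_std_p + (var_p + (mu_p - mu_b) ** 2) / (2 * var_b) - 0.5
    return kl.sum(dim=-1)                        # [batch]

def actor_loss(critic, policy, behavior, weight_fn, states):
    """TD3-style actor objective with a per-state weighted reverse-KL regularizer.

    policy(states) and behavior(states) are assumed to return (mean, log_std);
    weight_fn(states) returns the adaptive per-state trade-off coefficient.
    """
    mu_p, log_std_p = policy(states)
    mu_b, log_std_b = behavior(states)           # e.g. a cloned behavior policy

    q_value = critic(states, mu_p)               # TD3 actor evaluates the mean action
    kl = gaussian_kl(mu_p, log_std_p, mu_b, log_std_b)
    lam = weight_fn(states)                      # larger where data coverage is narrow

    return (-q_value.squeeze(-1) + lam * kl).mean()
```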
- Wall Street Tree Search: Risk-Aware Planning for Offline Reinforcement Learning [8.089234432461804]
Offline reinforcement-learning (RL) algorithms learn to make decisions using a given, fixed training dataset without the possibility of additional online data collection.
This problem setting is captivating because it holds the promise of utilizing previously collected datasets without any costly or risky interaction with the environment.
We present a simple-yet-highly-effective risk-aware planning algorithm for offline RL.
arXiv Detail & Related papers (2022-11-06T07:42:24Z)
- Offline RL With Realistic Datasets: Heteroskedasticity and Support Constraints [82.43359506154117]
We show that typical offline reinforcement learning methods fail to learn from data with non-uniform variability.
Our method is simple, theoretically motivated, and improves performance across a wide range of offline RL problems in Atari games, navigation, and pixel-based manipulation.
arXiv Detail & Related papers (2022-11-02T11:36:06Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
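For the rebalancing idea in the entry above, one simple rendering is to resample episodes with probability increasing in episode return; since only existing episodes are reweighted, the support of the data distribution is unchanged. The softmax weighting and temperature below are illustrative assumptions, not the authors' code.

```python
import numpy as np

def return_weighted_indices(episode_returns, num_samples, temperature=1.0, rng=None):
    """Sample episode indices with probability increasing in episode return.

    Because sampling only reweights existing episodes, the support of the
    dataset distribution is unchanged; only its density is rebalanced.
    """
    rng = rng or np.random.default_rng()
    returns = np.asarray(episode_returns, dtype=np.float64)
    z = (returns - returns.mean()) / (returns.std() + 1e-8)     # normalize scale
    probs = np.exp(z / temperature)
    probs /= probs.sum()
    return rng.choice(len(returns), size=num_samples, p=probs)
```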
- Supervised Advantage Actor-Critic for Recommender Systems [76.7066594130961]
We propose a negative sampling strategy for training the RL component and combine it with supervised sequential learning.
Based on sampled (negative) actions (items), we can calculate the "advantage" of a positive action over the average case.
We instantiate SNQN and SA2C with four state-of-the-art sequential recommendation models and conduct experiments on two real-world datasets.
arXiv Detail & Related papers (2021-11-05T12:51:15Z)
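The "advantage over the average case" in the entry above can be sketched as the Q-value of the observed (positive) item minus the mean Q-value over the positive and sampled negative items. The snippet assumes a hypothetical q_net(states) returning Q-values for all items and is only an illustration of the idea, not the paper's code.

```python
import torch

def sampled_advantage(q_net, states, positive_items, negative_items):
    """Advantage of the observed (positive) item over the average of the
    positive and sampled negative items, per state.

    q_net(states) is assumed to return Q-values of shape [batch, num_items];
    positive_items is [batch], negative_items is [batch, num_negatives].
    """
    q_all = q_net(states)                                        # [B, num_items]
    q_pos = q_all.gather(1, positive_items.unsqueeze(1))         # [B, 1]
    q_neg = q_all.gather(1, negative_items)                      # [B, num_neg]

    # "Average case" over the sampled candidate set (positive + negatives).
    baseline = torch.cat([q_pos, q_neg], dim=1).mean(dim=1, keepdim=True)
    return (q_pos - baseline).squeeze(1)                         # [B]
```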
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve the behavior regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Continuous Doubly Constrained Batch Reinforcement Learning [93.23842221189658]
We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment.
The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data.
We propose to mitigate this issue via two straightforward penalties: a policy-constraint that limits divergence from the data-collection policy and a value-constraint that discourages overly optimistic estimates.
arXiv Detail & Related papers (2021-02-18T08:54:14Z)
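The two penalties in the entry above can be pictured as extra terms on the usual actor and critic objectives: one keeps the policy close to batch actions, the other discourages value estimates that are optimistic relative to the data. The sketch below is a simplified, hypothetical rendering (the specific penalty forms and coefficients are assumptions), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def doubly_constrained_losses(critic, policy, states, dataset_actions,
                              td_target, beta_policy=1.0, beta_value=1.0):
    """Critic and actor losses with two extra penalties (illustrative sketch).

    critic(states, actions) -> [batch, 1] Q-estimates; policy(states) returns
    deterministic actions; td_target is a precomputed (detached) Bellman target.
    """
    policy_actions = policy(states)
    q_policy = critic(states, policy_actions)
    q_data = critic(states, dataset_actions)

    # Policy constraint: keep the learned policy's actions close to the batch
    # actions, limiting divergence from the data-collection (behavior) policy.
    policy_penalty = F.mse_loss(policy_actions, dataset_actions)

    # Value constraint: penalize Q-estimates of the policy's own actions that
    # exceed those of the dataset actions, i.e. overly optimistic estimates.
    value_penalty = torch.clamp(q_policy - q_data, min=0.0).mean()

    critic_loss = F.mse_loss(q_data, td_target) + beta_value * value_penalty
    actor_loss = -q_policy.mean() + beta_policy * policy_penalty
    return critic_loss, actor_loss
```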
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented (including all content) and is not responsible for any consequences arising from its use.