DCE: Offline Reinforcement Learning With Double Conservative Estimates
- URL: http://arxiv.org/abs/2209.13132v1
- Date: Tue, 27 Sep 2022 03:34:19 GMT
- Title: DCE: Offline Reinforcement Learning With Double Conservative Estimates
- Authors: Chen Zhao, Kai Xing Huang, Chun Yuan
- Abstract summary: We propose a simple conservative estimation method, double conservative estimates (DCE).
Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation.
Our experiments show separately how the two conservative estimation methods affect the estimates of all state-action pairs.
- Score: 20.48354991493888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning has attracted much interest as a way to address the application challenges of traditional reinforcement learning: it trains agents from previously collected datasets without any environment interaction. To address the overestimation of OOD (out-of-distribution) actions, conservative estimation assigns low values to all inputs. Previous conservative estimation methods usually struggle to avoid the impact of OOD actions on Q-value estimates, and they typically sacrifice some computational efficiency to achieve conservative estimation. In this paper, we propose a simple conservative estimation method, double conservative estimates (DCE), which uses two conservative estimation methods to constrain the policy. Our algorithm introduces a V-function to avoid errors on in-distribution actions while implicitly achieving conservative estimation. In addition, our algorithm uses a controllable penalty term to change the degree of conservatism during training. We theoretically show how this method influences the estimation of OOD and in-distribution actions, and our experiments show separately how the two conservative estimation methods affect the estimates of all state-action pairs. DCE achieves state-of-the-art performance on D4RL.
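The abstract describes DCE only at a high level, so the following is a minimal, hypothetical sketch of what a "double conservative" critic update could look like: a CQL-style penalty that lowers Q-values on policy actions, combined with a V-function fitted only on dataset actions (so OOD actions are never queried) and used as the bootstrap target, with a coefficient beta controlling the degree of conservatism. The class name, loss composition, and hyperparameters are illustrative assumptions, not the authors' released code.
```python
# Hypothetical sketch of a "double conservative" critic update, inferred only
# from the abstract above: (1) a CQL-style penalty pushes Q down on policy
# actions and up on dataset actions, and (2) the Bellman target bootstraps
# from a V-function fitted purely on dataset actions, so out-of-distribution
# actions are never evaluated. All names and weights are assumptions.
import torch
import torch.nn as nn


def make_mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class DoubleConservativeCritic:
    def __init__(self, state_dim, action_dim, beta=1.0, gamma=0.99):
        self.q = make_mlp(state_dim + action_dim, 1)
        self.v = make_mlp(state_dim, 1)
        self.beta = beta      # controllable penalty weight (degree of conservatism)
        self.gamma = gamma
        self.opt = torch.optim.Adam(
            list(self.q.parameters()) + list(self.v.parameters()), lr=3e-4)

    def update(self, s, a, r, s_next, done, policy_action):
        # policy_action: actions sampled from the current policy (detached).
        q_sa = self.q(torch.cat([s, a], dim=-1))

        # V regresses toward Q evaluated only at dataset actions,
        # so it never sees out-of-distribution actions.
        v_loss = ((self.v(s) - q_sa.detach()) ** 2).mean()

        # Bellman backup bootstraps from V(s') instead of max_a' Q(s', a').
        target = r + self.gamma * (1.0 - done) * self.v(s_next).detach()
        td_loss = ((q_sa - target) ** 2).mean()

        # CQL-style penalty: lower Q on (possibly OOD) policy actions and
        # raise it on dataset actions; beta tunes how conservative this is.
        q_pi = self.q(torch.cat([s, policy_action], dim=-1))
        penalty = q_pi.mean() - q_sa.mean()

        loss = td_loss + v_loss + self.beta * penalty
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()
        return loss.item()
```
In a full agent this critic would be paired with a policy-improvement step (for example, advantage-weighted regression against V), which is omitted here.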
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns a conservative V-function by directly imposing a penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic performs the conservative value estimation by additionally sampling and penalizing the states around the dataset (a rough sketch of this idea appears after this list).
We evaluate on the classic continuous control tasks of D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
arXiv Detail & Related papers (2023-02-14T08:13:55Z)
- Confidence-Conditioned Value Functions for Offline Reinforcement Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z)
- Pessimistic Bootstrapping for Uncertainty-Driven Offline Reinforcement Learning [125.8224674893018]
Offline Reinforcement Learning (RL) aims to learn policies from previously collected datasets without exploring the environment.
Applying off-policy algorithms to offline RL usually fails due to the extrapolation error caused by the out-of-distribution (OOD) actions.
We propose Pessimistic Bootstrapping for offline RL (PBRL), a purely uncertainty-driven offline algorithm without explicit policy constraints.
arXiv Detail & Related papers (2022-02-23T15:27:16Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- Reducing Conservativeness Oriented Offline Reinforcement Learning [29.895142928565228]
In offline reinforcement learning, a policy learns to maximize cumulative rewards with a fixed collection of data.
We propose a method oriented toward reducing conservativeness in offline reinforcement learning.
Our proposed method is able to tackle the skewed distribution of the provided dataset and derive a value function closer to the expected value function.
arXiv Detail & Related papers (2021-02-27T01:21:01Z)
- CoinDICE: Off-Policy Confidence Interval Estimation [107.86876722777535]
We study high-confidence behavior-agnostic off-policy evaluation in reinforcement learning.
We show in a variety of benchmarks that the confidence interval estimates are tighter and more accurate than existing methods.
arXiv Detail & Related papers (2020-10-22T12:39:11Z)
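As referenced in the CSVE entry above, here is a rough sketch of a conservative state-value update that penalizes values at states sampled around the dataset. The Gaussian-perturbation sampling, network sizes, and loss weighting are assumptions made for illustration, not the paper's exact formulation.
```python
# Rough sketch of a CSVE-style conservative V-update: fit V to TD targets on
# dataset states while pushing V down on states sampled "around" the dataset,
# approximated here by adding Gaussian noise to dataset states. Details are
# illustrative assumptions, not the paper's exact algorithm.
import torch
import torch.nn as nn

STATE_DIM = 17  # e.g. a MuJoCo locomotion task; an assumption for the sketch

value_net = nn.Sequential(
    nn.Linear(STATE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, 1),
)
optimizer = torch.optim.Adam(value_net.parameters(), lr=3e-4)


def csve_style_update(states, td_targets, penalty_weight=1.0, noise_std=0.1):
    """One conservative V-update on a batch of dataset states."""
    v_data = value_net(states)
    fit_loss = ((v_data - td_targets) ** 2).mean()

    # Sample states near the dataset and penalize their values, while the
    # fit term keeps values on dataset states anchored to their targets.
    perturbed = states + noise_std * torch.randn_like(states)
    v_near = value_net(perturbed)
    penalty = v_near.mean() - v_data.mean()

    loss = fit_loss + penalty_weight * penalty
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
Penalizing states rather than actions is what distinguishes this family of methods from CQL-style action penalties: the conservatism is applied where the dataset provides no state coverage, instead of where the policy proposes unseen actions.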