Conservative State Value Estimation for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2302.06884v2
- Date: Sat, 2 Dec 2023 14:08:25 GMT
- Title: Conservative State Value Estimation for Offline Reinforcement Learning
- Authors: Liting Chen, Jie Yan, Zhengdao Shao, Lu Wang, Qingwei Lin, Saravan
Rajmohan, Thomas Moscibroda and Dongmei Zhang
- Abstract summary: Conservative State Value Estimation (CSVE) learns a conservative V-function by directly imposing a penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic performs the conservative value estimation by additionally sampling and penalizing the states around the dataset.
We evaluate on classic continuous control tasks from D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive among recent SOTA methods.
- Score: 36.416504941791224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning faces a significant challenge of value
over-estimation due to the distributional drift between the dataset and the
current learned policy, leading to learning failure in practice. The common
approach is to incorporate a penalty term to reward or value estimation in the
Bellman iterations. Meanwhile, to avoid extrapolation on out-of-distribution
(OOD) states and actions, existing methods focus on conservative Q-function
estimation. In this paper, we propose Conservative State Value Estimation
(CSVE), a new approach that learns a conservative V-function by directly
imposing a penalty on OOD states. Compared to prior work, CSVE allows more
effective state value estimation with conservative guarantees and, in turn,
better policy optimization. Building on CSVE, we develop a practical
actor-critic algorithm in which the critic performs the conservative value
estimation by additionally sampling and penalizing the states around the
dataset, and the actor applies advantage-weighted updates extended with state
exploration to improve the policy. We evaluate on classic continuous control
tasks from D4RL, showing that our method performs better than conservative
Q-function learning methods and is strongly competitive among recent SOTA
methods.
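A minimal sketch of the two pieces the abstract describes, a critic that additionally penalizes the V-function at states sampled around the dataset and an actor trained with advantage-weighted regression, is given below. This is an illustration of the idea rather than the authors' implementation: the perturbation scale sigma, penalty weight beta, and temperature lam are assumed hyperparameters, the policy head is kept deterministic for brevity, and the paper's state-exploration extension of the actor update is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative CSVE-style losses (a sketch under assumptions, not the authors' code).

def make_mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class CSVESketch:
    def __init__(self, obs_dim, act_dim, sigma=0.1, beta=1.0, lam=3.0, gamma=0.99):
        self.v = make_mlp(obs_dim, 1)            # state-value critic V(s)
        self.actor = make_mlp(obs_dim, act_dim)  # deterministic policy head, for brevity
        self.sigma, self.beta, self.lam, self.gamma = sigma, beta, lam, gamma

    def critic_loss(self, s, r, s_next, done):
        # TD regression on in-dataset transitions.
        with torch.no_grad():
            target = r + self.gamma * (1.0 - done) * self.v(s_next).squeeze(-1)
        td_loss = F.mse_loss(self.v(s).squeeze(-1), target)

        # Conservative term: push V down on states sampled around the dataset
        # (Gaussian perturbations of dataset states) and up on dataset states.
        s_near = s + self.sigma * torch.randn_like(s)
        penalty = self.v(s_near).mean() - self.v(s).mean()
        return td_loss + self.beta * penalty

    def actor_loss(self, s, a, r, s_next, done):
        # Advantage-weighted regression toward dataset actions, with the
        # advantage estimated from the conservative V-function.
        with torch.no_grad():
            adv = (r + self.gamma * (1.0 - done) * self.v(s_next).squeeze(-1)
                   - self.v(s).squeeze(-1))
            w = torch.exp(adv / self.lam).clamp(max=100.0)
        return (w * ((self.actor(s) - a) ** 2).sum(-1)).mean()
```

A training loop would alternate gradient steps on critic_loss and actor_loss over minibatches of dataset transitions (s, a, r, s_next, done).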
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z)
- Confidence-Conditioned Value Functions for Offline Reinforcement Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z)
- DCE: Offline Reinforcement Learning With Double Conservative Estimates [20.48354991493888]
We propose a simple conservative estimation method, double conservative estimates (DCE).
Our algorithm introduces a V-function to avoid errors from in-distribution actions while implicitly achieving conservative estimation.
Our experiments separately show how the two conservative estimation methods impact the value estimates of all state-action pairs.
arXiv Detail & Related papers (2022-09-27T03:34:19Z)
- Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset (see the expectile-regression sketch after this list).
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z)
- Variance-Aware Off-Policy Evaluation with Linear Function Approximation [85.75516599931632]
We study the off-policy evaluation problem in reinforcement learning with linear function approximation.
We propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration (a variance-weighted residual is sketched after this list).
arXiv Detail & Related papers (2021-06-22T17:58:46Z)
- Reducing Conservativeness Oriented Offline Reinforcement Learning [29.895142928565228]
In offline reinforcement learning, a policy learns to maximize cumulative rewards with a fixed collection of data.
We propose a reducing-conservativeness-oriented offline reinforcement learning method.
Our proposed method is able to tackle the skewed distribution of the provided dataset and derive a value function closer to the expected value function.
arXiv Detail & Related papers (2021-02-27T01:21:01Z)
- Minimax-Optimal Off-Policy Evaluation with Linear Function Approximation [49.502277468627035]
This paper studies the statistical theory of batch data reinforcement learning with function approximation.
Consider the off-policy evaluation problem, which is to estimate the cumulative value of a new target policy from logged history.
arXiv Detail & Related papers (2020-02-21T19:20:57Z)
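For the Implicit Q-Learning entry above, the mechanism by which the method avoids evaluating actions outside the dataset is (per the IQL paper) expectile regression: the state-value function is fit to an upper expectile of Q over dataset actions only. The snippet below is a minimal sketch of that loss; the networks q_net and v_net and the expectile tau are illustrative placeholders.

```python
import torch

def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss: diff = Q(s, a_dataset) - V(s).
    # tau > 0.5 weights positive residuals (Q above V) more heavily,
    # so V is pushed toward an upper expectile of Q over dataset actions.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()

# Usage sketch (q_net and v_net are hypothetical networks): both sides are
# evaluated only on state-action pairs drawn from the offline dataset, so no
# unseen action is ever queried.
#   diff = q_net(states, dataset_actions).detach() - v_net(states)
#   v_loss = expectile_loss(diff, tau=0.7)
```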
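For the Variance-Aware Off-Policy Evaluation entry above, the summary states that the Bellman residual in Fitted Q-Iteration is reweighted by an estimated variance of the value function. The sketch below shows only that reweighting idea with inverse-variance weights; the variance estimate var_hat and the next-state values under the target policy are assumed to come from the rest of the algorithm, and the actual VA-OPE estimator uses linear function approximation with its own variance estimator.

```python
import torch

def variance_weighted_residual_loss(q_pred, rewards, q_next_target, var_hat,
                                    gamma=0.99, eps=1e-6):
    # Bellman target for policy evaluation, then a squared residual that is
    # down-weighted where the estimated variance of the target is high.
    target = rewards + gamma * q_next_target
    residual = q_pred - target
    weights = 1.0 / (var_hat + eps)   # inverse-variance weighting (illustrative)
    return (weights * residual.pow(2)).mean()
```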