Confidence-Conditioned Value Functions for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2212.04607v2
- Date: Mon, 30 Oct 2023 04:57:27 GMT
- Title: Confidence-Conditioned Value Functions for Offline Reinforcement
Learning
- Authors: Joey Hong and Aviral Kumar and Sergey Levine
- Abstract summary: We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
- Score: 86.59173545987984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) promises the ability to learn effective
policies solely using existing, static datasets, without any costly online
interaction. To do so, offline RL methods must handle distributional shift
between the dataset and the learned policy. The most common approach is to
learn conservative, or lower-bound, value functions, which underestimate the
return of out-of-distribution (OOD) actions. However, such methods exhibit one
notable drawback: policies optimized on such value functions can only behave
according to a fixed, possibly suboptimal, degree of conservatism. This drawback
can be alleviated if we are instead able to learn policies for varying
degrees of conservatism at training time and devise a method to dynamically
choose one of them during evaluation. To this end, in this work, we propose
learning value functions that additionally condition on the degree of
conservatism, which we dub confidence-conditioned value functions. We derive a
new form of a Bellman backup that simultaneously learns Q-values for any degree
of confidence with high probability. By conditioning on confidence, our value
functions enable adaptive strategies during online evaluation by controlling
the confidence level using the history of observations thus far. This approach
can be implemented in practice by conditioning the Q-function from existing
conservative algorithms on the confidence. We theoretically show that our
learned value functions produce conservative estimates of the true value at any
desired confidence. Finally, we empirically show that our algorithm outperforms
existing conservative offline RL algorithms on multiple discrete control
domains.
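The abstract notes that the approach can be implemented by conditioning the Q-function of an existing conservative algorithm on the confidence level. As a rough illustration only, and not the authors' implementation, the sketch below conditions a discrete-action Q-network on a confidence input delta and scales a CQL-style conservatism penalty by that confidence; the class and function names, the way delta is sampled during training, and the delta-to-penalty scaling are all assumptions made for this example.

```python
# Hypothetical sketch (not the paper's code): a Q-network that takes a
# confidence level as an extra input, trained with a Bellman error plus a
# CQL-style penalty whose weight grows with the requested confidence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceConditionedQ(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        # The confidence level delta in (0, 1) is appended to the observation.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim), delta: (batch, 1) -> Q-values: (batch, num_actions)
        return self.net(torch.cat([obs, delta], dim=-1))

def confidence_conditioned_loss(q_net, target_q_net, batch, gamma: float = 0.99):
    obs, act, rew, next_obs, done = batch
    # Sample a confidence level per transition so one network is trained for
    # all confidence levels at once (an assumption about the training scheme).
    delta = torch.rand(obs.shape[0], 1)

    q_all = q_net(obs, delta)
    q_data = q_all.gather(1, act.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_v = target_q_net(next_obs, delta).max(dim=1).values
        target = rew + gamma * (1.0 - done) * next_v
    bellman = F.mse_loss(q_data, target)

    # CQL-style conservatism: push down a soft maximum of Q over all actions
    # and push up the dataset action, weighted more for higher confidence.
    penalty = torch.logsumexp(q_all, dim=1) - q_data
    return bellman + (delta.squeeze(1) * penalty).mean()
```

At evaluation time, the adaptive strategy described in the abstract would correspond to querying the same network at different confidence levels and choosing the level based on the history of observations; that selection rule is not part of this sketch.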
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z) - Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning [38.48360240082561]
We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline reinforcement learning.
We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark.
arXiv Detail & Related papers (2024-04-06T17:02:18Z) - Vlearn: Off-Policy Learning with Efficient State-Value Function Estimation [22.129001951441015]
Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation.
This reliance results in data inefficiency as maintaining a state-action-value function in high-dimensional action spaces is challenging.
We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning.
arXiv Detail & Related papers (2024-03-07T12:45:51Z) - Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns conservative V-function via directly imposing penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset.
We evaluate on classic continuous control tasks from D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
arXiv Detail & Related papers (2023-02-14T08:13:55Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
arXiv Detail & Related papers (2022-03-24T01:04:17Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
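Since the abstract describes implementing the approach on top of existing conservative algorithms, the CQL entry above is the most directly relevant baseline. For reference, the snippet below is a minimal sketch of a discrete-action CQL-style regularizer (a soft maximum of Q over all actions minus the Q-value of the dataset action), which would be added to the standard Bellman error with a penalty weight; the function name is hypothetical and this is not the authors' code.

```python
# Minimal sketch of a discrete-action CQL-style regularizer: push down a soft
# maximum of Q over all actions while pushing up the action taken in the data.
import torch

def cql_regularizer(q_values: torch.Tensor, dataset_actions: torch.Tensor) -> torch.Tensor:
    """q_values: (batch, num_actions); dataset_actions: (batch,) integer action indices."""
    soft_max_q = torch.logsumexp(q_values, dim=1)                                 # (batch,)
    data_q = q_values.gather(1, dataset_actions.long().unsqueeze(1)).squeeze(1)   # (batch,)
    # Added to the Bellman error, scaled by a penalty coefficient, to yield the
    # lower-bound (conservative) value estimates described in the entry above.
    return (soft_max_q - data_q).mean()
```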