Confidence-Conditioned Value Functions for Offline Reinforcement
Learning
- URL: http://arxiv.org/abs/2212.04607v2
- Date: Mon, 30 Oct 2023 04:57:27 GMT
- Title: Confidence-Conditioned Value Functions for Offline Reinforcement
Learning
- Authors: Joey Hong and Aviral Kumar and Sergey Levine
- Abstract summary: We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
- Score: 86.59173545987984
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) promises the ability to learn effective
policies solely using existing, static datasets, without any costly online
interaction. To do so, offline RL methods must handle distributional shift
between the dataset and the learned policy. The most common approach is to
learn conservative, or lower-bound, value functions, which underestimate the
return of out-of-distribution (OOD) actions. However, such methods exhibit one
notable drawback: policies optimized on such value functions can only behave
according to a fixed, possibly suboptimal, degree of conservatism. This drawback
can be alleviated if we are instead able to learn policies for varying
degrees of conservatism at training time and devise a method to dynamically
choose one of them during evaluation. To this end, in this work, we propose
learning value functions that additionally condition on the degree of
conservatism, which we dub confidence-conditioned value functions. We derive a
new form of a Bellman backup that simultaneously learns Q-values for any degree
of confidence with high probability. By conditioning on confidence, our value
functions enable adaptive strategies during online evaluation by controlling
the confidence level using the history of observations thus far. This approach
can be implemented in practice by conditioning the Q-function from existing
conservative algorithms on the confidence. We theoretically show that our
learned value functions produce conservative estimates of the true value at any
desired confidence. Finally, we empirically show that our algorithm outperforms
existing conservative offline RL algorithms on multiple discrete control
domains.
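The abstract notes that the approach can be implemented by conditioning the Q-function of an existing conservative algorithm on the confidence level. As a rough illustration only, and not the authors' implementation, the sketch below conditions a discrete-action Q-network on a confidence input delta and scales a CQL-style conservatism penalty by that confidence; the class and function names, the way delta is sampled during training, and the delta-to-penalty scaling are all assumptions made for this example.

```python
# Hypothetical sketch (not the paper's code): a Q-network that takes a
# confidence level as an extra input, trained with a Bellman error plus a
# CQL-style penalty whose weight grows with the requested confidence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConfidenceConditionedQ(nn.Module):
    def __init__(self, obs_dim: int, num_actions: int, hidden: int = 256):
        super().__init__()
        # The confidence level delta in (0, 1) is appended to the observation.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, obs: torch.Tensor, delta: torch.Tensor) -> torch.Tensor:
        # obs: (batch, obs_dim), delta: (batch, 1) -> Q-values: (batch, num_actions)
        return self.net(torch.cat([obs, delta], dim=-1))

def confidence_conditioned_loss(q_net, target_q_net, batch, gamma: float = 0.99):
    obs, act, rew, next_obs, done = batch
    # Sample a confidence level per transition so one network is trained for
    # all confidence levels at once (an assumption about the training scheme).
    delta = torch.rand(obs.shape[0], 1)

    q_all = q_net(obs, delta)
    q_data = q_all.gather(1, act.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_v = target_q_net(next_obs, delta).max(dim=1).values
        target = rew + gamma * (1.0 - done) * next_v
    bellman = F.mse_loss(q_data, target)

    # CQL-style conservatism: push down a soft maximum of Q over all actions
    # and push up the dataset action, weighted more for higher confidence.
    penalty = torch.logsumexp(q_all, dim=1) - q_data
    return bellman + (delta.squeeze(1) * penalty).mean()
```

At evaluation time, the adaptive strategy described in the abstract would correspond to querying the same network at different confidence levels and choosing the level based on the history of observations; that selection rule is not part of this sketch.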
Related papers
- Strategically Conservative Q-Learning [89.17906766703763]
Offline reinforcement learning (RL) is a compelling paradigm to extend RL's practical utility.
The major difficulty in offline RL is mitigating the impact of approximation errors when encountering out-of-distribution (OOD) actions.
We propose a novel framework called Strategically Conservative Q-Learning (SCQ) that distinguishes between OOD data that is easy and hard to estimate.
arXiv Detail & Related papers (2024-06-06T22:09:46Z) - Compositional Conservatism: A Transductive Approach in Offline Reinforcement Learning [38.48360240082561]
We propose COmpositional COnservatism with Anchor-seeking (COCOA) for offline reinforcement learning.
We apply COCOA to four state-of-the-art offline RL algorithms and evaluate them on the D4RL benchmark.
arXiv Detail & Related papers (2024-04-06T17:02:18Z) - Vlearn: Off-Policy Learning with Efficient State-Value Function Estimation [22.129001951441015]
Existing off-policy reinforcement learning algorithms often rely on an explicit state-action-value function representation.
This reliance results in data inefficiency as maintaining a state-action-value function in high-dimensional action spaces is challenging.
We present an efficient approach that utilizes only a state-value function as the critic for off-policy deep reinforcement learning.
arXiv Detail & Related papers (2024-03-07T12:45:51Z) - Conservative State Value Estimation for Offline Reinforcement Learning [36.416504941791224]
Conservative State Value Estimation (CSVE) learns conservative V-function via directly imposing penalty on OOD states.
We develop a practical actor-critic algorithm in which the critic performs conservative value estimation by additionally sampling and penalizing states around the dataset.
We evaluate on classic continuous control tasks from D4RL, showing that our method performs better than conservative Q-function learning methods and is strongly competitive with recent SOTA methods.
arXiv Detail & Related papers (2023-02-14T08:13:55Z) - Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) defines the task of learning from a static logged dataset without continually interacting with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where OOD actions are actively trained by assigning them proper pseudo Q values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z) - Bellman Residual Orthogonalization for Offline Reinforcement Learning [53.17258888552998]
We introduce a new reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along a test function space.
We exploit this principle to derive confidence intervals for off-policy evaluation, as well as to optimize over policies within a prescribed policy class.
arXiv Detail & Related papers (2022-03-24T01:04:17Z) - Offline Reinforcement Learning with Implicit Q-Learning [85.62618088890787]
Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
arXiv Detail & Related papers (2021-10-12T17:05:05Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
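Since the abstract describes implementing the approach on top of existing conservative algorithms, the CQL entry above is the most directly relevant baseline. For reference, the snippet below is a minimal sketch of a discrete-action CQL-style regularizer (a soft maximum of Q over all actions minus the Q-value of the dataset action), which would be added to the standard Bellman error with a penalty weight; the function name is hypothetical and this is not the authors' code.

```python
# Minimal sketch of a discrete-action CQL-style regularizer: push down a soft
# maximum of Q over all actions while pushing up the action taken in the data.
import torch

def cql_regularizer(q_values: torch.Tensor, dataset_actions: torch.Tensor) -> torch.Tensor:
    """q_values: (batch, num_actions); dataset_actions: (batch,) integer action indices."""
    soft_max_q = torch.logsumexp(q_values, dim=1)                                 # (batch,)
    data_q = q_values.gather(1, dataset_actions.long().unsqueeze(1)).squeeze(1)   # (batch,)
    # Added to the Bellman error, scaled by a penalty coefficient, to yield the
    # lower-bound (conservative) value estimates described in the entry above.
    return (soft_max_q - data_q).mean()
```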