Offline Reinforcement Learning with Implicit Q-Learning
- URL: http://arxiv.org/abs/2110.06169v1
- Date: Tue, 12 Oct 2021 17:05:05 GMT
- Title: Offline Reinforcement Learning with Implicit Q-Learning
- Authors: Ilya Kostrikov, Ashvin Nair, Sergey Levine
- Abstract summary: Current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy.
We propose an offline RL method that never needs to evaluate actions outside of the dataset.
This method enables the learned policy to improve substantially over the best behavior in the data through generalization.
- Score: 85.62618088890787
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning requires reconciling two conflicting aims:
learning a policy that improves over the behavior policy that collected the
dataset, while at the same time minimizing the deviation from the behavior
policy so as to avoid errors due to distributional shift. This trade-off is
critical, because most current offline reinforcement learning methods need to
query the value of unseen actions during training to improve the policy, and
therefore need to either constrain these actions to be in-distribution, or else
regularize their values. We propose an offline RL method that never needs to
evaluate actions outside of the dataset, but still enables the learned policy
to improve substantially over the best behavior in the data through
generalization. The main insight in our work is that, instead of evaluating
unseen actions from the latest policy, we can approximate the policy
improvement step implicitly by treating the state value function as a random
variable, with randomness determined by the action (while still integrating
over the dynamics to avoid excessive optimism), and then taking a state
conditional upper expectile of this random variable to estimate the value of
the best actions in that state. This leverages the generalization capacity of
the function approximator to estimate the value of the best available action at
a given state without ever directly querying a Q-function with this unseen
action. Our algorithm alternates between fitting this upper expectile value
function and backing it up into a Q-function. Then, we extract the policy via
advantage-weighted behavioral cloning. We dub our method implicit Q-learning
(IQL). IQL achieves state-of-the-art performance on D4RL, a standard
benchmark for offline reinforcement learning. We also demonstrate that IQL
achieves strong performance when fine-tuned with online interaction after
offline initialization.
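To make the procedure described in the abstract concrete, the sketch below shows the three IQL losses in PyTorch: expectile regression that fits V toward an upper expectile of Q at dataset actions, a Bellman backup that regresses Q toward r + γV(s'), and advantage-weighted behavioral cloning for policy extraction. This is a minimal illustration rather than the authors' implementation; the network architectures, batch layout, and hyperparameters (tau, beta, gamma) are assumptions chosen for readability.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal


def mlp(in_dim, out_dim, hidden=256):
    # Small two-hidden-layer network; sizes are illustrative.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )


class GaussianPolicy(nn.Module):
    """Diagonal Gaussian policy, used only for the cloning term below."""

    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.mean = mlp(obs_dim, act_dim)
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def log_prob(self, obs, act):
        dist = Normal(self.mean(obs), self.log_std.exp())
        return dist.log_prob(act).sum(-1)


def expectile_loss(diff, tau=0.7):
    # Asymmetric squared loss |tau - 1(diff < 0)| * diff^2: residuals where
    # Q exceeds V are up-weighted, so V approaches an upper expectile of Q
    # over the actions that appear in the dataset.
    weight = torch.abs(tau - (diff < 0).float())
    return (weight * diff.pow(2)).mean()


def iql_losses(q_net, v_net, policy, target_q_net, batch,
               gamma=0.99, tau=0.7, beta=3.0):
    s, a, r, s_next, done = batch  # tensors sampled from the offline dataset

    # (1) Value loss: expectile regression of V(s) toward the target Q(s, a),
    #     evaluated only at dataset actions (no unseen-action queries).
    with torch.no_grad():
        q_target = target_q_net(torch.cat([s, a], -1)).squeeze(-1)
    v = v_net(s).squeeze(-1)
    value_loss = expectile_loss(q_target - v, tau)

    # (2) Q loss: Bellman backup that bootstraps from V at the next state.
    with torch.no_grad():
        backup = r + gamma * (1.0 - done) * v_net(s_next).squeeze(-1)
    q = q_net(torch.cat([s, a], -1)).squeeze(-1)
    q_loss = ((q - backup) ** 2).mean()

    # (3) Policy extraction: advantage-weighted behavioral cloning with
    #     exponentiated, clipped advantage weights.
    with torch.no_grad():
        weights = torch.clamp(torch.exp(beta * (q_target - v)), max=100.0)
    policy_loss = -(weights * policy.log_prob(s, a)).mean()

    return value_loss, q_loss, policy_loss
```

In a full training loop, each loss would typically get its own optimizer, and the target Q-network would be maintained as a slowly updated copy of the online Q-network, so that the value and policy updates never require evaluating actions outside the dataset.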
Related papers
- Offline Reinforcement Learning with On-Policy Q-Function Regularization [57.09073809901382]
We deal with the (potentially catastrophic) extrapolation error induced by the distribution shift between the history dataset and the desired policy.
We propose two algorithms taking advantage of the estimated Q-function through regularizations, and demonstrate they exhibit strong performance on the D4RL benchmarks.
arXiv Detail & Related papers (2023-07-25T21:38:08Z)
- Confidence-Conditioned Value Functions for Offline Reinforcement Learning [86.59173545987984]
We propose a new form of Bellman backup that simultaneously learns Q-values for any degree of confidence with high probability.
We theoretically show that our learned value functions produce conservative estimates of the true value at any desired confidence.
arXiv Detail & Related papers (2022-12-08T23:56:47Z)
- Mildly Conservative Q-Learning for Offline Reinforcement Learning [63.2183622958666]
Offline reinforcement learning (RL) is the task of learning from a static logged dataset without further interaction with the environment.
Existing approaches, which penalize unseen actions or regularize toward the behavior policy, are often too pessimistic.
We propose Mildly Conservative Q-learning (MCQ), where out-of-distribution (OOD) actions are actively trained by assigning them proper pseudo Q-values.
arXiv Detail & Related papers (2022-06-09T19:44:35Z)
- Curriculum Offline Imitation Learning [72.1015201041391]
Offline reinforcement learning tasks require the agent to learn from a pre-collected dataset with no further interaction with the environment.
We propose Curriculum Offline Imitation Learning (COIL), which utilizes an experience-picking strategy to imitate adaptive neighboring policies with higher returns.
On continuous control benchmarks, we compare COIL against both imitation-based and RL-based methods, showing that it not only avoids merely learning mediocre behavior on mixed datasets but is also competitive with state-of-the-art offline RL methods.
arXiv Detail & Related papers (2021-11-03T08:02:48Z)
- BRAC+: Improved Behavior Regularized Actor Critic for Offline Reinforcement Learning [14.432131909590824]
Offline Reinforcement Learning aims to train effective policies using previously collected datasets.
Standard off-policy RL algorithms are prone to overestimations of the values of out-of-distribution (less explored) actions.
We improve behavior-regularized offline reinforcement learning and propose BRAC+.
arXiv Detail & Related papers (2021-10-02T23:55:49Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
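The last entry above, Conservative Q-Learning (CQL), obtains its lower bound on the policy's value from a penalty that pushes Q-values down over all actions while pushing them up at dataset actions. The minimal sketch below shows that penalty for a discrete-action Q-network; the q_net interface and the alpha coefficient are assumptions, and the full CQL objective adds this term to a standard Bellman error.

```python
import torch


def cql_penalty(q_net, s, a, alpha=1.0):
    # q_net maps a batch of states to Q-values for every discrete action.
    q_all = q_net(s)                                         # [batch, n_actions]
    logsumexp = torch.logsumexp(q_all, dim=-1)               # soft maximum over all actions
    q_data = q_all.gather(-1, a.unsqueeze(-1)).squeeze(-1)   # Q at the dataset actions
    # Penalize large Q-values overall while supporting Q at in-dataset actions,
    # which is what makes the learned values conservative.
    return alpha * (logsumexp - q_data).mean()
```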