EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
- URL: http://arxiv.org/abs/2007.11091v2
- Date: Wed, 13 Jan 2021 19:11:34 GMT
- Title: EMaQ: Expected-Max Q-Learning Operator for Simple Yet Effective Offline and Online RL
- Authors: Seyed Kamyar Seyed Ghasemipour, Dale Schuurmans, Shixiang Shane Gu
- Abstract summary: Off-policy reinforcement learning holds the promise of sample-efficient learning of decision-making policies.
In the offline RL setting, standard off-policy RL methods can significantly underperform.
We introduce the Expected-Max Q-Learning (EMaQ) backup operator, which is more closely tied to the resulting practical algorithm than prior theoretical treatments.
- Score: 48.552287941528
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Off-policy reinforcement learning holds the promise of sample-efficient
learning of decision-making policies by leveraging past experience. However, in
the offline RL setting -- where a fixed collection of interactions is provided
and no further interactions are allowed -- it has been shown that standard
off-policy RL methods can significantly underperform. Recently proposed methods
often aim to address this shortcoming by constraining learned policies to
remain close to the given dataset of interactions. In this work, we closely
investigate an important simplification of BCQ -- a prior approach for offline
RL -- which removes a heuristic design choice and naturally restricts extracted
policies to remain exactly within the support of a given behavior policy.
Importantly, in contrast to their original theoretical considerations, we
derive this simplified algorithm through the introduction of a novel backup
operator, Expected-Max Q-Learning (EMaQ), which is more closely related to the
resulting practical algorithm. Specifically, in addition to the distribution
support, EMaQ explicitly considers the number of samples and the proposal
distribution, allowing us to derive new sub-optimality bounds which can serve
as a novel measure of complexity for offline RL problems. In the offline RL
setting -- the main focus of this work -- EMaQ matches and outperforms prior
state-of-the-art in the D4RL benchmarks. In the online RL setting, we
demonstrate that EMaQ is competitive with Soft Actor Critic. The key
contributions of our empirical findings are to demonstrate the importance of
careful generative model design for estimating behavior policies and to provide
an intuitive notion of complexity for offline RL problems. With its simple
interpretation and fewer moving parts -- for example, no explicit function
approximator representing the policy -- EMaQ serves as a strong yet
easy-to-implement baseline for future work.
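The backup at the heart of the abstract admits a compact description: replace the max over all actions in the Bellman optimality backup with a max over N actions sampled from a learned behavior model mu(a|s), which keeps the bootstrap inside the support of the data. As a rough illustration only, here is a minimal PyTorch-style sketch of that target computation; `q_target_net`, `behavior_model`, and the tensor shapes are hypothetical placeholders, not the authors' implementation.

```python
# Minimal sketch of the EMaQ bootstrap target, assuming a PyTorch setup.
# The operator replaces the max over all actions with a max over N actions
# sampled from a behavior model mu(a|s) fit to the offline dataset:
#     target = r + gamma * max_{i=1..N} Q_target(s', a'_i),  a'_i ~ mu(.|s')
# `q_target_net` and `behavior_model` are assumed callables, not the
# authors' code.
import torch

def emaq_target(q_target_net, behavior_model, reward, next_obs, done,
                num_samples: int = 10, gamma: float = 0.99):
    with torch.no_grad():
        batch = next_obs.shape[0]
        # N candidate actions per next state, all within the data support.
        actions = behavior_model(next_obs, num_samples)            # [B, N, A]
        # Evaluate the target Q-network on every (next_obs, candidate) pair.
        obs_rep = next_obs.unsqueeze(1).expand(-1, num_samples, -1)
        q_vals = q_target_net(obs_rep.reshape(batch * num_samples, -1),
                              actions.reshape(batch * num_samples, -1))
        q_vals = q_vals.reshape(batch, num_samples)
        # Expected-max backup: best of the N sampled in-support actions.
        max_q = q_vals.max(dim=1).values                           # [B]
        return reward + gamma * (1.0 - done) * max_q
```

Acting can follow the same recipe: sample N candidate actions from the behavior model at the current state and execute the one with the highest Q-value, which is consistent with the abstract's point that no explicit policy network is required.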
Related papers
- Offline Reinforcement Learning for Wireless Network Optimization with
Mixture Datasets [13.22086908661673]
The recent development of reinforcement learning (RL) has boosted the adoption of online RL for wireless radio resource management (RRM).
Online RL algorithms, however, require direct interactions with the environment.
Offline RL can produce a near-optimal RL policy even when all involved behavior policies are highly suboptimal.
arXiv Detail & Related papers (2023-11-19T21:02:17Z) - Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
arXiv Detail & Related papers (2023-01-05T23:14:38Z) - Revisiting the Linear-Programming Framework for Offline RL with General
Function Approximation [24.577243536475233]
Offline reinforcement learning (RL) concerns pursuing an optimal policy for sequential decision-making from a pre-collected dataset.
Recent theoretical progress has focused on developing sample-efficient offline RL algorithms with various relaxed assumptions on data coverage and function approximators.
We revisit the linear-programming framework for offline RL, and advance the existing results in several aspects.
arXiv Detail & Related papers (2022-12-28T15:28:12Z) - Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z) - Offline Reinforcement Learning: Fundamental Barriers for Value Function
Approximation [74.3002974673248]
We consider the offline reinforcement learning problem, where the aim is to learn a decision making policy from logged data.
Offline RL is becoming increasingly relevant in practice because it avoids costly online data collection and is well suited to safety-critical domains.
Our results show that sample-efficient offline reinforcement learning requires either restrictive coverage conditions or representation conditions that go beyond supervised learning.
arXiv Detail & Related papers (2021-11-21T23:22:37Z) - FOCAL: Efficient Fully-Offline Meta-Reinforcement Learning via Distance
Metric Learning and Behavior Regularization [10.243908145832394]
We study the offline meta-reinforcement learning (OMRL) problem, a paradigm which enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks.
This problem is still not fully understood; two major challenges need to be addressed.
We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches.
arXiv Detail & Related papers (2020-10-02T17:13:39Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We propose Conservative Q-Learning (CQL), which substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z) - MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify the existing model-based RL methods by applying them with rewards artificially penalized by the uncertainty of the dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)