When does return-conditioned supervised learning work for offline
reinforcement learning?
- URL: http://arxiv.org/abs/2206.01079v1
- Date: Thu, 2 Jun 2022 15:05:42 GMT
- Title: When does return-conditioned supervised learning work for offline
reinforcement learning?
- Authors: David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche,
Joan Bruna
- Abstract summary: We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
- Score: 51.899892382786526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent works have proposed a class of algorithms for the offline
reinforcement learning (RL) problem that we will refer to as return-conditioned
supervised learning (RCSL). RCSL algorithms learn the distribution of actions
conditioned on both the state and the return of the trajectory. Then they
define a policy by conditioning on achieving high return. In this paper, we
provide a rigorous study of the capabilities and limitations of RCSL, something
which is crucially missing in previous work. We find that RCSL returns the
optimal policy under a set of assumptions that are stronger than those needed
for the more traditional dynamic programming-based algorithms. We provide
specific examples of MDPs and datasets that illustrate the necessity of these
assumptions and the limits of RCSL. Finally, we present empirical evidence that
these limitations will also cause issues in practice by providing illustrative
experiments in simple point-mass environments and on datasets from the D4RL
benchmark.
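To make the RCSL recipe described in the abstract concrete, here is a minimal sketch of return-conditioned supervised learning: fit the logged actions conditioned on state and return-to-go by supervised regression, then act by conditioning on a high target return. This is an illustration under stated assumptions, not the authors' implementation; the network sizes, the squared-error loss (a stand-in for modelling the full action distribution), and all names are ours.

```python
import torch
import torch.nn as nn

class RCSLPolicy(nn.Module):
    """Return-conditioned policy: maps (state, target return) to an action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, target_return):
        # Condition on both the state and the scalar return of the trajectory.
        return self.net(torch.cat([state, target_return], dim=-1))

def train_step(policy, optimizer, states, actions, returns_to_go):
    """Pure supervised learning: regress logged actions from (state, return) pairs."""
    loss = ((policy(states, returns_to_go) - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(policy, state, high_return):
    """Define the test-time policy by conditioning on achieving a high return."""
    with torch.no_grad():
        return policy(state, high_return)
```

The paper's point is precisely that this simple recipe recovers a near-optimal policy only under assumptions on the environment and the data that are stronger than those required by dynamic-programming-based offline RL methods.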
Related papers
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time (an illustrative resampling sketch appears after this list).
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning [43.562783189118]
We introduce a practical algorithm for incorporating human insight to speed learning.
Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints/restrictions on the RL policy.
In all cases, CSRL learns a good policy faster than baselines.
arXiv Detail & Related papers (2021-12-30T22:02:42Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning [25.099754758455415]
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed dataset of environment interactions is available.
However, standard off-policy algorithms fail in the batch setting for continuous control.
arXiv Detail & Related papers (2020-02-19T19:21:08Z)
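As flagged in the ReD entry above, one way to read "return-based data rebalance" is as resampling trajectories with probabilities that grow with their returns, which reweights the data without changing its support. The sketch below is a hypothetical illustration of that idea; the function name, softmax weighting, and temperature are assumptions, not taken from the ReD paper.

```python
import numpy as np

def rebalance_by_return(trajectories, returns, temperature=1.0, num_samples=None):
    """Resample trajectories with probabilities that increase with their return.

    Only existing trajectories are duplicated or dropped, so the support of the
    empirical data distribution is unchanged; high-return data is merely upweighted.
    """
    returns = np.asarray(returns, dtype=np.float64)
    logits = (returns - returns.max()) / temperature  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    num_samples = num_samples or len(trajectories)
    idx = np.random.choice(len(trajectories), size=num_samples, replace=True, p=probs)
    return [trajectories[i] for i in idx]
```

A typical use would be to rebalance the offline dataset once, before handing it to any standard offline RL algorithm.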