When does return-conditioned supervised learning work for offline
reinforcement learning?
- URL: http://arxiv.org/abs/2206.01079v1
- Date: Thu, 2 Jun 2022 15:05:42 GMT
- Title: When does return-conditioned supervised learning work for offline
reinforcement learning?
- Authors: David Brandfonbrener, Alberto Bietti, Jacob Buckman, Romain Laroche,
Joan Bruna
- Abstract summary: We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
- Score: 51.899892382786526
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Several recent works have proposed a class of algorithms for the offline
reinforcement learning (RL) problem that we will refer to as return-conditioned
supervised learning (RCSL). RCSL algorithms learn the distribution of actions
conditioned on both the state and the return of the trajectory. Then they
define a policy by conditioning on achieving high return. In this paper, we
provide a rigorous study of the capabilities and limitations of RCSL, something
which is crucially missing in previous work. We find that RCSL returns the
optimal policy under a set of assumptions that are stronger than those needed
for the more traditional dynamic programming-based algorithms. We provide
specific examples of MDPs and datasets that illustrate the necessity of these
assumptions and the limits of RCSL. Finally, we present empirical evidence that
these limitations will also cause issues in practice by providing illustrative
experiments in simple point-mass environments and on datasets from the D4RL
benchmark.
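To make the RCSL recipe described in the abstract concrete, here is a minimal sketch of return-conditioned supervised learning: fit the logged actions conditioned on state and return-to-go by supervised regression, then act by conditioning on a high target return. This is an illustration under stated assumptions, not the authors' implementation; the network sizes, the squared-error loss (a stand-in for modelling the full action distribution), and all names are ours.

```python
import torch
import torch.nn as nn

class RCSLPolicy(nn.Module):
    """Return-conditioned policy: maps (state, target return) to an action."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, target_return):
        # Condition on both the state and the scalar return of the trajectory.
        return self.net(torch.cat([state, target_return], dim=-1))

def train_step(policy, optimizer, states, actions, returns_to_go):
    """Pure supervised learning: regress logged actions from (state, return) pairs."""
    loss = ((policy(states, returns_to_go) - actions) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def act(policy, state, high_return):
    """Define the test-time policy by conditioning on achieving a high return."""
    with torch.no_grad():
        return policy(state, high_return)
```

The paper's point is precisely that this simple recipe recovers a near-optimal policy only under assumptions on the environment and the data that are stronger than those required by dynamic-programming-based offline RL methods.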
Related papers
- How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z)
- Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
Offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time (an illustrative resampling sketch appears after this list).
arXiv Detail & Related papers (2022-10-17T16:34:01Z)
- Jump-Start Reinforcement Learning [68.82380421479675]
We present a meta algorithm that can use offline data, demonstrations, or a pre-existing policy to initialize an RL policy.
In particular, we propose Jump-Start Reinforcement Learning (JSRL), an algorithm that employs two policies to solve tasks.
We show via experiments that JSRL is able to significantly outperform existing imitation and reinforcement learning algorithms.
arXiv Detail & Related papers (2022-04-05T17:25:22Z)
- Constraint Sampling Reinforcement Learning: Incorporating Expertise For Faster Learning [43.562783189118]
We introduce a practical algorithm for incorporating human insight to speed learning.
Our algorithm, Constraint Sampling Reinforcement Learning (CSRL), incorporates prior domain knowledge as constraints/restrictions on the RL policy.
In all cases, CSRL learns a good policy faster than baselines.
arXiv Detail & Related papers (2021-12-30T22:02:42Z)
- Critic Regularized Regression [70.8487887738354]
We propose a novel offline RL algorithm to learn policies from data using a form of critic-regularized regression (CRR).
We find that CRR performs surprisingly well and scales to tasks with high-dimensional state and action spaces.
arXiv Detail & Related papers (2020-06-26T17:50:26Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement Learning [25.099754758455415]
Off-policy reinforcement learning algorithms promise to be applicable in settings where only a fixed dataset of environment interactions is available.
However, standard off-policy algorithms fail in the batch setting for continuous control.
arXiv Detail & Related papers (2020-02-19T19:21:08Z)
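As flagged in the ReD entry above, one way to read "return-based data rebalance" is as resampling trajectories with probabilities that grow with their returns, which reweights the data without changing its support. The sketch below is a hypothetical illustration of that idea; the function name, softmax weighting, and temperature are assumptions, not taken from the ReD paper.

```python
import numpy as np

def rebalance_by_return(trajectories, returns, temperature=1.0, num_samples=None):
    """Resample trajectories with probabilities that increase with their return.

    Only existing trajectories are duplicated or dropped, so the support of the
    empirical data distribution is unchanged; high-return data is merely upweighted.
    """
    returns = np.asarray(returns, dtype=np.float64)
    logits = (returns - returns.max()) / temperature  # stabilize the softmax
    probs = np.exp(logits) / np.exp(logits).sum()
    num_samples = num_samples or len(trajectories)
    idx = np.random.choice(len(trajectories), size=num_samples, replace=True, p=probs)
    return [trajectories[i] for i in idx]
```

A typical use would be to rebalance the offline dataset once, before handing it to any standard offline RL algorithm.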