When should we prefer Decision Transformers for Offline Reinforcement
Learning?
- URL: http://arxiv.org/abs/2305.14550v3
- Date: Mon, 11 Mar 2024 21:22:22 GMT
- Title: When should we prefer Decision Transformers for Offline Reinforcement
Learning?
- Authors: Prajjwal Bhargava, Rohan Chitnis, Alborz Geramifard, Shagun Sodhani,
Amy Zhang
- Abstract summary: Three popular algorithms for offline RL are Conservative Q-Learning (CQL), Behavior Cloning (BC), and Decision Transformer (DT).
We study this question empirically by exploring the performance of these algorithms across the commonly used D4RL and Robomimic benchmarks.
We find that scaling the amount of data for DT by 5x gives a 2.5x average score improvement on Atari.
- Score: 29.107029606830015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) allows agents to learn effective,
return-maximizing policies from a static dataset. Three popular algorithms for
offline RL are Conservative Q-Learning (CQL), Behavior Cloning (BC), and
Decision Transformer (DT), from the class of Q-Learning, Imitation Learning,
and Sequence Modeling respectively. A key open question is: which algorithm is
preferred under what conditions? We study this question empirically by
exploring the performance of these algorithms across the commonly used D4RL and
Robomimic benchmarks. We design targeted experiments to understand their
behavior concerning data suboptimality, task complexity, and stochasticity. Our
key findings are: (1) DT requires more data than CQL to learn competitive
policies but is more robust; (2) DT is a substantially better choice than both
CQL and BC in sparse-reward and low-quality data settings; (3) DT and BC are
preferable as task horizon increases, or when data is obtained from human
demonstrators; and (4) CQL excels in situations characterized by the
combination of high stochasticity and low data quality. We also investigate
architectural choices and scaling trends for DT on Atari and D4RL and make
design/scaling recommendations. We find that scaling the amount of data for DT
by 5x gives a 2.5x average score improvement on Atari.
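To make the sequence-modeling view of DT concrete, below is a minimal sketch of how an offline trajectory is commonly turned into Decision Transformer inputs: each timestep contributes a (return-to-go, state, action) triple, and the model is later conditioned on a target return. This is an illustrative sketch only; the function names and the toy data are assumptions, not code from the paper.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Suffix sums of rewards: rtg[t] = sum_{k >= t} gamma^(k-t) * r_k."""
    rtg = np.zeros(len(rewards), dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def build_dt_inputs(states, actions, rewards, context_len):
    """Slice a trajectory into (return-to-go, state, action) windows of
    length `context_len`, the token layout commonly used by DT-style models."""
    rtg = returns_to_go(np.asarray(rewards, dtype=np.float64))
    windows = []
    for start in range(0, len(rewards) - context_len + 1):
        end = start + context_len
        windows.append({
            "returns_to_go": rtg[start:end],
            "states": np.asarray(states)[start:end],
            "actions": np.asarray(actions)[start:end],
            "timesteps": np.arange(start, end),
        })
    return windows

# Toy usage: a 5-step trajectory with 3-dim states and scalar actions.
T, obs_dim = 5, 3
traj_states = np.random.randn(T, obs_dim)
traj_actions = np.random.randn(T, 1)
traj_rewards = [0.0, 0.0, 1.0, 0.0, 2.0]  # sparse-style rewards
batch = build_dt_inputs(traj_states, traj_actions, traj_rewards, context_len=3)
print(len(batch), batch[0]["returns_to_go"])  # 3 windows; rtg[0] = 3.0 with gamma=1
```

At evaluation time, DT-style policies replace the dataset return-to-go with a user-chosen target return and decrement it by the observed reward after each step, which is why data quality and reward sparsity interact with how well return conditioning works in practice.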
Related papers
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can outperform the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
arXiv Detail & Related papers (2023-10-18T06:07:10Z) - Improving and Benchmarking Offline Reinforcement Learning Algorithms [87.67996706673674]
This work aims to bridge the gaps caused by low-level choices and datasets.
We empirically investigate 20 implementation choices using three representative algorithms.
We find two variants CRR+ and CQL+ achieving new state-of-the-art on D4RL.
arXiv Detail & Related papers (2023-06-01T17:58:46Z) - IQL-TD-MPC: Implicit Q-Learning for Hierarchical Model Predictive
Control [8.374040635931298]
We introduce an offline model-based RL algorithm, IQL-TD-MPC, that extends the state-of-the-art Temporal Difference Learning for Model Predictive Control (TD-MPC) with Implicit Q-Learning (IQL).
More specifically, we pre-train a temporally abstract IQL-TD-MPC Manager to predict "intent embeddings", which roughly correspond to subgoals, via planning.
We empirically show that augmenting state representations with intent embeddings generated by an IQL-TD-MPC manager significantly improves off-the-shelf offline RL agents.
arXiv Detail & Related papers (2023-06-01T16:24:40Z) - Efficient Diffusion Policies for Offline Reinforcement Learning [85.73757789282212]
Diffusion-QL significantly boosts the performance of offline RL by representing the policy with a diffusion model.
We propose the efficient diffusion policy (EDP) to overcome the efficiency challenges this introduces.
EDP constructs actions from corrupted ones at training time to avoid running the full sampling chain.
arXiv Detail & Related papers (2023-05-31T17:55:21Z) - Offline RL with No OOD Actions: In-Sample Learning via Implicit Value
Regularization [90.9780151608281]
In-sample learning (e.g., IQL) improves the policy via quantile regression using only data samples.
We make a key finding that the in-sample learning paradigm arises under the Implicit Value Regularization (IVR) framework.
We propose two practical algorithms, Sparse $Q$-learning (SQL) and Exponential $Q$-learning (EQL), which adopt the same value regularization used in existing works.
arXiv Detail & Related papers (2023-03-28T08:30:01Z) - Skill Decision Transformer [9.387749254963595]
Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL).
Generalized Decision Transformers (GDTs) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data.
We show that Skill DT can not only perform offline state-marginal matching (SMM), but can also discover descriptive behaviors that can be easily sampled.
arXiv Detail & Related papers (2023-01-31T11:52:46Z) - Extreme Q-Learning: MaxEnt RL without Entropy [88.97516083146371]
Modern Deep Reinforcement Learning (RL) algorithms require estimates of the maximal Q-value, which are difficult to compute in continuous domains.
We introduce a new update rule for online and offline RL which directly models the maximal value using Extreme Value Theory (EVT).
Using EVT, we derive our Extreme Q-Learning framework and consequently online and, for the first time, offline MaxEnt Q-learning algorithms.
arXiv Detail & Related papers (2023-01-05T23:14:38Z) - Q-learning Decision Transformer: Leveraging Dynamic Programming for
Conditional Sequence Modelling in Offline RL [0.0]
Decision Transformer (DT) combines the conditional policy approach and a transformer architecture.
DT lacks stitching ability, one of the critical capabilities offline RL methods need in order to recover an optimal policy from suboptimal trajectories.
We propose the Q-learning Decision Transformer (QDT) to address this shortcoming of DT; a rough sketch of the return-relabeling idea behind it appears after this list.
arXiv Detail & Related papers (2022-09-08T18:26:39Z) - When Should We Prefer Offline Reinforcement Learning Over Behavioral
Cloning? [86.43517734716606]
Offline reinforcement learning (RL) algorithms can acquire effective policies by utilizing previously collected experience, without any online interaction.
Behavioral cloning (BC) algorithms mimic a subset of the dataset via supervised learning.
We show that offline RL policies trained on sufficiently noisy suboptimal data can attain better performance than even BC algorithms given expert data.
arXiv Detail & Related papers (2022-04-12T08:25:34Z)