You Can't Count on Luck: Why Decision Transformers Fail in Stochastic
Environments
- URL: http://arxiv.org/abs/2205.15967v1
- Date: Tue, 31 May 2022 17:15:44 GMT
- Title: You Can't Count on Luck: Why Decision Transformers Fail in Stochastic
Environments
- Authors: Keiran Paster and Sheila McIlraith and Jimmy Ba
- Abstract summary: Methods such as Decision Transformer, which reduce reinforcement learning to a prediction task and solve it via supervised learning (RvS), have become popular due to their simplicity, robustness to hyperparameters, and strong overall performance on offline RL tasks.
However, simply conditioning a model on a desired return and taking the predicted action can fail dramatically in stochastic environments, where a trajectory may achieve a given return only due to luck.
In this work, we describe the limitations of RvS approaches in stochastic environments and propose a solution.
Rather than simply conditioning on the return of a single trajectory as is standard practice, our proposed method, ESPER, learns to cluster trajectories and conditions on average cluster returns, which are independent of environment stochasticity.
- Score: 31.117949189062895
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, methods such as Decision Transformer that reduce reinforcement
learning to a prediction task and solve it via supervised learning (RvS) have
become popular due to their simplicity, robustness to hyperparameters, and
strong overall performance on offline RL tasks. However, simply conditioning a
probabilistic model on a desired return and taking the predicted action can
fail dramatically in stochastic environments since trajectories that result in
a return may have only achieved that return due to luck. In this work, we
describe the limitations of RvS approaches in stochastic environments and
propose a solution. Rather than simply conditioning on the return of a single
trajectory as is standard practice, our proposed method, ESPER, learns to
cluster trajectories and conditions on average cluster returns, which are
independent of environment stochasticity. Doing so allows ESPER to achieve
strong alignment between target return and expected performance in real
environments. We demonstrate this in several challenging stochastic offline-RL
tasks, including the puzzle game 2048 and Connect Four played
against a stochastic opponent. In all tested domains, ESPER achieves
significantly better alignment between the target return and achieved return
than simply conditioning on returns. ESPER also achieves higher maximum
performance than even the value-based baselines.
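As a rough illustration of the conditioning idea described in the abstract (not the paper's exact algorithm: ESPER learns its clustering so that cluster membership is independent of environment stochasticity, whereas the grouping below is a simple k-means stand-in), the sketch relabels each trajectory's conditioning target with its cluster's average return:

```python
# Illustrative sketch only: relabel each trajectory's conditioning target with
# the average return of its cluster, so targets track expected rather than
# lucky outcomes. The k-means grouping is a stand-in for ESPER's learned clustering.
import numpy as np
from sklearn.cluster import KMeans

def cluster_average_return_relabel(traj_features, traj_returns, n_clusters=8, seed=0):
    """traj_features: (N, d) per-trajectory summary features.
    traj_returns:  (N,) realized (possibly lucky) returns.
    Returns an (N,) array of cluster-average returns to condition on."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(traj_features)
    relabeled = np.empty_like(traj_returns, dtype=float)
    for c in range(n_clusters):
        mask = labels == c
        relabeled[mask] = traj_returns[mask].mean()  # expected return of the cluster
    return relabeled
```

A return-conditioned policy (for example, a Decision-Transformer-style model) would then be trained and queried with these relabeled targets instead of realized returns.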
Related papers
- Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning [5.398202201395825]
Decision Transformer (DT) has demonstrated exceptional capabilities in offline reinforcement learning.
Decision ConvFormer (DC) is easier to understand in the context of modeling RL trajectories within a Markov Decision Process.
We propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's understanding of RL trajectories with a term that maximizes action values.
arXiv Detail & Related papers (2024-09-12T14:10:22Z)
- Adversarially Robust Decision Transformer [17.49328076347261]
We propose a worst-case-aware RvS algorithm, the Adversarially Robust Decision Transformer (ARDT).
ARDT learns and conditions the policy on in-sample minimax returns-to-go.
In large-scale sequential games and continuous adversarial RL environments, ARDT demonstrates significantly superior robustness to powerful test-time adversaries.
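A deliberately simplified sketch of the relabeling spirit behind ARDT: condition on a pessimistic return estimate rather than the realized return. The minimum-over-the-dataset rule below is an illustrative stand-in, not the paper's in-sample minimax estimator.

```python
# Simplified sketch of worst-case return relabeling. ARDT itself estimates
# in-sample *minimax* returns-to-go; here we only keep, for each state-action
# pair seen in the data, the worst return-to-go that followed it.
from collections import defaultdict

def worst_case_returns_to_go(trajectories, gamma=1.0):
    """trajectories: list of [(state, action, reward), ...]; states/actions hashable.
    Returns dict mapping (state, action) -> worst observed return-to-go."""
    worst = defaultdict(lambda: float("inf"))
    for traj in trajectories:
        rtg = 0.0
        for state, action, reward in reversed(traj):
            rtg = reward + gamma * rtg
            worst[(state, action)] = min(worst[(state, action)], rtg)  # unluckiest outcome
    return dict(worst)
```

Conditioning a return-conditioned policy on such pessimistic targets biases it toward actions that hold up against an adversarial environment.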
arXiv Detail & Related papers (2024-07-25T22:12:47Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
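A minimal sketch of the shared QDC/QT idea of adding an action-value-maximizing term to the sequence-modeling loss; `seq_model` and `q_net` are assumed, hypothetical PyTorch modules, and the loss form is illustrative rather than either paper's exact objective.

```python
# Sketch of a Q-value-regularized sequence-modeling loss: the usual behavior-
# cloning / CSM term plus a term rewarding predicted actions with high learned
# action-values. `seq_model` and `q_net` are hypothetical torch.nn.Module instances.
import torch
import torch.nn.functional as F

def q_regularized_csm_loss(seq_model, q_net, states, actions, returns_to_go, alpha=0.5):
    """states: (B, T, d_s), actions: (B, T, d_a), returns_to_go: (B, T, 1)."""
    pred_actions = seq_model(states, actions, returns_to_go)  # (B, T, d_a)
    bc_loss = F.mse_loss(pred_actions, actions)               # sequence-modeling term
    q_values = q_net(states, pred_actions)                    # (B, T, 1)
    return bc_loss - alpha * q_values.mean()                  # encourage high action-values
```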
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection.
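A rough, tabular sketch of the core trick of maximizing over a random subset of about log2(n) actions instead of all n; the actual StochDQN/StochDDQN updates differ in detail.

```python
# Sketch of "stochastic maximization": the bootstrap max in Q-learning is taken
# over a random O(log n) subset of actions rather than the full action set.
import math
import random

def stochastic_argmax(q_row, rng=random):
    """q_row: list of n Q-values for one state; returns an approximate argmax
    computed over a random subset of about log2(n) actions."""
    n = len(q_row)
    k = max(1, math.ceil(math.log2(n)))
    candidates = rng.sample(range(n), k)
    return max(candidates, key=lambda a: q_row[a])

def stochastic_q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step whose target max uses the sampled subset."""
    a_star = stochastic_argmax(Q[s_next])
    Q[s][a] += alpha * (r + gamma * Q[s_next][a_star] - Q[s][a])
```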
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple Logits Retargeting approach (LORT) that does not require prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Critic-Guided Decision Transformer for Offline Reinforcement Learning [28.211835303617118]
Critic-Guided Decision Transformer (CGDT) combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer.
arXiv Detail & Related papers (2023-12-21T10:29:17Z)
- Time-series Generation by Contrastive Imitation [87.51882102248395]
We study a generative framework that seeks to combine the strengths of both: Motivated by a moment-matching objective to mitigate compounding error, we optimize a local (but forward-looking) transition policy.
At inference, the learned policy serves as the generator for iterative sampling, and the learned energy serves as a trajectory-level measure for evaluating sample quality.
arXiv Detail & Related papers (2023-11-02T16:45:25Z)
- Non-ergodicity in reinforcement learning: robustness via ergodicity transformations [8.44491527275706]
Application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance.
We argue that a fundamental issue contributing to RL's lack of robustness in these settings lies in the focus on the expected value of the return.
We propose an algorithm for learning ergodicity from data and demonstrate its effectiveness in an instructive, non-ergodic environment.
arXiv Detail & Related papers (2023-10-17T15:13:33Z)
- Harnessing Mixed Offline Reinforcement Learning Datasets via Trajectory Weighting [29.21380944341589]
We show that state-of-the-art offline RL algorithms are overly restrained by low-return trajectories and fail to exploit trajectories to the fullest.
We propose a reweighted sampling strategy that can be combined with any offline RL algorithm.
We empirically show that while CQL, IQL, and TD3+BC achieve only a part of the potential policy improvement on the original datasets, the same algorithms fully exploit the dataset when combined with the reweighted sampling strategy.
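A hedged sketch of return-based trajectory reweighting; the exponential weighting and temperature below are illustrative assumptions rather than the paper's exact scheme.

```python
# Sketch of return-weighted trajectory sampling: trajectories with higher
# returns are drawn more often when forming training batches.
import numpy as np

def reweighted_trajectory_sampler(traj_returns, temperature=1.0, seed=0):
    """traj_returns: (N,) realized returns. Returns a function that samples
    trajectory indices with probability increasing in return."""
    returns = np.asarray(traj_returns, dtype=float)
    z = (returns - returns.max()) / max(temperature, 1e-8)  # stabilized softmax logits
    probs = np.exp(z)
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return lambda batch_size: rng.choice(len(returns), size=batch_size, p=probs)
```

Any offline RL algorithm (CQL, IQL, TD3+BC, ...) can then draw its minibatches of transitions from the trajectories chosen by such a sampler.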
arXiv Detail & Related papers (2023-06-22T17:58:02Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Domain-Adjusted Regression or: ERM May Already Learn Features Sufficient for Out-of-Distribution Generalization [52.7137956951533]
We argue that devising simpler methods for learning predictors on existing features is a promising direction for future research.
We introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift.
Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions.
arXiv Detail & Related papers (2022-02-14T16:42:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.