Reinforcement Learning as One Big Sequence Modeling Problem
- URL: http://arxiv.org/abs/2106.02039v1
- Date: Thu, 3 Jun 2021 17:58:51 GMT
- Title: Reinforcement Learning as One Big Sequence Modeling Problem
- Authors: Michael Janner, Qiyang Li, Sergey Levine
- Abstract summary: Reinforcement learning (RL) is typically concerned with estimating single-step policies or single-step models.
We view RL as a sequence modeling problem, with the goal being to predict a sequence of actions that leads to a sequence of high rewards.
- Score: 84.84564880157149
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement learning (RL) is typically concerned with estimating
single-step policies or single-step models, leveraging the Markov property to
factorize the problem in time. However, we can also view RL as a sequence
modeling problem, with the goal being to predict a sequence of actions that
leads to a sequence of high rewards. Viewed in this way, it is tempting to
consider whether powerful, high-capacity sequence prediction models that work
well in other domains, such as natural-language processing, can also provide
simple and effective solutions to the RL problem. To this end, we explore how
RL can be reframed as "one big sequence modeling" problem, using
state-of-the-art Transformer architectures to model distributions over
sequences of states, actions, and rewards. Addressing RL as a sequence modeling
problem significantly simplifies a range of design decisions: we no longer
require separate behavior policy constraints, as is common in prior work on
offline model-free RL, and we no longer require ensembles or other epistemic
uncertainty estimators, as is common in prior work on model-based RL. All of
these roles are filled by the same Transformer sequence model. In our
experiments, we demonstrate the flexibility of this approach across
long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and
offline RL.
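To make the framing concrete, here is a minimal sketch (not the authors' released code) of the preprocessing step the abstract describes: flattening trajectories of states, actions, and rewards into one token stream that a standard autoregressive Transformer can model. The dimension-wise discretization scheme and the bin count are illustrative assumptions.

```python
import numpy as np

N_BINS = 100  # tokens per continuous dimension; an illustrative choice

def discretize(x, low, high):
    """Map continuous values to integer tokens in [0, N_BINS)."""
    x = np.clip((x - low) / (high - low), 0.0, 1.0 - 1e-8)
    return (x * N_BINS).astype(np.int64)

def trajectory_to_tokens(states, actions, rewards, bounds):
    """Interleave discretized s_t, a_t, r_t into one flat token sequence."""
    tokens = []
    for s, a, r in zip(states, actions, rewards):
        tokens.extend(discretize(np.asarray(s), *bounds["state"]))
        tokens.extend(discretize(np.asarray(a), *bounds["action"]))
        tokens.extend(discretize(np.asarray([r]), *bounds["reward"]))
    return np.array(tokens)
```

Once trajectories are in this form, the same next-token model can be decoded for dynamics prediction, imitation, or reward-guided planning, which is what allows a single sequence model to fill the roles listed above.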
Related papers
- Drama: Mamba-Enabled Model-Based Reinforcement Learning Is Sample and Parameter Efficient [9.519619751861333]
We propose a world model built on Mamba, a state space model (SSM).
It achieves $O(n)$ memory and computational complexity while effectively capturing long-term dependencies.
This model is accessible and can be trained on an off-the-shelf laptop.
arXiv Detail & Related papers (2024-10-11T15:10:40Z)
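As a rough illustration of the complexity claim in the Drama entry above, the sketch below shows why an SSM-style world model scales linearly with sequence length: history is compressed into a fixed-size recurrent state. The matrices and shapes are placeholders, not Mamba's or Drama's actual parameterization (which, among other things, uses input-dependent dynamics).

```python
import numpy as np

def ssm_scan(A, B, C, inputs):
    """Linear recurrence h_t = A h_{t-1} + B x_t with readout y_t = C h_t."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x in inputs:           # a single pass over the sequence: O(n) time
        h = A @ h + B @ x      # fixed-size state, so constant memory per step
        outputs.append(C @ h)
    return np.stack(outputs)
```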
- A Tractable Inference Perspective of Offline RL [36.563229330549284]
A popular paradigm for offline Reinforcement Learning (RL) tasks is to first fit the offline trajectories to a sequence model, and then prompt the model for actions that lead to high expected return.
This paper highlights that tractability, the ability to exactly and efficiently answer various probabilistic queries, plays an important role in offline RL.
We propose Trifle, which bridges the gap between good sequence models and high expected returns at evaluation time.
arXiv Detail & Related papers (2023-10-31T19:16:07Z)
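To make the tractability point in the Trifle entry above concrete: a central query in this setting is the expected return of an action under the learned sequence model. An autoregressive model can generally only approximate it by sampling, as in the hedged sketch below (`model.sample_rollout` is a hypothetical interface, not Trifle's API), whereas a tractable probabilistic model can answer such expectation queries exactly and efficiently.

```python
def monte_carlo_expected_return(model, state, action, n_samples=64):
    """Approximate E[return | state, action] by sampling model rollouts."""
    returns = []
    for _ in range(n_samples):
        rollout = model.sample_rollout(state, action)       # hypothetical call
        returns.append(sum(step.reward for step in rollout))
    return sum(returns) / n_samples
```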
- Learning a model is paramount for sample efficiency in reinforcement learning control of PDEs [5.488334211013093]
We show that learning an actuated model in parallel to training the RL agent significantly reduces the total amount of required data sampled from the real system.
We also show that iteratively updating the model is of major importance to avoid biases in the RL training.
arXiv Detail & Related papers (2023-02-14T16:14:39Z)
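The training scheme described in the entry above can be pictured as a Dyna-style loop in which the model is refit as new real-system data arrives. The sketch below is a generic rendering of that idea, not the paper's exact algorithm, and `env`, `agent`, and `model` are hypothetical objects with the methods shown.

```python
def train_with_parallel_model(env, agent, model, iterations, n_model_rollouts):
    dataset = []
    for _ in range(iterations):
        action = agent.act(env.observe())
        transition = env.step(action)        # costly real-system sample
        dataset.append(transition)
        model.fit(dataset)                   # iterative refit avoids stale-model bias
        agent.update([transition])           # learn from real data
        agent.update(model.rollouts(n_model_rollouts))  # plus cheap model-generated data
    return agent
```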
- Predictable MDP Abstraction for Unsupervised Model-Based RL [93.91375268580806]
We propose predictable MDP abstraction (PMA).
Instead of training a predictive model on the original MDP, we train a model on a transformed MDP with a learned action space.
We theoretically analyze PMA and empirically demonstrate that PMA leads to significant improvements over prior unsupervised model-based RL approaches.
arXiv Detail & Related papers (2023-02-08T07:37:51Z)
- When to Update Your Model: Constrained Model-based Reinforcement Learning [50.74369835934703]
We propose a novel and general theoretical scheme for a non-decreasing performance guarantee of model-based RL (MBRL).
Our follow-up derived bounds reveal the relationship between model shifts and performance improvement.
A further example demonstrates that learning models from a dynamically-varying number of explorations benefits the eventual returns.
arXiv Detail & Related papers (2022-10-15T17:57:43Z)
- Simplifying Model-based RL: Learning Representations, Latent-space Models, and Policies with One Objective [142.36200080384145]
We propose a single objective that jointly optimizes a latent-space model and policy to achieve high returns while remaining self-consistent.
We demonstrate that the resulting algorithm matches or improves the sample-efficiency of the best prior model-based and model-free RL methods.
arXiv Detail & Related papers (2022-09-18T03:51:58Z)
- Decision Transformer: Reinforcement Learning via Sequence Modeling [102.86873656751489]
We present a framework that abstracts Reinforcement Learning (RL) as a sequence modeling problem.
We present Decision Transformer, an architecture that casts the problem of RL as conditional sequence modeling.
Despite its simplicity, Decision Transformer matches or exceeds the performance of state-of-the-art offline RL baselines on Atari, OpenAI Gym, and Key-to-Door tasks.
arXiv Detail & Related papers (2021-06-02T17:53:39Z)
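The conditioning signal behind Decision Transformer (entry above) is the return-to-go. The short sketch below computes it and notes how it is used; the surrounding model interface is left abstract rather than reproducing the paper's architecture.

```python
def returns_to_go(rewards):
    """R_hat_t = sum of rewards from step t to the end of the trajectory."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return list(reversed(rtg))

# Training sequences interleave (R_hat_t, s_t, a_t); at evaluation time the first
# return-to-go is set to a desired target and decremented by the rewards actually
# received, so choosing actions becomes conditional sequence generation.
```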
- Variational Model-based Policy Optimization [34.80171122943031]
Model-based reinforcement learning (RL) algorithms allow us to combine model-generated data with data collected from interaction with the real system in order to alleviate the data-efficiency problem in RL.
We propose an objective function, a variational lower bound of the log-likelihood, for jointly learning and improving the model and policy.
Our experiments on a number of continuous control tasks show that, despite being more complex, our model-based (E-step) algorithm, called variational model-based policy optimization (VMBPO), is more sample-efficient.
arXiv Detail & Related papers (2020-06-09T18:30:15Z)
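As a rough rendering of the kind of objective the entry above describes, under standard control-as-inference assumptions (binary optimality variables $\mathcal{O}_t$ with $p(\mathcal{O}_t = 1 \mid s_t, a_t) \propto \exp(r(s_t, a_t))$, which may differ in detail from the paper's exact construction), a variational lower bound on the log-likelihood of optimality is

$$\log p(\mathcal{O}_{1:T}) \;\ge\; \mathbb{E}_{q(\tau)}\!\left[\sum_{t=1}^{T} r(s_t, a_t)\right] \;-\; \mathrm{KL}\!\left(q(\tau)\,\|\,p(\tau)\right),$$

where $q(\tau)$ is induced by the learned model and policy, so maximizing the bound updates both jointly.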
- MOPO: Model-based Offline Policy Optimization [183.6449600580806]
Offline reinforcement learning (RL) refers to the problem of learning policies entirely from a large batch of previously collected data.
We show that an existing model-based RL algorithm already produces significant gains in the offline setting.
We propose to modify existing model-based RL methods by penalizing the rewards they use with an estimate of the uncertainty of the learned dynamics.
arXiv Detail & Related papers (2020-05-27T08:46:41Z)
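A hedged sketch of the reward modification the MOPO entry describes: the model's predicted reward is reduced by an uncertainty estimate of the learned dynamics. Ensemble disagreement is used below as one common uncertainty proxy, and `lam` is an illustrative trade-off coefficient; neither is claimed to match the paper's exact estimator or tuning.

```python
import numpy as np

def penalized_reward(ensemble_next_state_preds, predicted_reward, lam=1.0):
    """r_tilde(s, a) = r_hat(s, a) - lam * u(s, a), with u(s, a) an uncertainty proxy."""
    disagreement = np.std(ensemble_next_state_preds, axis=0).max()  # spread across ensemble members
    return predicted_reward - lam * disagreement
```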
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences of its use.