Goal-Conditioned Predictive Coding for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2307.03406v2
- Date: Sat, 28 Oct 2023 19:42:50 GMT
- Title: Goal-Conditioned Predictive Coding for Offline Reinforcement Learning
- Authors: Zilai Zeng, Ce Zhang, Shijie Wang, Chen Sun
- Abstract summary: We investigate whether sequence modeling has the ability to condense trajectories into useful representations that enhance policy learning.
We introduce Goal-Conditioned Predictive Coding, a sequence modeling objective that yields powerful trajectory representations and leads to performant policies.
- Score: 24.300131097275298
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent work has demonstrated the effectiveness of formulating decision making
as supervised learning on offline-collected trajectories. Powerful sequence
models, such as GPT or BERT, are often employed to encode the trajectories.
However, the benefits of performing sequence modeling on trajectory data remain
unclear. In this work, we investigate whether sequence modeling has the ability
to condense trajectories into useful representations that enhance policy
learning. We adopt a two-stage framework that first leverages sequence models
to encode trajectory-level representations, and then learns a goal-conditioned
policy employing the encoded representations as its input. This formulation
allows us to consider many existing supervised offline RL methods as specific
instances of our framework. Within this framework, we introduce
Goal-Conditioned Predictive Coding (GCPC), a sequence modeling objective that
yields powerful trajectory representations and leads to performant policies.
Through extensive empirical evaluations on AntMaze, FrankaKitchen and
Locomotion environments, we observe that sequence modeling can have a
significant impact on challenging decision making tasks. Furthermore, we
demonstrate that GCPC learns a goal-conditioned latent representation encoding
the future trajectory, which enables competitive performance on all three
benchmarks.
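As a rough illustration of the two-stage framework described in the abstract (a sequence model condenses a trajectory into a representation, and a goal-conditioned policy then consumes that representation together with the current state and goal), here is a minimal PyTorch sketch. The module names, layer sizes, and mean pooling are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-stage setup: a transformer encoder summarizes a
# trajectory, and an MLP policy conditions on (state, goal, latent).
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn


class TrajectoryEncoder(nn.Module):
    """Stage 1: condense a trajectory of states into a compact latent."""

    def __init__(self, state_dim, latent_dim=64, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Linear(state_dim, latent_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=latent_dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, states):                  # states: (B, T, state_dim)
        tokens = self.encoder(self.embed(states))
        return tokens.mean(dim=1)               # (B, latent_dim) trajectory summary


class GoalConditionedPolicy(nn.Module):
    """Stage 2: map (state, goal, trajectory latent) to an action."""

    def __init__(self, state_dim, goal_dim, latent_dim, action_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim + latent_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, state, goal, latent):
        return self.net(torch.cat([state, goal, latent], dim=-1))


if __name__ == "__main__":
    B, T, state_dim, goal_dim, action_dim = 8, 32, 17, 2, 6
    encoder = TrajectoryEncoder(state_dim)
    policy = GoalConditionedPolicy(state_dim, goal_dim, 64, action_dim)

    trajectory = torch.randn(B, T, state_dim)
    latent = encoder(trajectory)                           # trajectory representation
    action = policy(trajectory[:, -1], torch.randn(B, goal_dim), latent)
    print(action.shape)                                    # torch.Size([8, 6])
```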
Related papers
- Causality-Enhanced Behavior Sequence Modeling in LLMs for Personalized Recommendation [47.29682938439268]
We propose a novel Counterfactual Fine-Tuning (CFT) method to improve user preference modeling.
We employ counterfactual reasoning to identify the causal effects of behavior sequences on model output.
Experiments on real-world datasets demonstrate that CFT effectively improves behavior sequence modeling.
arXiv Detail & Related papers (2024-10-30T08:41:13Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
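One way to read the QT objective summarized above is as a conditional-sequence-modeling (behavior cloning) term plus a term that pushes predicted actions toward high learned Q-values. The sketch below is a hedged illustration; the squared-error form, the weight `alpha`, and the `q_network` interface are assumptions, not the paper's exact formulation.

```python
# Hypothetical combination of a behavior-cloning (CSM) loss with an
# action-value maximization term, in the spirit of the QT summary above.
import torch
import torch.nn.functional as F


def qt_style_loss(policy_action, dataset_action, q_network, state, alpha=1.0):
    bc_loss = F.mse_loss(policy_action, dataset_action)              # CSM / BC term
    q_values = q_network(torch.cat([state, policy_action], dim=-1))  # learned Q(s, a)
    return bc_loss - alpha * q_values.mean()                         # encourage high Q
```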
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- Stitching Sub-Trajectories with Conditional Diffusion Model for Goal-Conditioned Offline RL [18.31263353823447]
We propose a model-based offline Goal-Conditioned Reinforcement Learning (Offline GCRL) method to acquire diverse goal-oriented skills.
In this paper, we use a diffusion model that generates future plans conditioned on the target goal and value, with the target value estimated from the goal-relabeled offline dataset.
We report state-of-the-art performance in the standard benchmark set of GCRL tasks, and demonstrate the capability to successfully stitch the segments of suboptimal trajectories in the offline data to generate high-quality plans.
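A hedged sketch of the kind of conditional denoising objective this summary describes is given below: a network predicts the noise added to a future plan, conditioned on the goal and an estimated value. The architecture, continuous timestep, and cosine noise schedule are simplified assumptions.

```python
# Simplified conditional denoising sketch for goal- and value-conditioned
# plan generation. Everything here is illustrative, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PlanDenoiser(nn.Module):
    def __init__(self, plan_dim, goal_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(plan_dim + goal_dim + 2, hidden),  # +1 for value, +1 for timestep
            nn.ReLU(),
            nn.Linear(hidden, plan_dim),
        )

    def forward(self, noisy_plan, goal, value, t):
        return self.net(torch.cat([noisy_plan, goal, value, t], dim=-1))


def denoising_loss(model, plan, goal, value):
    t = torch.rand(plan.shape[0], 1)                     # timestep in [0, 1)
    alpha_bar = torch.cos(0.5 * torch.pi * t) ** 2       # cumulative signal level
    noise = torch.randn_like(plan)
    noisy = alpha_bar.sqrt() * plan + (1 - alpha_bar).sqrt() * noise
    return F.mse_loss(model(noisy, goal, value, t), noise)
```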
arXiv Detail & Related papers (2024-02-11T15:23:13Z)
- Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement [67.1393112206885]
Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks.
We introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level.
We assess the effectiveness of ETPO within a simulated environment that models data science code generation as a series of multi-step interactive tasks.
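As a rough illustration of an entropy-augmented, token-level objective in the spirit of ETPO, the sketch below combines a per-token policy-gradient term with an entropy bonus; the advantage estimator and the coefficient `beta` are assumptions rather than the paper's exact formulation.

```python
# Illustrative token-level, entropy-augmented policy-gradient loss.
# Inputs are (batch, num_tokens) torch tensors, one entry per generated token.
def token_level_entropy_loss(log_probs, entropies, advantages, beta=0.01):
    pg_loss = -(log_probs * advantages.detach()).mean()  # per-token policy gradient
    return pg_loss - beta * entropies.mean()             # entropy bonus keeps policy stochastic
```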
arXiv Detail & Related papers (2024-02-09T07:45:26Z)
- A Tractable Inference Perspective of Offline RL [36.563229330549284]
A popular paradigm for offline Reinforcement Learning (RL) tasks is to first fit the offline trajectories to a sequence model, and then prompt the model for actions that lead to high expected return.
This paper highlights that tractability, the ability to exactly and efficiently answer various probabilistic queries, plays an important role in offline RL.
We propose Trifle, which bridges the gap between good sequence models and high expected returns at evaluation time.
arXiv Detail & Related papers (2023-10-31T19:16:07Z)
- Multi-Objective Decision Transformers for Offline Reinforcement Learning [7.386356540208436]
Offline RL is structured to derive policies from static trajectory data without requiring real-time environment interactions.
We reformulate offline RL as a multi-objective optimization problem, where prediction is extended to states and returns.
Our experiments on D4RL benchmark locomotion tasks reveal that our propositions allow for more effective utilization of the attention mechanism in the transformer model.
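The reformulation above can be read as training the transformer to predict states and returns alongside actions. A minimal sketch of such a combined objective follows; the squared-error heads and the weights `w_state` and `w_return` are assumptions.

```python
# Illustrative multi-objective loss: predict actions, next states, and
# returns-to-go from shared transformer features.
import torch.nn.functional as F


def multi_objective_loss(action_pred, action, state_pred, next_state,
                         return_pred, rtg, w_state=0.5, w_return=0.5):
    return (F.mse_loss(action_pred, action)
            + w_state * F.mse_loss(state_pred, next_state)
            + w_return * F.mse_loss(return_pred, rtg))
```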
arXiv Detail & Related papers (2023-08-31T00:47:58Z)
- Statistically Efficient Variance Reduction with Double Policy Estimation for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z)
- When Demonstrations Meet Generative World Models: A Maximum Likelihood Framework for Offline Inverse Reinforcement Learning [62.00672284480755]
This paper aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent.
Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving.
arXiv Detail & Related papers (2023-02-15T04:14:20Z)
- Variational Latent Branching Model for Off-Policy Evaluation [23.073461349048834]
We propose a variational latent branching model (VLBM) to learn the transition function of Markov decision processes (MDPs).
We introduce the branching architecture to improve the model's robustness against randomly initialized model weights.
We show that the VLBM outperforms existing state-of-the-art OPE methods in general.
arXiv Detail & Related papers (2023-01-28T02:20:03Z)
- Discrete Factorial Representations as an Abstraction for Goal Conditioned Reinforcement Learning [99.38163119531745]
We show that applying a discretizing bottleneck can improve performance in goal-conditioned RL setups.
We experimentally demonstrate improved expected returns on out-of-distribution goals, while still allowing for specifying goals with expressive structure.
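One concrete instance of a discretizing bottleneck is nearest-neighbor vector quantization with a straight-through gradient, sketched below; the codebook size, dimensions, and the VQ choice itself are assumptions rather than the paper's exact mechanism.

```python
# Minimal vector-quantization bottleneck with a straight-through estimator,
# as one illustrative form of a discretizing bottleneck.
import torch
import torch.nn as nn


class DiscreteBottleneck(nn.Module):
    def __init__(self, num_codes=64, code_dim=32):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, z):                                 # z: (B, code_dim)
        dists = torch.cdist(z, self.codebook.weight)      # (B, num_codes)
        codes = self.codebook(dists.argmin(dim=-1))       # nearest code vectors
        return z + (codes - z).detach()                   # straight-through gradient
```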
arXiv Detail & Related papers (2022-11-01T03:31:43Z)
- Target-Embedding Autoencoders for Supervised Representation Learning [111.07204912245841]
This paper analyzes a framework for improving generalization in a purely supervised setting, where the target space is high-dimensional.
We motivate and formalize the general framework of target-embedding autoencoders (TEA) for supervised prediction, learning intermediate latent representations jointly optimized to be both predictable from features and predictive of targets.
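A compact sketch of the TEA idea described above: targets are encoded into a latent that is trained jointly to reconstruct the targets and to be regressed from the features. The linear layers and the weight `lam` are illustrative assumptions.

```python
# Sketch of a target-embedding autoencoder objective: encode targets y into a
# latent z, decode z back to y, and regress z from features x.
import torch.nn as nn
import torch.nn.functional as F


class TargetEmbeddingAutoencoder(nn.Module):
    def __init__(self, x_dim, y_dim, z_dim=32):
        super().__init__()
        self.encode = nn.Linear(y_dim, z_dim)    # target -> latent
        self.decode = nn.Linear(z_dim, y_dim)    # latent -> target
        self.predict = nn.Linear(x_dim, z_dim)   # features -> latent

    def loss(self, x, y, lam=1.0):
        z = self.encode(y)
        recon = F.mse_loss(self.decode(z), y)    # latent is predictive of targets
        pred = F.mse_loss(self.predict(x), z)    # latent is predictable from features
        return recon + lam * pred
```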
arXiv Detail & Related papers (2020-01-23T02:37:10Z)
This list is automatically generated from the titles and abstracts of the papers in this site.