Critic-Guided Decision Transformer for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2312.13716v1
- Date: Thu, 21 Dec 2023 10:29:17 GMT
- Title: Critic-Guided Decision Transformer for Offline Reinforcement Learning
- Authors: Yuanfu Wang, Chao Yang, Ying Wen, Yu Liu, Yu Qiao
- Abstract summary: The Critic-Guided Decision Transformer (CGDT) combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer.
- Score: 28.211835303617118
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in offline reinforcement learning (RL) have underscored
the capabilities of Return-Conditioned Supervised Learning (RCSL), a paradigm
that learns the action distribution based on target returns for each state in a
supervised manner. However, prevailing RCSL methods largely focus on
deterministic trajectory modeling, disregarding stochastic state transitions
and the diversity of future trajectory distributions. A fundamental challenge
arises from the inconsistency between the sampled returns within individual
trajectories and the expected returns across multiple trajectories.
Fortunately, value-based methods offer a solution by leveraging a value
function to approximate the expected returns, thereby addressing the
inconsistency effectively. Building upon these insights, we propose a novel
approach, termed the Critic-Guided Decision Transformer (CGDT), which combines
the predictability of long-term returns from value-based methods with the
trajectory modeling capability of the Decision Transformer. By incorporating a
learned value function, known as the critic, CGDT ensures a direct alignment
between the specified target returns and the expected returns of actions. This
integration bridges the gap between the deterministic nature of RCSL and the
probabilistic characteristics of value-based methods. Empirical evaluations on
stochastic environments and D4RL benchmark datasets demonstrate the superiority
of CGDT over traditional RCSL methods. These results highlight the potential of
CGDT to advance the state of the art in offline RL and extend the applicability
of RCSL to a wide range of RL tasks.
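The abstract describes training the transformer so that its proposed actions align with the critic's expected return for the specified target. The paper's exact objective is not given here; the sketch below is only an illustration of that idea, and all names (`cgdt_loss`, `critic_q`, `lambda_critic`) are hypothetical, not taken from the paper.

```python
# Illustrative sketch of a critic-guided RCSL objective: a trajectory-modeling
# (behavior cloning) term plus a term aligning the critic's expected return
# for the proposed action with the specified target return.

def behavior_cloning_loss(predicted_action, dataset_action):
    """Squared error between the transformer's action and the logged action."""
    return sum((p - a) ** 2 for p, a in zip(predicted_action, dataset_action))

def critic_alignment_loss(critic_q, state, predicted_action, target_return):
    """Penalize mismatch between the target return and the critic's
    expected return Q(s, a) for the proposed action."""
    return (target_return - critic_q(state, predicted_action)) ** 2

def cgdt_loss(critic_q, state, predicted_action, dataset_action,
              target_return, lambda_critic=1.0):
    # The critic term anchors actions to *expected* returns, addressing the
    # inconsistency between sampled per-trajectory returns and expected
    # returns across trajectories noted in the abstract.
    return (behavior_cloning_loss(predicted_action, dataset_action)
            + lambda_critic * critic_alignment_loss(
                critic_q, state, predicted_action, target_return))
```

With a toy linear critic `critic = lambda s, a: s[0] + a[0]`, a state `[0.5]`, proposed action `[1.0]`, logged action `[0.0]`, and target return `2.0`, the loss is `1.0 + (2.0 - 1.5)**2 = 1.25`.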
Related papers
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state of the art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
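The QT summary says a value-maximizing term is added to the CSM training loss. A minimal sketch of that combination, assuming a scalar weighting `alpha` (the function name and weighting scheme are illustrative, not the paper's):

```python
def qt_objective(csm_loss, critic_q, state, predicted_action, alpha=0.5):
    """CSM imitation loss minus a scaled action-value: minimizing this
    trades off staying close to the dataset's action distribution against
    choosing actions the learned critic scores highly."""
    return csm_loss - alpha * critic_q(state, predicted_action)
```

For example, with a toy critic `lambda s, a: a`, a CSM loss of `1.0`, a proposed action `2.0`, and `alpha=0.5`, the objective is `1.0 - 0.5 * 2.0 = 0.0`.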
arXiv Detail & Related papers (2024-05-27T12:12:39Z) - Value-Aided Conditional Supervised Learning for Offline RL [21.929683225837078]
Value-Aided Conditional Supervised Learning (VCS) is a method that synergizes the stability of RCSL with the stitching ability of value-based methods.
Based on a Neural Tangent Kernel analysis, VCS dynamically injects the value aid into the RCSL loss function according to the trajectory return.
Our empirical studies reveal that VCS significantly outperforms both RCSL and value-based methods, and consistently achieves, and often surpasses, the highest trajectory returns.
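VCS weights the value-based term in the RCSL loss according to the trajectory return. The paper derives its schedule from an NTK analysis; the linear weighting below is only an assumed illustration of the general idea, and the names are hypothetical.

```python
def vcs_loss(rcsl_loss, value_aid, trajectory_return, return_max):
    """Blend an RCSL loss with a value-based auxiliary term, weighting the
    value term more heavily on low-return trajectories, where stitching via
    the value function plausibly helps most. The linear schedule here is an
    illustration; the paper's weighting comes from an NTK analysis."""
    weight = 1.0 - trajectory_return / return_max
    return rcsl_loss + weight * value_aid
```

A trajectory at half the maximum return gets weight `0.5`: `vcs_loss(1.0, 2.0, 50.0, 100.0)` is `2.0`, while a maximum-return trajectory falls back to the pure RCSL loss.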
arXiv Detail & Related papers (2024-02-03T04:17:09Z) - Non-ergodicity in reinforcement learning: robustness via ergodicity transformations [8.44491527275706]
Application areas for reinforcement learning (RL) include autonomous driving, precision agriculture, and finance, yet RL algorithms often lack robustness in such safety-critical settings.
We argue that a fundamental issue contributing to this lack of robustness lies in the focus on the expected value of the return.
We propose an algorithm for learning ergodicity from data and demonstrate its effectiveness in an instructive, non-ergodic environment.
arXiv Detail & Related papers (2023-10-17T15:13:33Z) - Statistically Efficient Variance Reduction with Double Policy Estimation
for Off-Policy Evaluation in Sequence-Modeled Reinforcement Learning [53.97273491846883]
We propose DPE: an RL algorithm that blends offline sequence modeling and offline reinforcement learning with Double Policy Estimation.
We validate our method in multiple tasks of OpenAI Gym with D4RL benchmarks.
arXiv Detail & Related papers (2023-08-28T20:46:07Z) - Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level
Stability and High-Level Behavior [51.60683890503293]
We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling.
We show that pure supervised cloning can generate trajectories matching the per-time step distribution of arbitrary expert trajectories.
arXiv Detail & Related papers (2023-07-27T04:27:26Z) - Reinforcement Learning from Diverse Human Preferences [68.4294547285359]
This paper develops a method for crowd-sourcing preference labels and learning from diverse human preferences.
The proposed method is tested on a variety of tasks in DMcontrol and Meta-world.
It has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback.
arXiv Detail & Related papers (2023-01-27T15:18:54Z) - Backward Imitation and Forward Reinforcement Learning via Bi-directional
Model Rollouts [11.4219428942199]
Traditional model-based reinforcement learning (RL) methods generate forward rollout traces using the learnt dynamics model.
In this paper, we propose the backward imitation and forward reinforcement learning (BIFRL) framework.
BIFRL empowers the agent both to reach and to explore from high-value states more efficiently.
arXiv Detail & Related papers (2022-08-04T04:04:05Z) - When does return-conditioned supervised learning work for offline
reinforcement learning? [51.899892382786526]
We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
arXiv Detail & Related papers (2022-06-02T15:05:42Z) - Foresee then Evaluate: Decomposing Value Estimation with Latent Future
Prediction [37.06232589005015]
The value function is a central notion in reinforcement learning (RL).
We propose Value Decomposition with Future Prediction (VDFP).
We analytically decompose the value function into a latent future dynamics part and a policy-independent trajectory return part, inducing a way to model latent dynamics and returns separately in value estimation.
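The VDFP summary describes splitting value estimation into a latent future-dynamics model and a policy-independent return model. A minimal sketch of that composition, with all function names assumed for illustration:

```python
def vdfp_value(state, policy, predict_latent_future, return_model):
    """Decomposed value estimate: first predict a latent summary of the
    future trajectory under the policy, then map that latent to a return
    with a policy-independent return model."""
    latent_future = predict_latent_future(state, policy)
    return return_model(latent_future)
```

For example, with toy components `predict = lambda s, p: [s + 1.0]` and `ret = lambda z: 2.0 * z[0]`, `vdfp_value(1.0, None, predict, ret)` evaluates to `4.0`; the point is that the two models can be trained and swapped independently.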
arXiv Detail & Related papers (2021-03-03T07:28:56Z) - COMBO: Conservative Offline Model-Based Policy Optimization [120.55713363569845]
Uncertainty estimation with complex models, such as deep neural networks, can be difficult and unreliable.
We develop a new model-based offline RL algorithm, COMBO, that regularizes the value function on out-of-support state-action pairs.
We find that COMBO consistently performs as well as or better than prior offline model-free and model-based methods.
arXiv Detail & Related papers (2021-02-16T18:50:32Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences of its use.