Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2409.08062v1
- Date: Thu, 12 Sep 2024 14:10:22 GMT
- Title: Q-value Regularized Decision ConvFormer for Offline Reinforcement Learning
- Authors: Teng Yan, Zhendong Ruan, Yaobang Cai, Yu Han, Wenxian Li, Yang Zhang
- Abstract summary: Decision Transformer (DT) has demonstrated exceptional capabilities in offline reinforcement learning.
Decision ConvFormer (DC) is easier than DT to understand in the context of modeling RL trajectories within a Markov Decision Process.
We propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's understanding of RL trajectories with a training term that maximizes action values.
- Score: 5.398202201395825
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: As a data-driven paradigm, offline reinforcement learning (Offline RL) has been formulated as sequence modeling, where the Decision Transformer (DT) has demonstrated exceptional capabilities. Unlike previous reinforcement learning methods that fit value functions or compute policy gradients, DT conditions an autoregressive model on expected returns, past states, and actions, using a causally masked Transformer to output the optimal action. However, due to the inconsistency between the sampled returns within a single trajectory and the optimal returns across multiple trajectories, it is challenging to set an expected return that yields the optimal action and stitches together suboptimal trajectories. Decision ConvFormer (DC) is easier to understand than DT in the context of modeling RL trajectories within a Markov Decision Process. We propose the Q-value Regularized Decision ConvFormer (QDC), which combines DC's understanding of RL trajectories with a term that maximizes action values, learned via dynamic programming during training. This ensures that the expected returns of the sampled actions are consistent with the optimal returns. QDC achieves excellent performance on the D4RL benchmark, outperforming or approaching the optimal level in all tested environments, and is particularly competitive in trajectory stitching.
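To make the regularization concrete, here is a minimal sketch (not the authors' code) of how a Q-value maximization term can be folded into a conditional sequence-modeling loss. The MLP stand-ins for the ConvFormer trunk and the critic, and the weight `eta`, are illustrative assumptions.

```python
# Minimal sketch: conditional sequence modeling loss plus a Q-value
# maximization term, in the spirit of QDC/QT. The policy head, the critic,
# and the weight `eta` are illustrative placeholders, not the paper's code.
import torch
import torch.nn as nn

state_dim, act_dim, hidden = 17, 6, 128

# Stand-in for the Decision ConvFormer trunk: maps (state, return-to-go)
# features to a predicted action.
policy = nn.Sequential(nn.Linear(state_dim + 1, hidden), nn.ReLU(),
                       nn.Linear(hidden, act_dim), nn.Tanh())

# Learned action-value function Q(s, a), assumed to be trained separately
# with dynamic programming (e.g., a TD/Bellman backup over dataset transitions).
critic = nn.Sequential(nn.Linear(state_dim + act_dim, hidden), nn.ReLU(),
                       nn.Linear(hidden, 1))

def qdc_loss(states, rtgs, actions, eta=0.1):
    """states: (B, S), rtgs: (B, 1) returns-to-go, actions: (B, A)."""
    pred = policy(torch.cat([states, rtgs], dim=-1))
    # 1) Conditional sequence modeling: imitate the dataset action.
    bc_loss = ((pred - actions) ** 2).mean()
    # 2) Q regularizer: push predicted actions toward high critic values,
    #    which is what allows stitching of suboptimal trajectories.
    q_term = critic(torch.cat([states, pred], dim=-1)).mean()
    return bc_loss - eta * q_term

batch = (torch.randn(32, state_dim), torch.randn(32, 1),
         torch.rand(32, act_dim) * 2 - 1)
loss = qdc_loss(*batch)
loss.backward()
```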
Related papers
- Return Augmented Decision Transformer for Off-Dynamics Reinforcement Learning [26.915055027485465]
We study offline off-dynamics reinforcement learning (RL) to enhance policy learning in a target domain with limited data.
Our approach centers on return-conditioned supervised learning (RCSL), particularly focusing on the Decision Transformer (DT).
We propose the Return Augmented Decision Transformer (RADT) method, where we augment the return in the source domain by aligning its distribution with that in the target domain.
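The summary does not specify how the alignment is performed; one simple realization is moment matching (a z-score transform) of source-domain returns onto the target-domain return distribution, sketched below with illustrative names.

```python
# Sketch of return augmentation by distribution alignment: rescale source-
# domain returns so their mean/std match the target-domain returns.
# Moment matching is an illustrative choice; RADT's exact alignment may differ.
import numpy as np

def align_returns(source_returns, target_returns, eps=1e-8):
    """Map source returns onto the target return distribution (z-score match)."""
    src = np.asarray(source_returns, dtype=np.float64)
    tgt = np.asarray(target_returns, dtype=np.float64)
    z = (src - src.mean()) / (src.std() + eps)
    return z * tgt.std() + tgt.mean()

# Example: abundant source-domain returns, a handful of target-domain returns.
rng = np.random.default_rng(0)
source = rng.normal(50.0, 10.0, size=10_000)
target = rng.normal(80.0, 5.0, size=100)
augmented = align_returns(source, target)
print(augmented.mean(), augmented.std())  # roughly 80 and 5
```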
arXiv Detail & Related papers (2024-10-30T20:46:26Z)
- Truncating Trajectories in Monte Carlo Policy Evaluation: an Adaptive Approach [51.76826149868971]
Policy evaluation via Monte Carlo simulation is at the core of many MC Reinforcement Learning (RL) algorithms.
As a quality index, we propose a surrogate of the mean squared error of a return estimator that uses trajectories of different lengths.
We present an adaptive algorithm called Robust and Iterative Data collection strategy Optimization (RIDO).
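As a rough illustration of adaptive data collection across trajectory lengths (not RIDO's actual surrogate objective), one can iteratively re-split the interaction budget in proportion to the estimated noise of each truncated-return estimate:

```python
# Illustrative sketch: allocate the per-iteration interaction budget across
# truncation horizons in proportion to the estimated standard deviation of
# the truncated-return estimator. This Neyman-style allocation is an
# assumption for illustration, not RIDO's exact objective.
import numpy as np

def allocate_budget(returns_by_length, budget):
    """returns_by_length: {horizon: list of observed truncated returns}."""
    stds = {h: np.std(r) + 1e-8 for h, r in returns_by_length.items()}
    total = sum(stds.values())
    # More interaction steps go to horizons whose return estimates are noisiest.
    return {h: max(1, int(budget * s / total)) for h, s in stds.items()}

rng = np.random.default_rng(1)
observed = {10: rng.normal(5, 1, 20).tolist(),
            50: rng.normal(20, 4, 20).tolist(),
            100: rng.normal(30, 9, 20).tolist()}
print(allocate_budget(observed, budget=1000))
```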
arXiv Detail & Related papers (2024-10-17T11:47:56Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- Critic-Guided Decision Transformer for Offline Reinforcement Learning [28.211835303617118]
Critic-Guided Decision Transformer (CGDT) combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer.
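One simple way a critic can guide trajectory modeling, shown purely as an illustration rather than CGDT's exact formulation, is to weight the action reconstruction loss by an exponentiated advantage computed from learned value functions:

```python
# Illustrative stand-in: advantage-weighted action reconstruction, where a
# learned Q and V reweight the trajectory-model loss toward high-value actions.
import torch

def critic_guided_loss(pred_actions, data_actions, q_values, v_values, temp=1.0):
    """pred/data actions: (B, A); q_values, v_values: (B,) from learned critics."""
    advantage = (q_values - v_values).detach()
    # Clamp to keep the exponential weights numerically stable.
    weights = torch.clamp(torch.exp(advantage / temp), max=100.0)
    per_sample = ((pred_actions - data_actions) ** 2).mean(dim=-1)
    return (weights * per_sample).mean()

loss = critic_guided_loss(torch.randn(32, 6), torch.randn(32, 6),
                          torch.randn(32), torch.randn(32))
```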
arXiv Detail & Related papers (2023-12-21T10:29:17Z)
- Rethinking Decision Transformer via Hierarchical Reinforcement Learning [54.3596066989024]
Decision Transformer (DT) is an innovative algorithm leveraging recent advances in the transformer architecture for reinforcement learning (RL).
We introduce a general sequence modeling framework for studying sequential decision making through the lens of Hierarchical RL.
We show DT emerges as a special case of this framework with certain choices of high-level and low-level policies, and discuss the potential failure of these choices.
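A minimal sketch of this hierarchical view, with illustrative module names: a high-level policy proposes a conditioning prompt (here a target return) and a low-level policy acts on it; DT corresponds to the special case where the prompt is the dataset return-to-go.

```python
# Hierarchical decomposition sketch: high-level policy emits a prompt g,
# low-level policy maps (state, g) to an action. Shapes/names are illustrative.
import torch
import torch.nn as nn

class HighLevel(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Linear(state_dim, 1)
    def forward(self, state):          # proposes a target return g
        return self.net(state)

class LowLevel(nn.Module):
    def __init__(self, state_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim + 1, 64), nn.ReLU(),
                                 nn.Linear(64, act_dim))
    def forward(self, state, g):       # acts conditioned on the prompt
        return self.net(torch.cat([state, g], dim=-1))

high, low = HighLevel(17), LowLevel(17, 6)
s = torch.randn(4, 17)
a = low(s, high(s))   # learned prompt; DT would instead feed the dataset return-to-go
```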
arXiv Detail & Related papers (2023-11-01T03:32:13Z)
- Optimal Sequential Decision-Making in Geosteering: A Reinforcement Learning Approach [0.0]
Trajectory adjustment decisions throughout the drilling process, called geosteering, affect subsequent choices and information gathering.
We use the Deep Q-Network (DQN) method, a model-free reinforcement learning (RL) approach that learns directly from the decision environment.
For two previously published synthetic geosteering scenarios, our results show that RL achieves high-quality outcomes comparable to the quasi-optimal ADP.
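For reference, the core DQN update is a TD regression against a target network; the state dimension and action count below are placeholders, not the geosteering scenarios' actual encodings.

```python
# Minimal DQN-style TD loss with a frozen target network.
import torch
import torch.nn as nn

state_dim, n_actions, gamma = 8, 4, 0.99
q_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # periodically synced in practice

def dqn_loss(s, a, r, s_next, done):
    # Q(s, a) for the actions actually taken.
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapped target: r + gamma * max_a' Q_target(s', a').
        target = r + gamma * (1 - done) * target_net(s_next).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)

batch = (torch.randn(32, state_dim), torch.randint(0, n_actions, (32,)),
         torch.randn(32), torch.randn(32, state_dim), torch.zeros(32))
dqn_loss(*batch).backward()
```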
arXiv Detail & Related papers (2023-10-07T10:49:30Z)
- Supervised Pretraining Can Learn In-Context Reinforcement Learning [96.62869749926415]
In this paper, we study the in-context learning capabilities of transformers in decision-making problems.
We introduce and study Decision-Pretrained Transformer (DPT), a supervised pretraining method where the transformer predicts an optimal action.
We find that the pretrained transformer can be used to solve a range of RL problems in-context, exhibiting both exploration online and conservatism offline.
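A simplified sketch of the pretraining objective: the model receives an in-context dataset plus a query state and is trained with cross-entropy to predict the optimal action. The flat set encoder below is a stand-in for DPT's transformer.

```python
# Decision-pretraining sketch: predict the optimal action for a query state
# given an in-context dataset of (state, one-hot action, reward) tuples.
import torch
import torch.nn as nn

state_dim, n_actions, ctx_len = 5, 3, 20
tuple_dim = state_dim + n_actions + 1          # (s, one-hot a, r)

encoder = nn.Sequential(nn.Linear(tuple_dim, 64), nn.ReLU())
head = nn.Sequential(nn.Linear(64 + state_dim, 64), nn.ReLU(),
                     nn.Linear(64, n_actions))

def dpt_loss(context, query_state, optimal_action):
    """context: (B, ctx_len, tuple_dim); optimal_action: (B,) integer labels."""
    summary = encoder(context).mean(dim=1)     # permutation-invariant pooling
    logits = head(torch.cat([summary, query_state], dim=-1))
    return nn.functional.cross_entropy(logits, optimal_action)

loss = dpt_loss(torch.randn(8, ctx_len, tuple_dim), torch.randn(8, state_dim),
                torch.randint(0, n_actions, (8,)))
loss.backward()
```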
arXiv Detail & Related papers (2023-06-26T17:58:50Z)
- Decision S4: Efficient Sequence-Based RL via State Spaces Layers [87.3063565438089]
We present an off-policy training procedure that works with trajectories, while still maintaining the training efficiency of the S4 model.
We also present an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism.
arXiv Detail & Related papers (2023-06-08T13:03:53Z)
- Emergent Agentic Transformer from Chain of Hindsight Experience [96.56164427726203]
We show, for the first time, that a simple transformer-based model performs competitively with both temporal-difference and imitation-learning-based approaches.
arXiv Detail & Related papers (2023-05-26T00:43:02Z)
- Q-learning Decision Transformer: Leveraging Dynamic Programming for Conditional Sequence Modelling in Offline RL [0.0]
Decision Transformer (DT) combines the conditional policy approach and a transformer architecture.
DT lacks stitching ability -- one of the critical abilities for offline RL to learn the optimal policy.
We propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT.
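A sketch of return relabeling in the spirit of QDT (the exact scheme may differ): each return-to-go is replaced by a Bellman-consistent backup floored by a learned value estimate, so better continuations discovered elsewhere in the dataset can raise the conditioning target before the sequence model is trained.

```python
# Relabel returns-to-go with a learned value estimate to enable stitching.
# `value_fn` is a placeholder for a critic learned offline, not QDT's code.
import numpy as np

def relabel_returns(rewards, states, value_fn, gamma=0.99):
    """Backward pass over one trajectory; returns relabeled returns-to-go."""
    rtg = np.zeros(len(rewards))
    next_rtg = 0.0
    for t in reversed(range(len(rewards))):
        # Bellman-consistent backup through the relabeled successor, floored
        # by the critic's estimate so good continuations found in *other*
        # trajectories can raise the target (the stitching effect).
        next_rtg = max(rewards[t] + gamma * next_rtg, value_fn(states[t]))
        rtg[t] = next_rtg
    return rtg

# Toy usage with a constant "critic".
rewards = [0.0, 0.0, 1.0]
states = [np.zeros(3)] * 3
print(relabel_returns(rewards, states, value_fn=lambda s: 0.5))
```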
arXiv Detail & Related papers (2022-09-08T18:26:39Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.