Value-Aided Conditional Supervised Learning for Offline RL
- URL: http://arxiv.org/abs/2402.02017v1
- Date: Sat, 3 Feb 2024 04:17:09 GMT
- Title: Value-Aided Conditional Supervised Learning for Offline RL
- Authors: Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung
- Abstract summary: Value-Aided Conditional Supervised Learning (VCS) is a method that synergizes the stability of RCSL with the stitching ability of value-based methods.
Based on the Neural Tangent Kernel analysis, VCS injects the value aid into the RCSL's loss function dynamically according to the trajectory return.
Our empirical studies reveal that VCS not only significantly outperforms both RCSL and value-based methods but also consistently achieves, or often surpasses, the highest trajectory returns.
- Score: 21.929683225837078
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL) has seen notable advancements through
return-conditioned supervised learning (RCSL) and value-based methods, yet each
approach comes with its own set of practical challenges. Addressing these, we
propose Value-Aided Conditional Supervised Learning (VCS), a method that
effectively synergizes the stability of RCSL with the stitching ability of
value-based methods. Based on the Neural Tangent Kernel analysis to discern
instances where value function may not lead to stable stitching, VCS injects
the value aid into the RCSL's loss function dynamically according to the
trajectory return. Our empirical studies reveal that VCS not only significantly
outperforms both RCSL and value-based methods but also consistently achieves,
or often surpasses, the highest trajectory returns across diverse offline RL
benchmarks. This breakthrough in VCS paves new paths in offline RL, pushing the
limits of what can be achieved and fostering further innovations.
Related papers
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL)
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM)
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
arXiv Detail & Related papers (2024-05-27T12:12:39Z) - Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate [4.6659670917171825]
Recurrent reinforcement learning (RL) consists of a context encoder based on recurrent neural networks (RNNs) for unobservable state prediction.
Previous RL methods face training stability issues due to the gradient instability of RNNs.
We propose Recurrent Off-policy RL with Context-Encoder-Specific Learning Rate (RESeL) to tackle this issue.
arXiv Detail & Related papers (2024-05-24T09:33:47Z) - Critic-Guided Decision Transformer for Offline Reinforcement Learning [28.211835303617118]
Critic-Guided Decision Transformer (CGDT)
Uses predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer.
Builds upon these insights, we propose a novel approach, which combines the predictability of long-term returns from value-based methods with the trajectory modeling capability of the Decision Transformer.
arXiv Detail & Related papers (2023-12-21T10:29:17Z) - Boosting Offline Reinforcement Learning via Data Rebalancing [104.3767045977716]
offline reinforcement learning (RL) is challenged by the distributional shift between learning policies and datasets.
We propose a simple yet effective method to boost offline RL algorithms based on the observation that resampling a dataset keeps the distribution support unchanged.
We dub our method ReD (Return-based Data Rebalance), which can be implemented with less than 10 lines of code change and adds negligible running time.
arXiv Detail & Related papers (2022-10-17T16:34:01Z) - RORL: Robust Offline Reinforcement Learning via Conservative Smoothing [72.8062448549897]
offline reinforcement learning can exploit the massive amount of offline data for complex decision-making tasks.
Current offline RL algorithms are generally designed to be conservative for value estimation and action selection.
We propose Robust Offline Reinforcement Learning (RORL) with a novel conservative smoothing technique.
arXiv Detail & Related papers (2022-06-06T18:07:41Z) - When does return-conditioned supervised learning work for offline
reinforcement learning? [51.899892382786526]
We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
arXiv Detail & Related papers (2022-06-02T15:05:42Z) - Offline Reinforcement Learning with Value-based Episodic Memory [19.12430651038357]
offline reinforcement learning (RL) shows promise of applying RL to real-world problems.
We propose Expectile V-Learning (EVL), which smoothly interpolates between the optimal value learning and behavior cloning.
We present a new offline method called Value-based Episodic Memory (VEM)
arXiv Detail & Related papers (2021-10-19T08:20:11Z) - Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z) - Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.