Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning
- URL: http://arxiv.org/abs/2402.02017v2
- Date: Tue, 22 Oct 2024 12:46:09 GMT
- Title: Adaptive $Q$-Aid for Conditional Supervised Learning in Offline Reinforcement Learning
- Authors: Jeonghye Kim, Suyoung Lee, Woojun Kim, Youngchul Sung
- Abstract summary: $Q$-Aided Conditional Supervised Learning (QCS) combines the stability of RCSL with the stitching capability of $Q$-functions.
QCS adaptively integrates $Q$-aid into RCSL's loss function based on trajectory return.
- Score: 20.07425661382103
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Offline reinforcement learning (RL) has progressed with return-conditioned supervised learning (RCSL), but its lack of stitching ability remains a limitation. We introduce $Q$-Aided Conditional Supervised Learning (QCS), which effectively combines the stability of RCSL with the stitching capability of $Q$-functions. By analyzing $Q$-function over-generalization, which impairs stable stitching, QCS adaptively integrates $Q$-aid into RCSL's loss function based on trajectory return. Empirical results show that QCS significantly outperforms RCSL and value-based methods, consistently achieving or exceeding the maximum trajectory returns across diverse offline RL benchmarks.
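As a concrete illustration of the recipe in the abstract, the following PyTorch sketch augments a return-conditioned behavior-cloning (RCSL) loss with a $Q$-aid term whose weight is set from the trajectory return. The specific weighting rule, the network shapes, and the `q_fn` critic interface are illustrative assumptions, not the paper's exact formulation.
```python
# Hedged sketch: RCSL regression loss plus a Q-aid term weighted by trajectory return.
import torch
import torch.nn as nn

class ReturnConditionedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim),
        )

    def forward(self, state, return_to_go):
        # return_to_go: [batch, 1] scalar conditioning input
        return self.net(torch.cat([state, return_to_go], dim=-1))

def qcs_loss(policy, q_fn, state, action, return_to_go, traj_return, max_return):
    pred_action = policy(state, return_to_go)
    rcsl_loss = ((pred_action - action) ** 2).mean()      # plain RCSL regression term
    # Assumed heuristic: give more Q-aid on low-return trajectories, where
    # stitching matters most, and little on near-optimal ones.
    w = torch.clamp(1.0 - traj_return / max_return, 0.0, 1.0).mean()
    q_aid = -q_fn(state, pred_action).mean()              # push predicted actions toward high Q
    return rcsl_loss + w * q_aid
```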
Related papers
- How to Provably Improve Return Conditioned Supervised Learning? [26.915055027485465]
We propose a principled and simple framework called Reinforced RCSL.
The key innovation of our framework is the introduction of a concept we call the in-distribution optimal return-to-go.
Our theoretical analysis demonstrates that Reinforced RCSL can consistently outperform the standard RCSL approach.
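As a loose illustration only, one way to read the in-distribution optimal return-to-go is as the best return-to-go actually observed in the dataset near the current state, used as the conditioning target instead of an arbitrary high return. The coarse state bucketing below is a hypothetical construction for illustration, not the paper's definition.
```python
# Hypothetical lookup of the best return-to-go seen in the data near a state.
from collections import defaultdict
import numpy as np

def build_indist_optimal_rtg(states, returns_to_go, n_bins=20):
    """Map a coarse state bucket to the max return-to-go observed there."""
    lo, hi = states.min(axis=0), states.max(axis=0)
    table = defaultdict(lambda: -np.inf)
    for s, g in zip(states, returns_to_go):
        key = tuple(np.floor((s - lo) / (hi - lo + 1e-8) * n_bins).astype(int))
        table[key] = max(table[key], g)
    return table, lo, hi

def indist_optimal_rtg(table, lo, hi, state, n_bins=20, default=0.0):
    key = tuple(np.floor((state - lo) / (hi - lo + 1e-8) * n_bins).astype(int))
    return table[key] if key in table else default
```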
arXiv Detail & Related papers (2025-06-10T05:37:51Z)
- Bridging Supervised and Temporal Difference Learning with $Q$-Conditioned Maximization [23.468621564156056]
Supervised learning (SL) has emerged as an effective approach for offline reinforcement learning (RL) due to its simplicity, stability, and efficiency.
Recent studies show that SL methods lack the trajectory stitching capability typically associated with temporal difference (TD)-based approaches.
We introduce $Q$-conditioned supervised learning for offline goal-conditioned RL.
arXiv Detail & Related papers (2025-06-01T02:49:26Z)
- Learning from Suboptimal Data in Continuous Control via Auto-Regressive Soft Q-Network [23.481553466650453]
We propose Auto-Regressive Soft Q-learning (ARSQ), a value-based RL algorithm that models Q-values in a coarse-to-fine, auto-regressive manner.
ARSQ decomposes the continuous action space into discrete spaces in a coarse-to-fine hierarchy, enhancing sample efficiency for fine-grained continuous control tasks.
It auto-regressively predicts dimensional action advantages within each decision step, enabling more effective decision-making in continuous control tasks.
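A minimal sketch of the coarse-to-fine, auto-regressive factorization described above: each continuous action dimension is discretized into bins, and per-dimension advantages are predicted conditioned on the dimensions chosen so far. Bin counts, network sizes, and the bin-to-value mapping are assumptions for illustration.
```python
# Hedged sketch of auto-regressive, per-dimension advantage prediction.
import torch
import torch.nn as nn

class AutoRegressiveAdvantage(nn.Module):
    def __init__(self, state_dim, action_dims, bins=11, hidden=256):
        super().__init__()
        self.bins = bins
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + d, hidden), nn.ReLU(),  # condition on earlier dims
                nn.Linear(hidden, bins),                      # advantage per discrete bin
            )
            for d in range(action_dims)
        ])

    def select_action(self, state):
        chosen = []
        for head in self.heads:
            inp = torch.cat([state] + chosen, dim=-1)
            adv = head(inp)                                   # [batch, bins]
            idx = adv.argmax(dim=-1, keepdim=True)
            a_d = idx.float() / (self.bins - 1) * 2.0 - 1.0   # map bin index to [-1, 1]
            chosen.append(a_d)
        return torch.cat(chosen, dim=-1)
```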
arXiv Detail & Related papers (2025-02-01T03:04:53Z)
- Q-WSL: Optimizing Goal-Conditioned RL with Weighted Supervised Learning via Dynamic Programming [22.359171999254706]
A novel class of advanced algorithms, termed Goal-Conditioned Weighted Supervised Learning (GCWSL), has recently emerged to tackle the challenges posed by sparse rewards in goal-conditioned reinforcement learning (RL).
GCWSL consistently delivers strong performance across a diverse set of goal-reaching tasks due to its simplicity, effectiveness, and stability.
However, GCWSL methods lack a crucial capability known as trajectory stitching, which is essential for learning optimal policies when faced with unseen skills during testing.
In this paper, we propose Q-learning Weighted Supervised Learning (Q-WSL), a novel framework designed to overcome the limitations of GCWSL.
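A generic sketch of the Q-learning-plus-weighted-supervised-learning pattern the title points at: goal-conditioned behavior cloning weighted by exponentiated advantages from dynamic-programming-trained critics. The exp-advantage rule and the `policy.log_prob` interface are assumptions borrowed from the weighted-SL literature, not necessarily Q-WSL's exact objective.
```python
# Hedged sketch: goal-conditioned behavior cloning weighted by a learned Q-advantage.
import torch

def q_weighted_bc_loss(policy, q_fn, v_fn, state, action, goal, temperature=1.0):
    adv = q_fn(state, action, goal) - v_fn(state, goal)       # advantage from DP-style critics
    weight = torch.exp(adv / temperature).clamp(max=100.0)    # favor actions the critic likes
    log_prob = policy.log_prob(action, state, goal)           # assumed policy API
    return -(weight.detach() * log_prob).mean()
```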
arXiv Detail & Related papers (2024-10-09T08:00:12Z)
- Equivariant Offline Reinforcement Learning [7.822389399560674]
We investigate the use of $SO(2)$-equivariant neural networks for offline RL with a limited number of demonstrations.
Our experimental results show that equivariant versions of Conservative Q-Learning (CQL) and Implicit Q-Learning (IQL) outperform their non-equivariant counterparts.
arXiv Detail & Related papers (2024-06-20T03:02:49Z)
- Switching the Loss Reduces the Cost in Batch (Offline) Reinforcement Learning [57.154674117714265]
We show that the number of samples needed to learn a near-optimal policy with FQI-log (fitted Q-iteration trained with a log loss rather than a squared loss) scales with the accumulated cost of the optimal policy.
We empirically verify that FQI-log uses fewer samples than FQI trained with squared loss on problems where the optimal policy reliably achieves the goal.
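The loss switch itself is easy to sketch: in fitted Q-iteration, regress the critic onto Bellman targets with a binary cross-entropy (log) loss instead of squared error. The normalization of predictions and targets into [0, 1] below is an assumption for illustration.
```python
# Hedged sketch of the squared-loss vs. log-loss regression step in fitted Q-iteration.
import torch
import torch.nn.functional as F

def fqi_squared_loss(q_pred, bellman_target):
    return F.mse_loss(q_pred, bellman_target)

def fqi_log_loss(q_pred, bellman_target, eps=1e-6):
    # Assumes values are normalized so both prediction and target lie in [0, 1].
    q_pred = q_pred.clamp(eps, 1.0 - eps)
    target = bellman_target.clamp(0.0, 1.0)
    return F.binary_cross_entropy(q_pred, target)
```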
arXiv Detail & Related papers (2024-03-08T15:30:58Z)
- Swapped goal-conditioned offline reinforcement learning [8.284193221280216]
We present a general offline reinforcement learning method called deterministic Q-advantage policy gradient (DQAPG).
In the experiments, DQAPG outperforms state-of-the-art goal-conditioned offline RL methods in a wide range of benchmark tasks.
arXiv Detail & Related papers (2023-02-17T13:22:40Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- When does return-conditioned supervised learning work for offline reinforcement learning? [51.899892382786526]
We study the capabilities and limitations of return-conditioned supervised learning.
We find that RCSL returns the optimal policy under a set of assumptions stronger than those needed for the more traditional dynamic programming-based algorithms.
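For reference, the RCSL objective analyzed in this line of work trains a policy by maximum likelihood on dataset actions conditioned on the state and the return-to-go, $\max_\theta \mathbb{E}_{\tau \sim \mathcal{D}}\big[\sum_t \log \pi_\theta(a_t \mid s_t, \hat{R}_t)\big]$ with $\hat{R}_t = \sum_{t' \ge t} r_{t'}$, and then conditions on a high target return at evaluation time.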
arXiv Detail & Related papers (2022-06-02T15:05:42Z)
- Value Penalized Q-Learning for Recommender Systems [30.704083806571074]
Scaling reinforcement learning to recommender systems (RS) is promising since maximizing the expected cumulative rewards for RL agents meets the objective of RS.
A key approach to this goal is offline RL, which aims to learn policies from logged data.
We propose Value Penalized Q-learning (VPQ), an uncertainty-based offline RL algorithm.
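One common way to realize an uncertainty-based value penalty, sketched below under the assumption of a $Q$-ensemble: shrink the Bellman target by the ensemble's disagreement. VPQ's exact uncertainty estimator and penalty may differ.
```python
# Hedged sketch of an uncertainty-penalized Bellman target using a Q-ensemble.
import torch

def penalized_target(q_ensemble, reward, next_state, next_action, gamma=0.99, beta=1.0):
    with torch.no_grad():
        q_next = torch.stack([q(next_state, next_action) for q in q_ensemble])  # [K, batch]
        mean, std = q_next.mean(dim=0), q_next.std(dim=0)
        return reward + gamma * (mean - beta * std)   # shrink uncertain values
```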
arXiv Detail & Related papers (2021-10-15T08:08:28Z)
- Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
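A minimal sketch of the down-weighting idea, assuming ensemble disagreement as the uncertainty signal: transitions whose bootstrapped targets look out-of-distribution contribute less to the TD loss. The inverse-variance weight below is illustrative; UWAC's own estimator may differ.
```python
# Hedged sketch of an uncertainty-weighted TD loss.
import torch

def uncertainty_weighted_td_loss(q_fn, target_q_ensemble, batch, gamma=0.99, beta=1.0):
    s, a, r, s2, a2 = batch                    # next action a2 assumed to come from the actor
    with torch.no_grad():
        targets = torch.stack([q(s2, a2) for q in target_q_ensemble])  # [K, batch]
        var = targets.var(dim=0)
        y = r + gamma * targets.mean(dim=0)
        w = beta / (var + beta)                # down-weight high-uncertainty targets
    td_error = (q_fn(s, a) - y) ** 2
    return (w * td_error).mean()
```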
arXiv Detail & Related papers (2021-05-17T20:16:46Z)
- SUNRISE: A Simple Unified Framework for Ensemble Learning in Deep Reinforcement Learning [102.78958681141577]
We present SUNRISE, a simple unified ensemble method, which is compatible with various off-policy deep reinforcement learning algorithms.
SUNRISE integrates two key ingredients: (a) ensemble-based weighted Bellman backups, which re-weight target Q-values based on uncertainty estimates from a Q-ensemble, and (b) an inference method that selects actions using the highest upper-confidence bounds for efficient exploration.
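Both ingredients are easy to sketch; the confidence-weight shape, the UCB coefficient, and the `q(states, actions)` callables below are illustrative assumptions.
```python
# Hedged sketch of (a) ensemble-weighted Bellman backups and (b) UCB action selection.
import torch

def weighted_bellman_loss(q_fn, target_ensemble, batch, gamma=0.99, temperature=10.0):
    s, a, r, s2, a2 = batch
    with torch.no_grad():
        tq = torch.stack([q(s2, a2) for q in target_ensemble])      # [K, batch]
        y = r + gamma * tq.mean(dim=0)
        w = torch.sigmoid(-tq.std(dim=0) * temperature) + 0.5       # confidence from disagreement
    return (w * (q_fn(s, a) - y) ** 2).mean()

def ucb_action(q_ensemble, state, candidate_actions, lam=1.0):
    # candidate_actions: [N, action_dim] proposals (e.g. sampled from an actor ensemble)
    qs = torch.stack([q(state.expand(len(candidate_actions), -1), candidate_actions)
                      for q in q_ensemble])                         # [K, N]
    ucb = qs.mean(dim=0) + lam * qs.std(dim=0)
    return candidate_actions[ucb.argmax()]
```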
arXiv Detail & Related papers (2020-07-09T17:08:44Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
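The conservative regularizer behind these results is a log-sum-exp penalty added to the usual TD loss; a simplified discrete-action sketch (online network used as its own target for brevity) follows, with the coefficient alpha and the target construction as illustrative choices.
```python
# Hedged sketch of a CQL-style conservative loss for a discrete-action critic.
import torch
import torch.nn.functional as F

def cql_loss(q_net, batch, gamma=0.99, alpha=1.0):
    s, a, r, s2, done = batch                     # a: [batch] integer dataset actions
    q_all = q_net(s)                              # [batch, n_actions]
    q_data = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + gamma * (1 - done) * q_net(s2).max(dim=1).values
    td_loss = F.mse_loss(q_data, target)
    # Push down Q over all actions, push up Q on dataset actions.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()
    return td_loss + alpha * conservative
```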
arXiv Detail & Related papers (2020-06-08T17:53:42Z)
- Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.
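A rough sketch in the spirit of a distributional critic: output a Gaussian over returns and train it by negative log-likelihood of bootstrapped target samples instead of regressing a point estimate. The Gaussian parameterization and target construction are illustrative assumptions, not DSAC's exact algorithm.
```python
# Hedged sketch of a Gaussian return-distribution critic trained by NLL.
import torch
import torch.nn as nn

class GaussianCritic(nn.Module):
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),                     # mean and log-std of the return
        )

    def forward(self, state, action):
        mean, log_std = self.net(torch.cat([state, action], dim=-1)).chunk(2, dim=-1)
        return mean.squeeze(-1), log_std.clamp(-5, 2).squeeze(-1)

def critic_nll_loss(critic, state, action, target_return_sample):
    mean, log_std = critic(state, action)
    dist = torch.distributions.Normal(mean, log_std.exp())
    return -dist.log_prob(target_return_sample).mean()
```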
arXiv Detail & Related papers (2020-01-09T02:27:18Z)