Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression
- URL: http://arxiv.org/abs/2511.11973v1
- Date: Sat, 15 Nov 2025 01:10:05 GMT
- Title: Quantile Q-Learning: Revisiting Offline Extreme Q-Learning with Quantile Regression
- Authors: Xinming Gao, Shangzhe Li, Yujin Cai, Wenwu Yu
- Abstract summary: Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction. Extreme $Q$-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem. We propose a principled method to estimate the temperature coefficient $\beta$ via quantile regression under mild assumptions.
- Score: 14.037591273612788
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Offline reinforcement learning (RL) enables policy learning from fixed datasets without further environment interaction, making it particularly valuable in high-risk or costly domains. Extreme $Q$-Learning (XQL) is a recent offline RL method that models Bellman errors using the Extreme Value Theorem, yielding strong empirical performance. However, XQL and its stabilized variant MXQL suffer from notable limitations: both require extensive hyperparameter tuning specific to each dataset and domain, and also exhibit instability during training. To address these issues, we propose a principled method to estimate the temperature coefficient $\beta$ via quantile regression under mild assumptions. To further improve training stability, we introduce a value regularization technique with mild generalization, inspired by recent advances in constrained value learning. Experimental results demonstrate that the proposed algorithm achieves competitive or superior performance across a range of benchmark tasks, including D4RL and NeoRL2, while maintaining stable training dynamics and using a consistent set of hyperparameters across all datasets and domains.
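The paper's exact estimator is not spelled out in this summary, but the identity it rests on is standard: if Bellman errors follow a Gumbel$(\mu, \beta)$ distribution, the quantile function is $Q(p) = \mu - \beta \ln(-\ln p)$, so $\beta$ appears as the slope of the empirical quantiles regressed against $-\ln(-\ln p)$. A minimal sketch of that idea (function names are illustrative, not taken from the paper):

```python
import numpy as np

def estimate_gumbel_beta(errors, probs=None):
    """Estimate the Gumbel scale (temperature) beta from samples.

    If x ~ Gumbel(mu, beta), its quantile function is
        Q(p) = mu - beta * ln(-ln p),
    so empirical quantiles are linear in -ln(-ln p) with slope beta.
    """
    if probs is None:
        probs = np.linspace(0.05, 0.95, 19)
    q = np.quantile(errors, probs)    # empirical quantiles
    x = -np.log(-np.log(probs))       # Gumbel "plotting positions"
    # Least-squares line through (x, q); the slope estimates beta.
    slope, _intercept = np.polyfit(x, q, 1)
    return slope

# Example: recover beta from synthetic Gumbel draws.
rng = np.random.default_rng(0)
samples = rng.gumbel(loc=0.0, scale=2.0, size=10_000)
print(estimate_gumbel_beta(samples))  # ~2.0
```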
Related papers
- DiSA-IQL: Offline Reinforcement Learning for Robust Soft Robot Control under Distribution Shifts [13.515728394180343]
We propose DiSA-IQL (Distribution-Shift-Aware Implicit Q-Learning), an extension of IQL that mitigates distribution shift by penalizing unreliable state-action pairs. Simulation results show that DiSA-IQL consistently outperforms baseline models, including Behavior Cloning (BC), Conservative Q-Learning (CQL), and vanilla IQL.
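The summary leaves the modulation scheme abstract; as a rough sketch, under the assumption that the penalty enters as a per-sample weight on IQL's expectile loss (the reliability score here is hypothetical, not the paper's construction):

```python
import numpy as np

def expectile_loss(u, tau=0.7):
    """IQL's asymmetric L2 loss: weight tau when u >= 0, 1 - tau otherwise."""
    weight = np.where(u >= 0, tau, 1.0 - tau)
    return weight * u ** 2

def modulated_expectile_loss(u, reliability, tau=0.7):
    """Down-weight residuals u on state-action pairs judged unreliable.
    `reliability` in [0, 1] is a hypothetical per-sample score."""
    return reliability * expectile_loss(u, tau)
```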
arXiv Detail & Related papers (2025-09-30T23:53:47Z)
- Physics-informed Value Learner for Offline Goal-Conditioned Reinforcement Learning [20.424372965054832]
We propose a *Physics-informed (Pi)* regularized loss for value learning, derived from the Eikonal Partial Differential Equation (PDE). Unlike generic gradient penalties that are primarily used to stabilize training, our formulation is grounded in continuous-time optimal control and encourages value functions to align with cost-to-go structures. The proposed regularizer is broadly compatible with temporal-difference-based value learning and can be integrated into existing offline GCRL algorithms.
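As a sketch of what an Eikonal-derived regularizer can look like (the paper's exact loss may differ; `target_norm` stands in for the reciprocal speed $1/c$ in $\|\nabla V\| = 1/c$):

```python
import torch

def eikonal_penalty(value_net, states, target_norm=1.0):
    """Penalize deviation of ||grad_s V(s)|| from a target slope.

    The Eikonal PDE ||grad V|| = 1/c ties the value function's gradient
    norm to a cost-to-go structure; `target_norm` stands in for 1/c.
    """
    states = states.clone().requires_grad_(True)
    v = value_net(states).sum()
    (grad,) = torch.autograd.grad(v, states, create_graph=True)
    return ((grad.norm(dim=-1) - target_norm) ** 2).mean()

# Usage sketch: total_loss = td_loss + lam * eikonal_penalty(V, batch_states)
```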
arXiv Detail & Related papers (2025-09-08T15:08:42Z)
- Provably Efficient Online RLHF with One-Pass Reward Modeling [70.82499103200402]
Reinforcement Learning from Human Feedback (RLHF) has shown remarkable success in aligning Large Language Models with human preferences. Online RLHF has emerged as a promising direction, enabling iterative data collection and refinement. We propose a one-pass reward modeling method that eliminates the need to store historical data and achieves constant-time updates per iteration.
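The constant-memory flavor is easy to illustrate: each preference pair is consumed once with an in-place update, then discarded. The sketch below uses a plain SGD step on a linear Bradley-Terry model, which is only a stand-in for the paper's one-pass update rule:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def one_pass_update(w, phi_chosen, phi_rejected, lr=0.1):
    """Single SGD step on the Bradley-Terry logistic loss for one
    preference pair; nothing is stored, so memory stays constant.
    """
    x = phi_chosen - phi_rejected   # feature difference
    p = sigmoid(w @ x)              # P(chosen preferred) under reward w
    return w + lr * (1.0 - p) * x   # gradient ascent on the log-likelihood
```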
arXiv Detail & Related papers (2025-02-11T02:36:01Z)
- Stabilizing Extreme Q-learning by Maclaurin Expansion [51.041889588036895]
Extreme Q-learning (XQL) employs a loss function based on the assumption that the Bellman error follows a Gumbel distribution.
It has demonstrated strong performance in both offline and online reinforcement learning settings.
We propose Maclaurin Expanded Extreme Q-learning (MXQL) to enhance stability.
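The Gumbel assumption leads XQL to a loss of the form $e^{z} - z - 1$ on scaled Bellman errors $z$, whose exponential term is the usual source of instability; replacing $e^{z}$ with a truncated Maclaurin series removes the explosion (at order 2 the loss reduces to $z^{2}/2$). A sketch, with the caveat that MXQL's exact formulation may differ in details:

```python
import numpy as np
from math import factorial

def gumbel_loss(z):
    """XQL-style Gumbel regression loss: exp(z) - z - 1, z = error / beta.
    The exponential blows up for large positive z."""
    return np.exp(z) - z - 1.0

def maclaurin_gumbel_loss(z, order=4):
    """Replace exp(z) with its Maclaurin series up to `order`.

    The constant and linear terms cancel against -z - 1, leaving
    sum_{k=2}^{order} z^k / k!; order=2 recovers plain squared error.
    """
    return sum(z ** k / factorial(k) for k in range(2, order + 1))

z = np.linspace(-2, 2, 5)
print(gumbel_loss(z))
print(maclaurin_gumbel_loss(z, order=2))  # = z**2 / 2
```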
arXiv Detail & Related papers (2024-06-07T12:43:17Z)
- Q-value Regularized Transformer for Offline Reinforcement Learning [70.13643741130899]
We propose a Q-value regularized Transformer (QT) to enhance the state-of-the-art in offline reinforcement learning (RL).
QT learns an action-value function and integrates a term maximizing action-values into the training loss of Conditional Sequence Modeling (CSM).
Empirical evaluations on D4RL benchmark datasets demonstrate the superiority of QT over traditional DP and CSM methods.
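A minimal sketch of how such a term can be attached to a sequence-modeling loss (the names and the weighting `alpha` are illustrative, not QT's exact recipe):

```python
def qt_style_loss(csm_loss, q_net, states, predicted_actions, alpha=0.5):
    """Total loss = sequence-modeling loss - alpha * mean Q(s, a_pred).

    Subtracting the mean Q-value of the predicted actions turns
    "maximize action-values" into a term the optimizer minimizes
    jointly with the CSM objective; alpha trades off the two.
    """
    return csm_loss - alpha * q_net(states, predicted_actions).mean()
```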
arXiv Detail & Related papers (2024-05-27T12:12:39Z)
- Stochastic Q-learning for Large Discrete Action Spaces [79.1700188160944]
In complex environments with discrete action spaces, effective decision-making is critical in reinforcement learning (RL).
We present value-based RL approaches which, as opposed to optimizing over the entire set of $n$ actions, only consider a variable set of actions, possibly as small as $\mathcal{O}(\log(n))$.
The presented value-based RL methods include, among others, Stochastic Q-learning, StochDQN, and StochDDQN, all of which integrate this approach for both value-function updates and action selection.
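The key primitive is replacing the exact $\arg\max$ over all $n$ actions with a maximization over a small random subset. A minimal sketch (the paper additionally mixes in recently exploited actions, omitted here):

```python
import math
import random

def stoch_argmax(q_values):
    """Approximate argmax over n actions using a random subset of
    size ~log2(n) instead of scanning all n Q-values."""
    n = len(q_values)
    k = max(1, math.ceil(math.log2(n)))
    subset = random.sample(range(n), k)
    return max(subset, key=lambda a: q_values[a])

q = [0.1, 0.7, 0.3, 0.9, 0.2, 0.5, 0.8, 0.4]
print(stoch_argmax(q))  # argmax over a random O(log n) subset
```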
arXiv Detail & Related papers (2024-05-16T17:58:44Z)
- Action-Quantized Offline Reinforcement Learning for Robotic Skill Learning [68.16998247593209]
The offline reinforcement learning (RL) paradigm provides a recipe for converting static behavior datasets into policies that can perform better than the policy that collected the data.
In this paper, we propose an adaptive scheme for action quantization.
We show that several state-of-the-art offline RL methods such as IQL, CQL, and BRAC improve in performance on benchmarks when combined with our proposed discretization scheme.
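The adaptive scheme itself is learned in the paper; a simple data-driven stand-in that captures the spirit, placing more bins where the behavior data is dense, is 1-D k-means over the dataset's actions:

```python
import numpy as np

def fit_quantizer(actions, k=16, iters=20, seed=0):
    """Data-driven 1-D action quantizer via plain k-means.

    This is only a stand-in for the paper's learned adaptive
    discretization; like it, it concentrates bins in dense regions
    of the behavior data.
    """
    actions = np.asarray(actions, dtype=float)
    rng = np.random.default_rng(seed)
    centers = rng.choice(actions, size=k, replace=False)
    for _ in range(iters):
        assign = np.abs(actions[:, None] - centers[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = actions[assign == j].mean()
    return np.sort(centers)

def quantize(a, centers):
    """Snap a continuous action to its nearest learned bin center."""
    return centers[np.abs(centers - a).argmin()]
```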
arXiv Detail & Related papers (2023-10-18T06:07:10Z)
- Robust Reinforcement Learning using Offline Data [23.260211453437055]
We propose a robust reinforcement learning algorithm called Robust Fitted Q-Iteration (RFQI).
RFQI uses only an offline dataset to learn the optimal robust policy.
We prove that RFQI learns a near-optimal robust policy under standard assumptions.
arXiv Detail & Related papers (2022-08-10T03:47:45Z)
- Conservative Q-Learning for Offline Reinforcement Learning [106.05582605650932]
We show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return.
We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees.
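The conservative term is the recognizable core: push Q-values down on broadly sampled actions (a log-sum-exp over the action space, here approximated with uniform samples) and up on dataset actions. A sketch under those simplifications (the full CQL(H) objective adds importance weights and policy samples):

```python
import torch

def cql_penalty(q_net, states, data_actions, action_dim, n_samples=10):
    """Conservative regularizer: logsumexp of Q over sampled actions
    minus mean Q on dataset actions. Assumes actions live in [-1, 1]."""
    batch, state_dim = states.shape
    rand_a = torch.rand(batch, n_samples, action_dim) * 2 - 1
    s_rep = states.unsqueeze(1).expand(-1, n_samples, -1)
    q_rand = q_net(s_rep.reshape(-1, state_dim),
                   rand_a.reshape(-1, action_dim)).reshape(batch, n_samples)
    push_down = torch.logsumexp(q_rand, dim=1).mean()  # OOD actions
    push_up = q_net(states, data_actions).mean()       # dataset actions
    return push_down - push_up

# Usage sketch: critic_loss = td_loss + alpha * cql_penalty(Q, s, a, act_dim)
```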
arXiv Detail & Related papers (2020-06-08T17:53:42Z)