Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration
- URL: http://arxiv.org/abs/2310.19805v4
- Date: Tue, 28 May 2024 07:57:57 GMT
- Title: Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration
- Authors: Ziqi Zhang, Xiao Xiong, Zifeng Zhuang, Jinxin Liu, Donglin Wang,
- Abstract summary: We study how to fine-tune policies pre-trained with offline reinforcement learning (RL).
We propose Q conditioned state entropy (QCSE) as an intrinsic reward.
We observe significant improvements with QCSE (about 13% for CQL and 8% for Cal-QL).
- Score: 29.891468119032
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Studying how to fine-tune policies pre-trained with offline reinforcement learning (RL) is important for improving the sample efficiency of RL algorithms. However, directly fine-tuning pre-trained policies often results in sub-optimal performance, primarily because of the distribution shift between the offline pre-training and online fine-tuning stages. Specifically, this shift limits the acquisition of effective online samples and ultimately hurts online fine-tuning performance. To narrow the distribution shift between the offline and online stages, we propose Q conditioned state entropy (QCSE) as an intrinsic reward. Specifically, QCSE maximizes the state entropy of each sample individually, conditioned on its respective Q value. This approach encourages exploration of low-frequency samples while penalizing high-frequency ones, and implicitly achieves State Marginal Matching (SMM), thereby ensuring optimal performance and avoiding the asymptotic sub-optimality of constraint-based approaches. Additionally, QCSE integrates seamlessly into various RL algorithms, enhancing online fine-tuning performance. To validate our claim, we conduct extensive experiments and observe significant improvements with QCSE (about 13% for CQL and 8% for Cal-QL). Furthermore, experiments with other algorithms confirm the generality of QCSE.
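The abstract does not spell out how the Q conditioned entropy is computed, but state entropy in RL is often estimated with a particle-based (k-nearest-neighbour) estimator. The sketch below illustrates one possible reading of QCSE under explicit assumptions: the kNN estimator, the Q-value binning used for conditioning, and every name (`qcse_intrinsic_reward`, `k`, `n_bins`) are hypothetical and not taken from the paper.

```python
import numpy as np

def qcse_intrinsic_reward(states, q_values, k=5, n_bins=10):
    """Hypothetical sketch of a Q-conditioned state-entropy intrinsic reward.

    Each state's reward is a particle-based (k-nearest-neighbour) entropy
    estimate computed only among states whose Q values fall in the same bin,
    so exploration is encouraged within each Q level rather than globally.
    The binning scheme and all names are illustrative assumptions.
    """
    states = np.asarray(states, dtype=np.float64)      # (N, state_dim)
    q_values = np.asarray(q_values, dtype=np.float64)  # (N,)
    rewards = np.zeros(len(states))

    # Condition on Q by grouping states into Q-value bins.
    edges = np.linspace(q_values.min(), q_values.max(), n_bins)
    bins = np.digitize(q_values, edges)
    for b in np.unique(bins):
        idx = np.where(bins == b)[0]
        group = states[idx]
        if len(group) <= k:
            continue  # too few particles for a kNN estimate
        # Pairwise distances within the bin; k-th nearest neighbour per state.
        dists = np.linalg.norm(group[:, None, :] - group[None, :, :], axis=-1)
        knn = np.sort(dists, axis=1)[:, k]  # column 0 is the distance to itself
        # Particle entropy estimate: rare states (large kNN distance) get a
        # larger intrinsic reward, frequent states a smaller one.
        rewards[idx] = np.log(knn + 1.0)
    return rewards
```

In use, such a reward would typically be scaled and added to the environment reward during online fine-tuning (e.g. `r_total = r_env + beta * r_intrinsic`), so rarely visited states within a given Q level receive a larger exploration bonus while frequently visited ones are implicitly penalized.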
Related papers
- Online Statistical Inference for Time-varying Sample-averaged Q-learning [2.2374171443798034]
This paper introduces a time-varying batch-averaged Q-learning method, termed sample-averaged Q-learning.
We develop a novel framework that provides insights into the normality of the sample-averaged algorithm under mild conditions.
Numerical experiments conducted on classic OpenAI Gym environments show that the time-varying sample-averaged Q-learning method consistently outperforms both single-sample and constant-batch Q-learning.
arXiv Detail & Related papers (2024-10-14T17:17:19Z)
- A Perspective of Q-value Estimation on Offline-to-Online Reinforcement Learning [54.48409201256968]
Offline-to-online Reinforcement Learning (O2O RL) aims to improve the performance of an offline pre-trained policy using only a few online samples.
Most O2O methods focus on the balance between the RL objective and pessimism, or on the utilization of offline and online samples.
arXiv Detail & Related papers (2023-12-12T19:24:35Z)
- Projected Off-Policy Q-Learning (POP-QL) for Stabilizing Offline Reinforcement Learning [57.83919813698673]
Projected Off-Policy Q-Learning (POP-QL) is a novel actor-critic algorithm that simultaneously reweights off-policy samples and constrains the policy to prevent divergence and reduce value-approximation error.
In our experiments, POP-QL not only shows competitive performance on standard benchmarks, but also outperforms competing methods in tasks where the data-collection policy is significantly sub-optimal.
arXiv Detail & Related papers (2023-11-25T00:30:58Z)
- Offline Minimax Soft-Q-learning Under Realizability and Partial Coverage [100.8180383245813]
We propose value-based algorithms for offline reinforcement learning (RL).
We show an analogous result for vanilla Q-functions under a soft margin condition.
Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
arXiv Detail & Related papers (2023-02-05T14:22:41Z)
- The Efficacy of Pessimism in Asynchronous Q-Learning [17.193902915070506]
We develop an algorithmic framework that incorporates the principle of pessimism into asynchronous Q-learning.
This framework leads to, among other things, improved sample efficiency and enhanced adaptivity in the presence of near-expert data.
Our results deliver the first theoretical support for the use of the pessimism principle in the presence of Markovian non-i.i.d. data.
arXiv Detail & Related papers (2022-03-14T17:59:01Z)
- Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity [51.476337785345436]
We study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes.
A variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity.
arXiv Detail & Related papers (2022-02-28T15:39:36Z)
- Aggressive Q-Learning with Ensembles: Achieving Both High Sample Efficiency and High Asymptotic Performance [12.871109549160389]
We propose a novel model-free algorithm, Aggressive Q-Learning with Ensembles (AQE), which improves the sample efficiency of REDQ and the performance of TQC.
AQE is very simple, requiring neither distributional representation of critics nor target randomization.
arXiv Detail & Related papers (2021-11-17T14:48:52Z)
- Online Target Q-learning with Reverse Experience Replay: Efficiently finding the Optimal Policy for Linear MDPs [50.75812033462294]
We bridge the gap between the practical success of Q-learning and pessimistic theoretical results.
We present novel methods Q-Rex and Q-RexDaRe.
We show that Q-Rex efficiently finds the optimal policy for linear MDPs.
arXiv Detail & Related papers (2021-10-16T01:47:41Z)
- Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble [16.92791301062903]
We propose an uncertainty-based offline RL method that takes into account the confidence of the Q-value prediction and does not require any estimation or sampling of the data distribution.
Surprisingly, we find that it is possible to substantially outperform existing offline RL methods on various tasks by simply increasing the number of Q-networks along with clipped Q-learning (see the sketch after this entry).
arXiv Detail & Related papers (2021-10-04T16:40:13Z)
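The entry above attributes the gains to a larger Q-ensemble combined with clipped Q-learning. Below is a minimal sketch of such a clipped-ensemble Bellman target; the function name, array shapes, and the NumPy setting are assumptions for illustration, not that paper's implementation.

```python
import numpy as np

def clipped_ensemble_target(next_q_estimates, rewards, dones, gamma=0.99):
    """Sketch of a clipped Q-learning target with an ensemble of N critics.

    next_q_estimates: array of shape (N, batch) holding each critic's
    estimate of Q(s', a') for the sampled next state-action pairs.
    Taking the minimum over the ensemble penalizes state-actions the
    critics disagree on, i.e. an uncertainty-based form of pessimism.
    """
    next_q = np.asarray(next_q_estimates, dtype=np.float64)  # (N, batch)
    rewards = np.asarray(rewards, dtype=np.float64)          # (batch,)
    dones = np.asarray(dones, dtype=np.float64)              # (batch,) in {0, 1}
    clipped_q = next_q.min(axis=0)                           # pessimistic ensemble estimate
    return rewards + gamma * (1.0 - dones) * clipped_q       # Bellman backup target
```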
- Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm, aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network (see the sketch after this entry).
arXiv Detail & Related papers (2020-09-29T04:58:17Z)
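Based only on the summary above, cross Q-learning decouples action selection from evaluation across a set of parallel models, generalizing double Q-learning. The tabular sketch below is one possible reading; the use of Q-tables, the random pairing, and all names are illustrative assumptions rather than that paper's actual algorithm.

```python
import numpy as np

def cross_q_target(q_tables, next_state, reward, done, gamma=0.99, rng=None):
    """Sketch of a cross-Q-style target in the tabular setting.

    q_tables: array of shape (K, n_states, n_actions) with K >= 2 parallel
    Q-tables. One randomly chosen table picks the greedy action and a
    second, independently chosen table evaluates it, extending the
    two-estimator decoupling of double Q-learning to a larger ensemble.
    """
    rng = rng or np.random.default_rng()
    q_tables = np.asarray(q_tables, dtype=np.float64)
    i, j = rng.choice(len(q_tables), size=2, replace=False)
    greedy_action = int(np.argmax(q_tables[i, next_state]))   # selection by table i
    evaluation = q_tables[j, next_state, greedy_action]       # evaluation by table j
    return reward + gamma * (1.0 - float(done)) * evaluation  # target for the updated table
```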