SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks
- URL: http://arxiv.org/abs/2501.03676v2
- Date: Sun, 12 Jan 2025 09:40:44 GMT
- Title: SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks
- Authors: Zheng Chun
- Abstract summary: We propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC.
Our algorithm achieves higher convergence speed, stability, and performance compared to existing methods.
- Score: 0.0
- Abstract: In this work, we build upon the offline reinforcement learning algorithm TD7, which incorporates State-Action Learned Embeddings (SALE) and a prioritized experience replay buffer (LAP). We propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC. The ensemble Q-networks introduce penalties to guide the actor network toward in-distribution actions, effectively addressing the challenge of out-of-distribution actions. Meanwhile, the gradient diversity penalty encourages diverse Q-value gradients, further suppressing overestimation for out-of-distribution actions. Additionally, our method retains an adjustable behavior cloning (BC) term that directs the actor network toward dataset actions during early training stages, while gradually reducing its influence as the precision of the Q-ensemble improves. These enhancements work synergistically to improve the stability and precision of the training. Experimental results on the D4RL MuJoCo benchmarks demonstrate that our algorithm achieves higher convergence speed, stability, and performance compared to existing methods.
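Below is a minimal PyTorch sketch of the objectives described in the abstract: an ensemble of Q-networks whose pessimistic minimum steers the actor toward in-distribution actions, an EDAC-style gradient-diversity penalty computed from per-network action gradients, and an actor loss with an adjustable behavior-cloning (BC) term. All layer sizes, the coefficients eta and bc_weight, and the helper names (QEnsemble, diversity_penalty, critic_loss, actor_loss) are illustrative assumptions, not the authors' implementation, which additionally builds on TD7's SALE embeddings and the LAP replay buffer (omitted here).

```python
# Illustrative sketch only: ensemble critics + EDAC-style diversity penalty + BC term.
# Sizes, coefficients, and helper names are assumptions, not the paper's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QEnsemble(nn.Module):
    """N independent Q-networks evaluated on the same (state, action) batch."""
    def __init__(self, state_dim, action_dim, n_nets=10, hidden=256):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            ) for _ in range(n_nets)
        ])

    def forward(self, state, action):
        # state: (B, S); action: (B, A) or (N, B, A) -> per-network Q-values (N, B, 1)
        n = len(self.nets)
        if action.dim() == 2:
            action = action.unsqueeze(0).expand(n, -1, -1)
        sa = torch.cat([state.unsqueeze(0).expand(n, -1, -1), action], dim=-1)
        return torch.stack([net(sa[i]) for i, net in enumerate(self.nets)], dim=0)

def diversity_penalty(q_ens, state, action):
    """Average pairwise cosine similarity of dQ_i/da across ensemble members (EDAC-style)."""
    n = len(q_ens.nets)
    a_rep = action.detach().unsqueeze(0).repeat(n, 1, 1).requires_grad_(True)  # (N, B, A)
    q = q_ens(state, a_rep)                                                    # (N, B, 1)
    grads = torch.autograd.grad(q.sum(), a_rep, create_graph=True)[0]          # (N, B, A)
    g = F.normalize(grads, dim=-1)
    sim = torch.einsum("nba,mba->bnm", g, g)                                   # (B, N, N)
    off_diag = sim - torch.eye(n, device=sim.device)                           # zero the diagonal
    return off_diag.sum(dim=(1, 2)).mean() / (n * (n - 1))

def critic_loss(q_ens, state, action, target_q, eta=1.0):
    """TD regression for every ensemble member plus the diversity penalty, weighted by eta."""
    q = q_ens(state, action)                                    # (N, B, 1)
    td = ((q - target_q.unsqueeze(0)) ** 2).mean()
    return td + eta * diversity_penalty(q_ens, state, action)

def actor_loss(actor, q_ens, state, dataset_action, bc_weight):
    """Pessimistic ensemble objective plus an adjustable BC term toward dataset actions."""
    pi_action = actor(state)
    q_min = q_ens(state, pi_action).min(dim=0).values           # pessimistic value estimate
    bc = ((pi_action - dataset_action) ** 2).mean()             # behavior cloning regularizer
    return -q_min.mean() + bc_weight * bc
```

In training, bc_weight would typically be set high early on and decayed as the ensemble's value estimates sharpen, mirroring the schedule described in the abstract.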
Related papers
- Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization [15.605124749589946]
CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1.
We identify challenges in the training dynamics that become more pronounced at higher UTD ratios.
Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks.
arXiv Detail & Related papers (2025-02-11T12:55:32Z) - SPEQ: Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning [51.10866035483686]
Recent off-policy algorithms improve sample efficiency by increasing the Update-To-Data ratio and performing more gradient updates per environment interaction.
While this improves sample efficiency, it significantly increases computational cost due to the higher number of gradient updates required.
We propose a sample-efficient method to improve computational efficiency by separating training into distinct learning phases.
arXiv Detail & Related papers (2025-01-15T09:04:19Z) - Towards Continual Learning Desiderata via HSIC-Bottleneck Orthogonalization and Equiangular Embedding [55.107555305760954]
We propose a conceptually simple yet effective method that attributes forgetting to layer-wise parameter overwriting and the resulting decision boundary distortion.
Our method achieves competitive accuracy while requiring no exemplar buffer and only 1.02x the size of the base model.
arXiv Detail & Related papers (2024-01-17T09:01:29Z) - Implicit Stochastic Gradient Descent for Training Physics-informed Neural Networks [51.92362217307946]
Physics-informed neural networks (PINNs) have been shown to be effective at solving forward and inverse differential equation problems.
However, PINNs can become trapped in training failures when the target functions to be approximated exhibit high-frequency or multi-scale features.
In this paper, we propose to employ the implicit stochastic gradient descent (ISGD) method to train PINNs, improving the stability of the training process.
arXiv Detail & Related papers (2023-03-03T08:17:47Z) - Swapped goal-conditioned offline reinforcement learning [8.284193221280216]
We present a general offline reinforcement learning method called deterministic Q-advantage policy gradient (DQAPG).
In the experiments, DQAPG outperforms state-of-the-art goal-conditioned offline RL methods in a wide range of benchmark tasks.
arXiv Detail & Related papers (2023-02-17T13:22:40Z) - Actor Prioritized Experience Replay [0.0]
Prioritized Experience Replay (PER) allows agents to learn from transitions sampled with non-uniform probability proportional to their temporal-difference (TD) error.
We introduce a novel experience replay sampling framework for actor-critic methods that also addresses stability issues and recent findings behind the poor empirical performance of PER; a minimal sketch of TD-error-proportional sampling is given after this list.
An extensive set of experiments verifies our theoretical claims and demonstrates that the introduced method significantly outperforms the competing approaches.
arXiv Detail & Related papers (2022-09-01T15:27:46Z) - Simultaneous Double Q-learning with Conservative Advantage Learning for Actor-Critic Methods [133.85604983925282]
We propose Simultaneous Double Q-learning with Conservative Advantage Learning (SDQ-CAL).
Our algorithm realizes less biased value estimation and achieves state-of-the-art performance in a range of continuous control benchmark tasks.
arXiv Detail & Related papers (2022-05-08T09:17:16Z) - Uncertainty Weighted Actor-Critic for Offline Reinforcement Learning [63.53407136812255]
Offline Reinforcement Learning promises to learn effective policies from previously-collected, static datasets without the need for exploration.
Existing Q-learning and actor-critic based off-policy RL algorithms fail when bootstrapping from out-of-distribution (OOD) actions or states.
We propose Uncertainty Weighted Actor-Critic (UWAC), an algorithm that detects OOD state-action pairs and down-weights their contribution in the training objectives accordingly.
arXiv Detail & Related papers (2021-05-17T20:16:46Z) - Cross Learning in Deep Q-Networks [82.20059754270302]
We propose a novel cross Q-learning algorithm aimed at alleviating the well-known overestimation problem in value-based reinforcement learning methods.
Our algorithm builds on double Q-learning by maintaining a set of parallel models and estimating the Q-value based on a randomly selected network.
arXiv Detail & Related papers (2020-09-29T04:58:17Z) - Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors [13.534873779043478]
We present a distributional soft actor-critic (DSAC) algorithm to improve the policy performance by mitigating Q-value overestimations.
We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving state-of-the-art performance.
arXiv Detail & Related papers (2020-01-09T02:27:18Z)
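As a companion to the prioritized replay mechanisms referenced above (the LAP buffer in the main abstract and the sampler discussed in the Actor Prioritized Experience Replay entry), here is a minimal sketch of TD-error-proportional sampling with a LAP-style priority floor. The buffer layout, the alpha exponent, and the min_priority floor are assumptions made for illustration, not a reproduction of either paper.

```python
# Illustrative sketch of TD-error-proportional prioritized sampling with a priority floor.
import numpy as np

class PrioritizedBuffer:
    def __init__(self, capacity, alpha=0.4, min_priority=1.0):
        self.capacity, self.alpha, self.min_priority = capacity, alpha, min_priority
        self.storage, self.priorities, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        # New transitions get the current maximum priority so they are sampled soon.
        max_p = self.priorities.max() if self.storage else self.min_priority
        if len(self.storage) < self.capacity:
            self.storage.append(transition)
        else:
            self.storage[self.pos] = transition
        self.priorities[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size):
        # Sampling probability is proportional to priority^alpha.
        p = self.priorities[:len(self.storage)] ** self.alpha
        probs = p / p.sum()
        idx = np.random.choice(len(self.storage), batch_size, p=probs)
        return idx, [self.storage[i] for i in idx]

    def update_priorities(self, idx, td_errors):
        # LAP-style floor: priorities never drop below min_priority, so transitions
        # with small TD errors are still sampled (near-)uniformly.
        self.priorities[idx] = np.maximum(np.abs(td_errors), self.min_priority)
```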