Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization
- URL: http://arxiv.org/abs/2502.07523v1
- Date: Tue, 11 Feb 2025 12:55:32 GMT
- Title: Scaling Off-Policy Reinforcement Learning with Batch and Weight Normalization
- Authors: Daniel Palenicek, Florian Vogt, Jan Peters
- Abstract summary: CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1.
We identify challenges in the training dynamics, which are emphasized by higher UTD ratios.
Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks.
- Score: 15.605124749589946
- Abstract: Reinforcement learning has achieved significant milestones, but sample efficiency remains a bottleneck for real-world applications. Recently, CrossQ has demonstrated state-of-the-art sample efficiency with a low update-to-data (UTD) ratio of 1. In this work, we explore CrossQ's scaling behavior with higher UTD ratios. We identify challenges in the training dynamics, which are emphasized by higher UTD ratios. To address these, we integrate weight normalization into the CrossQ framework, a solution that stabilizes training, has been shown to prevent potential loss of plasticity, and keeps the effective learning rate constant. Our proposed approach reliably scales with increasing UTD ratios, achieving competitive performance across 25 challenging continuous control tasks on the DeepMind Control Suite and MyoSuite benchmarks, notably the complex dog and humanoid environments. This work eliminates the need for drastic interventions, such as network resets, and offers a simple yet robust pathway for improving sample efficiency and scalability in model-free reinforcement learning.
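The abstract names two concrete mechanisms: CrossQ's batch-normalized critic trained without a target network, and the weight normalization added on top. Below is a minimal PyTorch sketch of how the two can be combined; the layer sizes, the use of plain BatchNorm1d (CrossQ itself uses a batch-renormalization variant), and the `policy` interface are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torch.nn.utils import weight_norm


def wn_linear(n_in, n_out):
    # Weight-normalized linear layer: decouples weight direction from
    # magnitude, which helps keep the effective learning rate constant.
    return weight_norm(nn.Linear(n_in, n_out))


class CrossQCritic(nn.Module):
    """Batch-normalized Q-network with weight-normalized layers (a sketch)."""

    def __init__(self, obs_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            wn_linear(obs_dim + act_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            wn_linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            wn_linear(hidden, 1),
        )

    def forward(self, obs, act):
        return self.net(torch.cat([obs, act], dim=-1))


def td_targets(critic, policy, batch, gamma=0.99):
    # CrossQ trick: no target network. Current and next state-action pairs
    # go through the critic in ONE forward pass, so the BatchNorm statistics
    # cover both distributions.
    obs, act, rew, next_obs, done = batch
    with torch.no_grad():
        next_act = policy(next_obs)  # `policy` is a hypothetical actor network
    q_both = critic(torch.cat([obs, next_obs], dim=0),
                    torch.cat([act, next_act], dim=0))
    q, q_next = q_both.chunk(2, dim=0)
    target = rew + gamma * (1.0 - done) * q_next.detach()
    return q, target
```

The higher the UTD ratio, the more gradient steps like this run per environment step; the paper's claim is that the weight-normalized variant keeps those repeated updates stable.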
Related papers
- SPEQ: Stabilization Phases for Efficient Q-Learning in High Update-To-Data Ratio Reinforcement Learning [51.10866035483686]
Recent off-policy algorithms improve sample efficiency by increasing the Update-To-Data ratio and performing more gradient updates per environment interaction.
While this improves sample efficiency, it significantly increases computational cost due to the higher number of gradient updates required.
We propose a sample-efficient method to improve computational efficiency by separating training into distinct learning phases.
arXiv Detail & Related papers (2025-01-15T09:04:19Z)
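A hedged sketch of the phase structure SPEQ's summary describes: cheap online phases at UTD = 1 alternate with offline stabilization phases that run many extra Q-updates on the frozen replay buffer, decoupling the expensive gradient work from environment interaction. The gym-style `env`, the `agent`/`buffer` interfaces, and all phase lengths are hypothetical.

```python
def train_speq_style(env, agent, buffer, total_steps=100_000,
                     online_steps=5_000, stabilization_updates=50_000):
    obs = env.reset()
    for step in range(total_steps):
        act = agent.act(obs)
        next_obs, rew, done, _ = env.step(act)
        buffer.add(obs, act, rew, next_obs, done)
        obs = env.reset() if done else next_obs

        # Online phase: a single cheap update per environment step (UTD = 1).
        agent.update_critic(buffer.sample())

        if (step + 1) % online_steps == 0:
            # Stabilization phase: many gradient updates, no new interaction.
            for _ in range(stabilization_updates):
                agent.update_critic(buffer.sample())
```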
- SALE-Based Offline Reinforcement Learning with Ensemble Q-Networks [0.0]
We propose a model-free actor-critic algorithm that integrates ensemble Q-networks and a gradient diversity penalty from EDAC.
Our algorithm achieves faster convergence, greater stability, and better performance than existing methods.
arXiv Detail & Related papers (2025-01-07T10:22:30Z)
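The EDAC-style gradient diversity penalty mentioned above can be sketched as follows: the pairwise cosine similarity between ensemble members' action-gradients of Q is penalized, pushing the Q-networks to disagree in different directions around the data. This is an illustrative reading, not the authors' code; `critics` is a hypothetical list of Q-networks mapping (obs, act) to a value.

```python
import torch


def gradient_diversity_penalty(critics, obs, act):
    # Normalized dQ_i/da for every ensemble member i.
    act = act.detach().requires_grad_(True)
    grads = []
    for q in critics:
        g, = torch.autograd.grad(q(obs, act).sum(), act, create_graph=True)
        grads.append(g / (g.norm(dim=-1, keepdim=True) + 1e-8))
    grads = torch.stack(grads, dim=1)                  # (batch, N, act_dim)

    # Pairwise cosine similarities; minimizing them maximizes diversity.
    cos = torch.einsum("bid,bjd->bij", grads, grads)   # (batch, N, N)
    n = len(critics)
    off_diag = cos.sum(dim=(1, 2)) - cos.diagonal(dim1=1, dim2=2).sum(-1)
    return (off_diag / (n * (n - 1))).mean()           # add to the critic loss
```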
- MAD-TD: Model-Augmented Data stabilizes High Update Ratio RL [20.22674077197914]
Recent work has explored updating neural networks with large numbers of gradient steps for every new sample.
High update-to-data ratios introduce instability to the training process.
Our method, Model-Augmented Data for Temporal Difference learning (MAD-TD), uses small amounts of generated data to stabilize high UTD training.
arXiv Detail & Related papers (2024-10-11T15:13:17Z)
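A hedged sketch of the MAD-TD idea as summarized above: at high UTD ratios, each critic update mixes a small fraction of model-generated transitions into the real batch so the value function does not overfit the replayed data. The mixing fraction and the `buffer`/`world_model.rollout` interfaces are assumptions for illustration.

```python
def mad_td_batch(buffer, world_model, policy, batch_size=256, gen_frac=0.05):
    # Mostly real data, plus a few synthetic one-step transitions generated
    # from real starting states by a learned world model.
    n_gen = int(batch_size * gen_frac)
    real = buffer.sample(batch_size - n_gen)
    start = buffer.sample(n_gen)
    gen = world_model.rollout(start.obs, policy, horizon=1)
    return real.concat(gen)  # hypothetical batch-concatenation helper


def high_utd_update(agent, buffer, world_model, policy, utd=8):
    # Several gradient steps per environment step, each on a mixed batch.
    for _ in range(utd):
        agent.update_critic(mad_td_batch(buffer, world_model, policy))
```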
- PYRA: Parallel Yielding Re-Activation for Training-Inference Efficient Task Adaptation [61.57833648734164]
We propose a novel Parallel Yielding Re-Activation (PYRA) method for training-inference efficient task adaptation.
PYRA outperforms all competing methods at both low and high compression rates.
arXiv Detail & Related papers (2024-03-14T09:06:49Z)
- Take the Bull by the Horns: Hard Sample-Reweighted Continual Training Improves LLM Generalization [165.98557106089777]
A key challenge is to enhance the capabilities of large language models (LLMs) amid a looming shortage of high-quality training data.
Our study starts from an empirical strategy for the light continual training of LLMs using their original pre-training data sets.
We then formalize this strategy into a principled framework of Instance-Reweighted Distributionally Robust Optimization.
arXiv Detail & Related papers (2024-02-22T04:10:57Z)
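Instance-Reweighted Distributionally Robust Optimization, as the summary describes it, can be sketched with the standard closed form of a KL-regularized DRO inner problem: harder samples (higher loss) receive exponentially larger weights. The temperature and the per-sequence loss granularity are assumed, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def reweighted_lm_loss(logits, targets, tau=1.0):
    # Per-sequence cross-entropy: (batch, seq, vocab) logits -> (batch,) losses.
    per_token = F.cross_entropy(logits.transpose(1, 2), targets,
                                reduction="none")
    per_seq = per_token.mean(dim=1)
    # Softmax over detached losses: the closed-form DRO weights, which
    # upweight hard samples without backpropagating through the weights.
    weights = torch.softmax(per_seq.detach() / tau, dim=0)
    return (weights * per_seq).sum()
```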
- Augmenting Unsupervised Reinforcement Learning with Self-Reference [63.68018737038331]
Humans possess the ability to draw on past experiences explicitly when learning new tasks.
We propose the Self-Reference (SR) approach, an add-on module explicitly designed to leverage historical information.
Our approach achieves state-of-the-art results in terms of Interquartile Mean (IQM) performance and Optimality Gap reduction on the Unsupervised Reinforcement Learning Benchmark.
arXiv Detail & Related papers (2023-11-16T09:07:34Z)
- Dynamic Update-to-Data Ratio: Minimizing World Model Overfitting [25.93711502488151]
We propose a new method that dynamically adjusts the update-to-data (UTD) ratio during training based on under- and overfitting detection.
We apply our method to DreamerV2, a state-of-the-art model-based reinforcement learning algorithm, and evaluate it on the DeepMind Control Suite and the Atari 100k benchmark.
arXiv Detail & Related papers (2023-03-17T17:29:02Z)
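A hedged sketch of the dynamic-UTD rule the summary describes: the loss on held-out recent data is compared with the training loss, and the UTD ratio is lowered when the gap signals overfitting and raised otherwise. The threshold, step size, and bounds are illustrative assumptions, not the paper's exact rule.

```python
def adjust_utd(utd, train_loss, val_loss, ratio_high=1.3, step=1,
               utd_min=1, utd_max=16):
    if val_loss / max(train_loss, 1e-8) > ratio_high:
        return max(utd_min, utd - step)  # overfitting: fewer updates per sample
    return min(utd_max, utd + step)      # underfitting: more updates per sample
```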
- Q-TART: Quickly Training for Adversarial Robustness and in-Transferability [28.87208020322193]
We propose to tackle performance, efficiency, and robustness jointly using our algorithm Q-TART.
Q-TART follows the intuition that samples highly susceptible to noise strongly affect the decision boundaries learned by deep neural networks.
We demonstrate improved performance and adversarial robustness while using only a subset of the training data.
arXiv Detail & Related papers (2022-04-14T15:23:08Z)
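The stated intuition, that noise-susceptible samples distort decision boundaries, suggests a simple scoring rule: measure how much each sample's loss moves under small input perturbations and keep only the least susceptible subset. This sketch is an assumed reading of that intuition, not the actual Q-TART algorithm; the noise scale and keep fraction are made up.

```python
import torch


@torch.no_grad()
def noise_susceptibility(model, loss_fn, x, y, sigma=0.05, trials=4):
    # `loss_fn` must return per-sample losses (reduction="none").
    clean = loss_fn(model(x), y)
    noisy = torch.stack([
        loss_fn(model(x + sigma * torch.randn_like(x)), y)
        for _ in range(trials)
    ]).mean(dim=0)
    return (noisy - clean).abs()  # higher = more susceptible to noise


def select_training_subset(scores, keep_frac=0.7):
    k = int(len(scores) * keep_frac)
    return scores.argsort()[:k]   # indices of the least susceptible samples
```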
- Softmax with Regularization: Better Value Estimation in Multi-Agent Reinforcement Learning [72.28520951105207]
Overestimation in Q-learning is an important problem that has been extensively studied in single-agent reinforcement learning.
We propose a novel regularization-based update scheme that penalizes large joint action-values deviating from a baseline.
We show that our method provides a consistent performance improvement on a set of challenging StarCraft II micromanagement tasks.
arXiv Detail & Related papers (2021-03-22T14:18:39Z)
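The update scheme the summary describes can be sketched in two parts: a softmax operator replaces the overestimating hard max when forming targets, and a penalty keeps joint action-values close to a baseline. The temperature, the penalty weight, and the choice of baseline are illustrative assumptions.

```python
import torch


def softmax_target(q_next, rew, done, beta=5.0, gamma=0.99):
    # Temperature-weighted expectation over next action values: softer than
    # the hard max, which is the usual source of overestimation.
    probs = torch.softmax(beta * q_next, dim=-1)
    v_next = (probs * q_next).sum(dim=-1)
    return rew + gamma * (1.0 - done) * v_next


def regularized_td_loss(q_taken, target, baseline, lam=0.1):
    td = (q_taken - target.detach()).pow(2).mean()
    # Penalize joint action-values that deviate far from the baseline.
    reg = (q_taken - baseline.detach()).pow(2).mean()
    return td + lam * reg
```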
- Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose.
We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
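One extrapolation variant that such a framework covers is the extragradient-style step: the gradient is evaluated at a point extrapolated along the current update direction before being applied from the original iterate. This is a generic sketch under that assumption; `grad_fn` and the coefficient `rho` are hypothetical.

```python
def extragradient_step(params, grad_fn, lr=0.1, rho=0.5):
    # 1) Look ahead: w_half = w - rho * lr * g(w).
    g = grad_fn(params)
    lookahead = [p - rho * lr * gi for p, gi in zip(params, g)]
    # 2) Update the ORIGINAL point with the gradient taken at the lookahead.
    g_half = grad_fn(lookahead)
    return [p - lr * gi for p, gi in zip(params, g_half)]
```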
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.