Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
- URL: http://arxiv.org/abs/2410.12598v3
- Date: Wed, 08 Oct 2025 12:58:01 GMT
- Title: Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach
- Authors: Henrique Donâncio, Antoine Barrier, Leah F. South, Florence Forbes
- Abstract summary: In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps.
- Score: 2.743898388459522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps. LRRL adaptively favors rates that improve returns, remaining robust even when the candidate set includes values that individually cause divergence. Across Atari and MuJoCo benchmarks, LRRL achieves performance competitive with or superior to tuned baselines and standard schedulers. Our findings position LRRL as a practical solution for adapting to non-stationary objectives in deep RL.
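The abstract frames learning-rate selection as a multi-armed bandit over a candidate set of rates, rewarded by policy performance. As a rough illustration only (the listing does not spell out LRRL's exact bandit rule or reward signal, so this is not the authors' algorithm), the sketch below uses a UCB1-style bandit whose arms are candidate learning rates and whose arm reward is the episodic return observed while training with that rate; `set_optimizer_lr` and `train_one_interval` are hypothetical placeholders for the user's own RL training code.

```python
import math
import random

class LearningRateBandit:
    """UCB1-style bandit over candidate learning rates (illustrative only)."""

    def __init__(self, candidate_lrs, exploration=2.0):
        self.lrs = list(candidate_lrs)       # candidate learning rates (arms)
        self.counts = [0] * len(self.lrs)    # pulls per arm
        self.values = [0.0] * len(self.lrs)  # running mean return per arm
        self.total = 0                       # total pulls
        self.c = exploration                 # exploration coefficient

    def select(self):
        # Pull each arm once before applying the UCB rule.
        for i, n in enumerate(self.counts):
            if n == 0:
                return i
        ucb = [self.values[i] + self.c * math.sqrt(math.log(self.total) / self.counts[i])
               for i in range(len(self.lrs))]
        return max(range(len(self.lrs)), key=lambda i: ucb[i])

    def update(self, arm, episode_return):
        # Incremental mean of the return observed under the chosen rate.
        self.counts[arm] += 1
        self.total += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]


# Hypothetical usage inside a training loop; the placeholder return stands in
# for the episodic return obtained after training with the selected rate.
bandit = LearningRateBandit([1e-5, 1e-4, 3e-4, 1e-3])
for interval in range(100):
    arm = bandit.select()
    lr = bandit.lrs[arm]
    # set_optimizer_lr(optimizer, lr); ret = train_one_interval(agent)
    ret = random.gauss(1000.0 * lr, 0.1)  # placeholder reward for illustration
    bandit.update(arm, ret)
best = max(range(len(bandit.lrs)), key=lambda i: bandit.values[i])
print("preferred learning rate:", bandit.lrs[best])
```

In a non-stationary setting like the one the abstract describes, a practical variant might weight recent returns more heavily, since the return distribution drifts as the policy improves.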
Related papers
- Learning in Context, Guided by Choice: A Reward-Free Paradigm for Reinforcement Learning with Transformers [55.33468902405567]
We propose a new learning paradigm, In-Context Preference-based Reinforcement Learning (ICPRL), in which both pretraining and deployment rely solely on preference feedback. ICPRL enables strong in-context generalization to unseen tasks, achieving performance comparable to ICRL methods trained with full reward supervision.
arXiv Detail & Related papers (2026-02-09T03:42:16Z) - CRL-VLA: Continual Vision-Language-Action Learning [40.18167835795084]
Continual Reinforcement Learning is a promising pathway for deploying VLA models in lifelong robotic scenarios. We introduce CRL-VLA, a framework for continual post-training of VLA models with rigorous theoretical bounds. We derive a unified performance bound linking the stability-plasticity trade-off to goal-conditioned advantage magnitude, scaled by policy divergence.
arXiv Detail & Related papers (2026-02-03T12:09:53Z) - A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z) - In-Context Reinforcement Learning through Bayesian Fusion of Context and Value Prior [53.21550098214227]
In-context reinforcement learning promises fast adaptation to unseen environments without parameter updates. We introduce SPICE, a Bayesian ICRL method that learns a prior over Q-values via deep ensemble and updates this prior at test-time. We prove that SPICE achieves regret-optimal behaviour in both bandits and finite-horizon MDPs, even when pretrained only on suboptimal trajectories.
arXiv Detail & Related papers (2026-01-06T13:41:31Z) - Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
arXiv Detail & Related papers (2025-12-01T07:45:39Z) - Human-in-the-loop Online Rejection Sampling for Robotic Manipulation [55.99788088622936]
Hi-ORS stabilizes value estimation by filtering out negatively rewarded samples during online fine-tuning. Hi-ORS fine-tunes a pi-base policy to master contact-rich manipulation in just 1.5 hours of real-world training.
arXiv Detail & Related papers (2025-10-30T11:53:08Z) - CurES: From Gradient Analysis to Efficient Curriculum Learning for Reasoning LLMs [53.749193998004166]
Curriculum learning plays a crucial role in enhancing the training efficiency of large language models. We propose CurES, an efficient training method that accelerates convergence and employs Bayesian posterior estimation to minimize computational overhead.
arXiv Detail & Related papers (2025-10-01T15:41:27Z) - Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training [121.5858973157225]
We investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks.
arXiv Detail & Related papers (2025-07-16T17:59:24Z) - AdaLRS: Loss-Guided Adaptive Learning Rate Search for Efficient Foundation Model Pretraining [12.630306478872043]
We propose AdaLRS, a plug-in-and-play adaptive learning rate search algorithm that conducts online optimal learning rate search. Experiments show that AdaLRS adjusts suboptimal learning rates to the neighborhood of optimum with marked efficiency and effectiveness.
arXiv Detail & Related papers (2025-06-16T09:14:01Z) - Diffusion Guidance Is a Controllable Policy Improvement Operator [98.11511661904618]
CFGRL is trained with the simplicity of supervised learning, yet can further improve on the policies in the data. On offline RL tasks, we observe a reliable trend -- increased guidance weighting leads to increased performance.
arXiv Detail & Related papers (2025-05-29T14:06:50Z) - Policy-labeled Preference Learning: Is Preference Enough for RLHF? [8.378137704007038]
We propose policy-labeled preference learning (PPL) to resolve likelihood mismatch issues by modeling human preferences with regret, which reflects behavior policy information. Experiments in high-dimensional continuous control tasks demonstrate PPL's significant improvements in offline RLHF performance and its effectiveness in online settings.
arXiv Detail & Related papers (2025-05-06T15:09:55Z) - DUMP: Automated Distribution-Level Curriculum Learning for RL-based LLM Post-training [15.74527731339671]
We present a principled curriculum learning framework grounded in the notion of distribution-level learnability.
Our framework prioritizes distributions with either high average advantage (exploitation) or low sample count (exploration).
Our experiments show that our framework significantly improves convergence speed and final performance.
arXiv Detail & Related papers (2025-04-13T20:10:27Z) - Fast Adaptation with Behavioral Foundation Models [82.34700481726951]
Unsupervised zero-shot reinforcement learning has emerged as a powerful paradigm for pretraining behavioral foundation models.
Despite promising results, zero-shot policies are often suboptimal due to errors induced by the unsupervised training process.
We propose fast adaptation strategies that search in the low-dimensional task-embedding space of the pre-trained BFM to rapidly improve the performance of its zero-shot policies.
arXiv Detail & Related papers (2025-04-10T16:14:17Z) - Normalization and effective learning rates in reinforcement learning [52.59508428613934]
Normalization layers have recently experienced a renaissance in the deep reinforcement learning and continual learning literature.
We show that normalization brings with it a subtle but important side effect: an equivalence between growth in the norm of the network parameters and decay in the effective learning rate.
We propose to make the learning rate schedule explicit with a simple reparameterization which we call Normalize-and-Project.
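As a quick numerical illustration of that equivalence (not code from the paper, just a toy scale-invariant function), the sketch below shows that multiplying the weights feeding a normalization layer by a constant leaves the output unchanged but shrinks the relative size of an SGD step roughly as 1 / ||w||^2; this is the effective learning-rate decay that, as the summary above reads, Normalize-and-Project makes explicit by controlling the parameter norm directly.

```python
import numpy as np

def grad_f(w, x):
    # Gradient of the scale-invariant toy function f(w) = (w / ||w||) . x
    norm = np.linalg.norm(w)
    w_hat = w / norm
    return (x - w_hat * (w_hat @ x)) / norm

rng = np.random.default_rng(0)
x = rng.normal(size=8)
w = rng.normal(size=8)
eta = 0.1  # nominal learning rate

for scale in (1.0, 2.0, 4.0):
    ws = w * scale                      # same function value, larger weight norm
    g = grad_f(ws, x)
    rel_step = eta * np.linalg.norm(g) / np.linalg.norm(ws)
    print(f"||w|| = {np.linalg.norm(ws):5.2f}   relative SGD step = {rel_step:.5f}")
```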
arXiv Detail & Related papers (2024-07-01T20:58:01Z) - Symmetric Reinforcement Learning Loss for Robust Learning on Diverse Tasks and Model Scales [13.818149654692863]
Reinforcement learning (RL) training is inherently unstable due to factors such as moving targets and high gradient variance.
In this work, we improve the stability of RL training by adapting the reverse cross entropy (RCE) from supervised learning for noisy data to define a symmetric RL loss.
arXiv Detail & Related papers (2024-05-27T19:28:33Z) - How Can LLM Guide RL? A Value-Based Approach [68.55316627400683]
Reinforcement learning (RL) has become the de facto standard practice for sequential decision-making problems by improving future acting policies with feedback.
Recent developments in large language models (LLMs) have showcased impressive capabilities in language understanding and generation, yet they fall short in exploration and self-improvement capabilities.
We develop an algorithm named LINVIT that incorporates LLM guidance as a regularization factor in value-based RL, leading to significant reductions in the amount of data needed for learning.
arXiv Detail & Related papers (2024-02-25T20:07:13Z) - RLIF: Interactive Imitation Learning as Reinforcement Learning [56.997263135104504]
We show how off-policy reinforcement learning can enable improved performance under assumptions that are similar but potentially even more practical than those of interactive imitation learning.
Our proposed method uses reinforcement learning with user intervention signals themselves as rewards.
This relaxes the assumption that intervening experts in interactive imitation learning should be near-optimal and enables the algorithm to learn behaviors that improve over the potentially suboptimal human expert.
arXiv Detail & Related papers (2023-11-21T21:05:21Z) - COPR: Continual Learning Human Preference through Optimal Policy Regularization [32.54658750353585]
We propose a new method called Continual Optimal Policy Regularization (COPR).
COPR involves a single learning phase and doesn't necessitate complex reinforcement learning.
Our experimental results show that COPR outperforms strong Continuous Learning (CL) baselines.
arXiv Detail & Related papers (2023-10-24T10:05:32Z) - Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, these learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training.
We show that, although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z) - Unbiased and Efficient Self-Supervised Incremental Contrastive Learning [31.763904668737304]
We propose a self-supervised Incremental Contrastive Learning (ICL) framework consisting of a novel Incremental InfoNCE (NCE-II) loss function.
ICL achieves up to 16.7x training speedup and 16.8x faster convergence with competitive results.
arXiv Detail & Related papers (2023-01-28T06:11:31Z) - Learning Rate Perturbation: A Generic Plugin of Learning Rate Schedule towards Flatter Local Minima [40.70374106466073]
We propose a generic learning rate schedule plugin called LEArning Rate Perturbation (LEAP).
LEAP can be applied to various learning rate schedules to improve the model training by introducing a certain perturbation to the learning rate.
We conduct extensive experiments which show that training with LEAP can improve the performance of various deep learning models on diverse datasets.
arXiv Detail & Related papers (2022-08-25T05:05:18Z) - Contrastive UCB: Provably Efficient Contrastive Self-Supervised Learning in Online Reinforcement Learning [92.18524491615548]
Contrastive self-supervised learning has been successfully integrated into the practice of (deep) reinforcement learning (RL).
We study how RL can be empowered by contrastive learning in a class of Markov decision processes (MDPs) and Markov games (MGs) with low-rank transitions.
Under the online setting, we propose novel upper confidence bound (UCB)-type algorithms that incorporate such a contrastive loss with online RL algorithms for MDPs or MGs.
arXiv Detail & Related papers (2022-07-29T17:29:08Z) - Imitating, Fast and Slow: Robust learning from demonstrations via decision-time planning [96.72185761508668]
Planning at Test-time (IMPLANT) is a new meta-algorithm for imitation learning.
We demonstrate that IMPLANT significantly outperforms benchmark imitation learning approaches on standard control environments.
arXiv Detail & Related papers (2022-04-07T17:16:52Z) - Training Efficiency and Robustness in Deep Learning [2.6451769337566406]
We study approaches to improve the training efficiency and robustness of deep learning models.
We find that prioritizing learning on more informative training data increases convergence speed and improves generalization performance on test data.
We show that a redundancy-aware modification to the sampling of training data improves the training speed, and we develop an efficient method for detecting the diversity of the training signal.
arXiv Detail & Related papers (2021-12-02T17:11:33Z) - Simplifying Deep Reinforcement Learning via Self-Supervision [51.2400839966489]
Self-Supervised Reinforcement Learning (SSRL) is a simple algorithm that optimizes policies with purely supervised losses.
We show that SSRL is surprisingly competitive with contemporary algorithms, with more stable performance and less running time.
arXiv Detail & Related papers (2021-06-10T06:29:59Z) - Adaptive Gradient Method with Resilience and Momentum [120.83046824742455]
We propose an Adaptive Gradient Method with Resilience and Momentum (AdaRem).
AdaRem adjusts the parameter-wise learning rate according to whether a parameter's past change direction is aligned with the direction of the current gradient.
Our method outperforms previous adaptive learning rate-based algorithms in terms of the training speed and the test error.
arXiv Detail & Related papers (2020-10-21T14:49:00Z) - META-Learning Eligibility Traces for More Sample Efficient Temporal Difference Learning [2.0559497209595823]
We propose a meta-learning method for adjusting the eligibility trace parameter, in a state-dependent manner.
The adaptation is achieved with the help of auxiliary learners that learn distributional information about the update targets online.
We prove that, under some assumptions, the proposed method improves the overall quality of the update targets, by minimizing the overall target error.
arXiv Detail & Related papers (2020-06-16T03:41:07Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.