Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It
- URL: http://arxiv.org/abs/2602.01826v1
- Date: Mon, 02 Feb 2026 09:00:53 GMT
- Title: Beyond Precision: Training-Inference Mismatch is an Optimization Problem and Simple LR Scheduling Fixes It
- Authors: Yaxiang Zhang, Yingru Li, Jiacai Liu, Jiawei Xu, Ziniu Li, Qian Liu, Haoyuan Li,
- Abstract summary: We show that gradient noise and training-inference mismatch escalate in tandem as training progresses. We find that the mismatch can be effectively suppressed by shrinking the update size. We propose a simple yet effective solution: a specialized Learning Rate scheduler.
- Score: 24.70923739848818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reinforcement Learning (RL) for training Large Language Models is notoriously unstable. While recent studies attribute this to "training-inference mismatch" stemming from inconsistent hybrid engines, standard remedies, such as Importance Sampling, might fail during extended training runs. In this work, we analyze this instability through the lens of optimization, demonstrating that gradient noise and training-inference mismatch escalate in tandem as training progresses. Meanwhile, we find that the mismatch can be effectively suppressed by shrinking the update size. Taken together, we deduce that the mismatch is not merely a static numerical discrepancy, but a dynamic failure coupled with the model's optimization. Based on this insight, we propose a simple yet effective solution: a specialized Learning Rate (LR) scheduler. Instead of the pre-defined decay schedule of a traditional LR scheduler, our method dynamically triggers LR decay based on response length, which we identify as a reliable early-warning signal of impending instability. Empirical evidence suggests that by reducing the learning rate as gradient noise rises, we can consistently stabilize RL training and keep the training-inference mismatch at a safe level.
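The scheduler is described only at a high level above, so the following is a minimal sketch of the idea, assuming illustrative values for the monitoring window, spike threshold, and decay factor (none of which are specified here):

```python
# Hypothetical sketch: decay the LR when mean response length spikes,
# treating the spike as the early-warning signal described in the abstract.
# Window size, spike ratio, and decay factor are illustrative assumptions.
from collections import deque


class LengthTriggeredLRScheduler:
    def __init__(self, base_lr, decay_factor=0.5, window=50, spike_ratio=1.5):
        self.lr = base_lr
        self.decay_factor = decay_factor     # multiply LR by this on trigger
        self.spike_ratio = spike_ratio       # trigger threshold vs. baseline
        self.lengths = deque(maxlen=window)  # recent mean response lengths
        self.baseline = None

    def step(self, mean_response_length):
        """Call once per RL update with the batch's mean response length."""
        self.lengths.append(mean_response_length)
        if len(self.lengths) < self.lengths.maxlen:
            return self.lr  # still collecting a baseline window
        if self.baseline is None:
            self.baseline = sum(self.lengths) / len(self.lengths)
        # A length blow-up precedes instability, so shrink the update size
        # before the training-inference mismatch escalates.
        if mean_response_length > self.spike_ratio * self.baseline:
            self.lr *= self.decay_factor
            self.baseline = mean_response_length  # re-anchor after decaying
        return self.lr
```

The trigger keeps the LR constant while generation lengths stay near their recent baseline and only decays on a spike, matching the abstract's decay-on-warning logic rather than a pre-defined step count.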
Related papers
- Stabilizing Reinforcement Learning with LLMs: Formulation and Practices [61.361819972410046]
We show why and under what conditions the true sequence-level reward can be optimized via a surrogate token-level objective in policy gradient methods such as REINFORCE. This insight provides a principled explanation for the crucial role of several widely adopted techniques in stabilizing RL training.
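The conditions under which this holds are the paper's contribution; the surrogate itself is the standard REINFORCE identity that a sequence-level reward weights the sum of token log-probabilities. A minimal sketch:

```python
import torch

def reinforce_surrogate_loss(token_logprobs, mask, reward):
    """Token-level surrogate for a sequence-level reward (plain REINFORCE).

    token_logprobs: (B, T) log pi(y_t | y_<t, x) for the sampled tokens
    mask:           (B, T) 1 for response tokens, 0 for padding
    reward:         (B,)   scalar reward per sequence
    """
    # The sum of token log-probs is the sequence log-prob, so weighting it
    # by the sequence reward yields an unbiased policy-gradient estimate.
    seq_logprob = (token_logprobs * mask).sum(dim=-1)
    return -(reward * seq_logprob).mean()
```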
arXiv Detail & Related papers (2025-12-01T07:45:39Z)
- How Learning Rate Decay Wastes Your Best Data in Curriculum-Based LLM Pretraining [22.50461083222824]
A natural approach to better leverage high-quality data is curriculum-based pretraining, where the model is trained on data sorted in ascending order of quality as determined by a quality metric. This work identifies a critical factor constraining these methods: the incompatibility between the ascending data quality order and the decaying learning rate schedule.
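A toy computation makes the incompatibility concrete: under an ascending-quality curriculum, the highest-quality data arrives exactly where a decaying schedule has pushed the learning rate toward zero (cosine decay is an assumed schedule here):

```python
import math

total_steps = 10_000
base_lr = 3e-4

def cosine_lr(step):
    # Standard cosine decay from base_lr to 0 over training.
    return 0.5 * base_lr * (1 + math.cos(math.pi * step / total_steps))

# Ascending-quality curriculum: the best data is consumed in the final
# steps, where the decayed LR leaves it almost no influence on the model.
for frac in (0.0, 0.5, 0.9, 0.99):
    step = int(frac * total_steps)
    print(f"{frac:>4.0%} through training: lr = {cosine_lr(step):.2e}")
```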
arXiv Detail & Related papers (2025-11-24T09:03:49Z)
- Dynamic Learning Rate for Deep Reinforcement Learning: A Bandit Approach [2.743898388459522]
In deep Reinforcement Learning (RL), the learning rate critically influences both stability and performance, yet its optimal value shifts during training as the environment and policy evolve. Standard decay schedulers assume monotonic convergence and often misalign with these dynamics, leading to premature or delayed adjustments. We introduce LRRL, a meta-learning approach that dynamically selects the learning rate based on policy performance rather than training steps.
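A minimal sketch of the bandit idea, treating each candidate learning rate as an arm and recent policy return as the bandit reward; the arm set, the UCB rule, and the reward definition are illustrative stand-ins for LRRL's exact design:

```python
import math

class LRBandit:
    """UCB bandit over candidate learning rates (illustrative sketch)."""

    def __init__(self, candidate_lrs=(1e-5, 3e-5, 1e-4, 3e-4)):
        self.lrs = list(candidate_lrs)
        self.counts = [0] * len(self.lrs)
        self.values = [0.0] * len(self.lrs)  # running mean reward per arm
        self.t = 0

    def select(self):
        self.t += 1
        for i, c in enumerate(self.counts):  # play every arm once first
            if c == 0:
                return i
        ucb = [v + math.sqrt(2 * math.log(self.t) / c)
               for v, c in zip(self.values, self.counts)]
        return max(range(len(ucb)), key=ucb.__getitem__)

    def update(self, arm, episode_return):
        self.counts[arm] += 1
        self.values[arm] += (episode_return - self.values[arm]) / self.counts[arm]

# Usage: arm = bandit.select(); run some updates with bandit.lrs[arm];
# then bandit.update(arm, mean_episode_return).
```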
arXiv Detail & Related papers (2024-10-16T14:15:28Z)
- Robust Deep Reinforcement Learning with Adaptive Adversarial Perturbations in Action Space [3.639580365066386]
We propose an adaptive adversarial coefficient framework to adjust the effect of the adversarial perturbation during training.
The appealing feature of our method is that it is simple to deploy in real-world applications and does not require accessing the simulator in advance.
The experiments in MuJoCo show that our method can improve the training stability and learn a robust policy when migrated to different test environments.
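A hypothetical sketch of the mechanism, as far as this summary describes it: perturb actions along an adversarial direction scaled by a coefficient, and adapt that coefficient from observed returns (the specific adaptation rule below is an assumption, not the paper's):

```python
import numpy as np

class AdaptiveActionPerturber:
    def __init__(self, coef=0.05, coef_lr=0.01, target_drop=0.1):
        self.coef = coef                 # adversarial perturbation magnitude
        self.coef_lr = coef_lr           # step size for adapting the coef
        self.target_drop = target_drop   # desired relative return degradation

    def perturb(self, action, adv_direction):
        # Push the action along a (hypothetical) adversarial direction,
        # e.g. the gradient of the critic with respect to the action.
        norm = np.linalg.norm(adv_direction) + 1e-8
        return action + self.coef * adv_direction / norm

    def adapt(self, clean_return, perturbed_return):
        # Strengthen the attack when it hurts less than intended, weaken it
        # when it hurts more, keeping training challenged but stable.
        drop = (clean_return - perturbed_return) / (abs(clean_return) + 1e-8)
        self.coef = max(0.0, self.coef + self.coef_lr * (self.target_drop - drop))
```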
arXiv Detail & Related papers (2024-05-20T12:31:11Z)
- Rethinking Classifier Re-Training in Long-Tailed Recognition: A Simple Logits Retargeting Approach [102.0769560460338]
We develop a simple logits retargeting approach (LORT) that requires no prior knowledge of the number of samples per class.
Our method achieves state-of-the-art performance on various imbalanced datasets, including CIFAR100-LT, ImageNet-LT, and iNaturalist 2018.
arXiv Detail & Related papers (2024-03-01T03:27:08Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- On the Weight Dynamics of Deep Normalized Networks [5.250288418639077]
High disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability.
We formalize how these disparities evolve over time by modeling weight dynamics of networks with normalization layers.
We prove that when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion.
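A common proxy for a layer's effective learning rate is lr * ||grad|| / ||weight||; the sketch below measures the cross-layer ELR ratios the paper analyzes, with that definition taken as an assumption:

```python
import torch

def effective_lr_ratios(model, lr):
    """Per-layer ELRs, normalized by the largest, computed after backward().

    Uses lr * ||grad|| / ||weight|| as the ELR proxy (an assumed definition).
    Ratios near 1 across layers correspond to the equalization the paper
    proves for constant-LR training of normalized networks.
    """
    elrs = {}
    for name, p in model.named_parameters():
        if p.grad is not None and p.dim() > 1:  # weight matrices only
            elrs[name] = lr * p.grad.norm().item() / (p.norm().item() + 1e-12)
    if not elrs:
        return {}
    ref = max(elrs.values())
    return {name: v / ref for name, v in elrs.items()}
```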
arXiv Detail & Related papers (2023-06-01T14:09:52Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with distributionally robust optimization (DRO) using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
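The three ideas themselves are not spelled out in this summary; the sketch below only shows the general pattern of DRO with a parametric adversary that reweights the batch loss, with a KL term keeping the implied distribution near the data (an illustrative objective, not the paper's):

```python
import torch
import torch.nn.functional as F

def parametric_dro_losses(per_example_loss, adversary_logits, kl_coef=1.0):
    """per_example_loss: (B,) model losses; adversary_logits: (B,) scores
    from a small network acting as a parametric likelihood ratio."""
    weights = F.softmax(adversary_logits, dim=0)  # normalized example weights
    weighted = (weights * per_example_loss).sum()
    # KL(weights || uniform) penalizes drifting too far from the data.
    kl = (weights * torch.log(weights * len(weights) + 1e-12)).sum()
    # The model minimizes the first value; a separate optimizer ascends
    # the second to sharpen the adversary's reweighting.
    return weighted, weighted - kl_coef * kl
```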
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Self-Damaging Contrastive Learning [92.34124578823977]
Real-world unlabeled data is commonly imbalanced and follows a long-tailed distribution.
This paper proposes a principled framework called Self-Damaging Contrastive Learning (SDCLR) to automatically balance representation learning without knowing the classes.
Our experiments show that SDCLR significantly improves not only overall accuracies but also balancedness.
arXiv Detail & Related papers (2021-06-06T00:04:49Z)
- Training Generative Adversarial Networks by Solving Ordinary Differential Equations [54.23691425062034]
We study the continuous-time dynamics induced by GAN training.
From this perspective, we hypothesise that instabilities in training GANs arise from the integration error.
We experimentally verify that well-known ODE solvers (such as Runge-Kutta) can stabilise training.
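A toy demonstration of the claim: on the bilinear min-max game f(x, y) = x*y, plain simultaneous gradient steps (explicit Euler) spiral outward, while Heun's method, a second-order Runge-Kutta scheme, stays close to the true circular orbit:

```python
import numpy as np

def heun_step(theta, vector_field, h):
    """One step of Heun's method (second-order Runge-Kutta)."""
    v1 = vector_field(theta)
    v2 = vector_field(theta + h * v1)   # Euler predictor
    return theta + 0.5 * h * (v1 + v2)  # trapezoidal corrector

# Min-max dynamics for f(x, y) = x*y: x descends, y ascends.
field = lambda th: np.array([-th[1], th[0]])
euler = heun = np.array([1.0, 0.0])
h = 0.1
for _ in range(200):
    euler = euler + h * field(euler)
    heun = heun_step(heun, field, h)
print(f"|theta| after 200 steps: Euler {np.linalg.norm(euler):.2f}, "
      f"Heun {np.linalg.norm(heun):.4f}")  # Euler diverges, Heun barely drifts
```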
arXiv Detail & Related papers (2020-10-28T15:23:49Z)
- Self-Adaptive Training: beyond Empirical Risk Minimization [15.59721834388181]
We propose a new training algorithm that dynamically corrects problematic labels by model predictions without incurring extra computational cost.
Self-adaptive training significantly improves generalization over various levels of noises, and mitigates the overfitting issue in both natural and adversarial training.
Experiments on CIFAR and ImageNet datasets verify the effectiveness of our approach in two applications.
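A stripped-down sketch of the label-correction idea: keep an exponential moving average of the model's own predictions per example and train against the blended soft targets instead of the raw labels (alpha and the omitted scheduling details are assumptions):

```python
import torch
import torch.nn.functional as F

class SelfAdaptiveTargets:
    """EMA-corrected soft targets (illustrative sketch, assumed alpha)."""

    def __init__(self, labels, num_classes, alpha=0.9):
        self.targets = F.one_hot(labels, num_classes).float()
        self.alpha = alpha

    def loss(self, logits, indices):
        # Blend the current (detached) predictions into the stored targets,
        # gradually overriding labels the model consistently disagrees with.
        probs = F.softmax(logits.detach(), dim=-1)
        self.targets[indices] = (self.alpha * self.targets[indices]
                                 + (1 - self.alpha) * probs)
        # Cross-entropy against the corrected soft targets.
        return -(self.targets[indices] * F.log_softmax(logits, dim=-1)).sum(-1).mean()
```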
arXiv Detail & Related papers (2020-02-24T15:47:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.