Cautious Weight Decay
- URL: http://arxiv.org/abs/2510.12402v1
- Date: Tue, 14 Oct 2025 11:32:55 GMT
- Title: Cautious Weight Decay
- Authors: Lizhang Chen, Jonathan Li, Kaizhao Liang, Baiyu Su, Cong Xie, Nuo Wang Pierse, Chen Liang, Ni Lao, Qiang Liu,
- Abstract summary: Cautious Weight Decay (CWD) is a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy.
- Score: 23.198565281737896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce Cautious Weight Decay (CWD), a one-line, optimizer-agnostic modification that applies weight decay only to parameter coordinates whose signs align with the optimizer update. Unlike standard decoupled decay, which implicitly optimizes a regularized or constrained objective, CWD preserves the original loss and admits a bilevel interpretation: it induces sliding-mode behavior upon reaching the stationary manifold, allowing it to search for locally Pareto-optimal stationary points of the unmodified objective. In practice, CWD is a drop-in change for optimizers such as AdamW, Lion, and Muon, requiring no new hyperparameters or additional tuning. For language model pre-training and ImageNet classification, CWD consistently improves final loss and accuracy at million- to billion-parameter scales.
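The masking rule described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `cwd_update` is hypothetical, and the convention that `update` is the full signed step already scaled by the learning rate (e.g. `-lr * adam_direction`) is an assumption.

```python
import numpy as np

def cwd_update(theta, update, lr, wd):
    """One Cautious Weight Decay step. Decay is applied only to
    coordinates where sign(theta) matches sign(update), per the
    abstract; elsewhere the parameter is left undecayed."""
    mask = (np.sign(theta) == np.sign(update)).astype(theta.dtype)
    return theta + update - lr * wd * mask * theta
```

Because the mask multiplies only the decoupled decay term, the masked coordinates follow the unmodified optimizer step, which is what preserves the original loss in the bilevel interpretation above.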
Related papers
- Robust Unscented Kalman Filtering via Recurrent Meta-Adaptation of Sigma-Point Weights [0.0]
This work introduces the Meta-Adaptive UKF (MA-UKF), a framework that reformulates sigma-point weighting as a hyperparameter optimization problem. Unlike standard adaptive filters that rely on instantaneous corrections, our approach employs a Recurrent Context module to compress the history of measurement innovations into a compact latent embedding. Numerical benchmarks on maneuvering targets demonstrate that the MA-UKF significantly outperforms standard baselines.
arXiv Detail & Related papers (2026-03-04T18:27:59Z) - ECO: Quantized Training without Full-Precision Master Weights [58.97082407934466]
Error-Compensating (ECO) training eliminates master weights by applying updates directly to quantized parameters. We show that ECO converges to a constant-radius neighborhood of the optimum, while naive master-weight removal can incur an error that is inversely proportional to the learning rate.
arXiv Detail & Related papers (2026-01-29T18:35:01Z) - AdamHD: Decoupled Huber Decay Regularization for Language Model Pre-Training [0.2578242050187029]
AdamHuberDecay is a drop-in replacement for AdamW that substitutes the $\ell_2$ penalty with a decoupled smooth Huber regularizer. Experiments on GPT-2 and GPT-3 pre-training demonstrate that AdamHuberDecay converges 10-15% faster in wall-clock time.
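The summary does not give the regularizer's exact form, but the gradient of a standard Huber penalty is `theta` inside the quadratic region and `delta * sign(theta)` outside it, i.e. `clip(theta, -delta, delta)`. A minimal sketch of such a decoupled decay step (the function name and the `delta` parameter are assumptions, not the paper's notation):

```python
import numpy as np

def huber_decay_step(theta, lr, wd, delta=1.0):
    # Gradient of the Huber penalty: identity in the quadratic
    # region |theta| <= delta, constant delta * sign(theta) in the
    # linear tails. Applied decoupled from the loss gradient,
    # in the style of AdamW's decoupled weight decay.
    return theta - lr * wd * np.clip(theta, -delta, delta)
```

Unlike plain $\ell_2$ decay, the pull toward zero saturates for large coordinates, so outlier weights are not penalized quadratically.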
arXiv Detail & Related papers (2025-11-18T18:08:20Z) - Closed-Form Last Layer Optimization [72.49151473937319]
Under a squared loss, the optimal solution for the linear last-layer weights is known in closed form. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer.
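For a squared loss, the closed-form last-layer solution is ordinary least squares over the backbone features. A minimal sketch (the function name and the small `ridge` term, added for numerical stability, are assumptions beyond the summary):

```python
import numpy as np

def last_layer_closed_form(features, targets, ridge=1e-8):
    # Solve W = argmin ||features @ W - targets||^2 via the normal
    # equations: (features^T features + ridge*I) W = features^T targets.
    d = features.shape[1]
    gram = features.T @ features + ridge * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)
```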
arXiv Detail & Related papers (2025-10-06T09:14:39Z) - A Unified Noise-Curvature View of Loss of Trainability [8.602734307457387]
Loss of trainability (LoT) in continual learning occurs when gradient steps no longer yield improvement as tasks evolve. We introduce two complementary criteria: a batch-size-aware gradient-noise bound and a curvature-volatility-controlled bound. Using the resulting threshold, we build a simple per-layer scheduler that keeps each layer's effective step size below a safe limit.
arXiv Detail & Related papers (2025-09-24T02:11:13Z) - SPRINT: Stochastic Performative Prediction With Variance Reduction [18.735898645810405]
Performative prediction (PP) is an algorithmic framework for machine learning (ML) models where the model's deployment affects the distribution of the data it is trained on. We propose a new variance-reduced algorithm for stochastic performative prediction (SSPS) and validate it experimentally.
arXiv Detail & Related papers (2025-09-22T00:56:17Z) - Rethinking Weight Decay for Robust Fine-Tuning of Foundation Models [27.847140934456288]
This paper proposes a new weight decay technique, Selective Projection Decay (SPD).
SPD selectively imposes a strong penalty on certain layers while allowing others to change freely.
When equipped with SPD, Adam consistently provides better in-distribution robustness and out-of-distribution performance on benchmarks.
arXiv Detail & Related papers (2024-11-03T23:36:53Z) - Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise [60.92029979853314]
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
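The summary does not specify the accelerated variant, but a generic step combining the two ingredients it names, normalization and clip-by-norm, can be sketched as follows (the function name and threshold are assumptions):

```python
import numpy as np

def norm_clip_step(theta, grad, lr, clip_thresh=1.0):
    # Rescale the gradient so its norm is at most clip_thresh; when
    # the norm exceeds the threshold this reduces to normalization
    # (direction only), which bounds the step length under
    # heavy-tailed gradient noise.
    norm = np.linalg.norm(grad)
    scale = min(1.0, clip_thresh / max(norm, 1e-12))
    return theta - lr * scale * grad
```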
arXiv Detail & Related papers (2024-10-21T22:40:42Z) - Adaptive Preference Scaling for Reinforcement Learning with Human Feedback [103.36048042664768]
Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values.
We propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO).
Our method is versatile and can be readily adapted to various preference optimization frameworks.
arXiv Detail & Related papers (2024-06-04T20:33:22Z) - Improving Robustness with Adaptive Weight Decay [8.096469295357737]
We propose adaptive weight decay, which automatically tunes the hyperparameter for weight decay during each training iteration.
We show that this simple modification can result in large improvements in robustness.
This method has other desirable properties, such as less sensitivity to learning rate, and smaller weight norms.
arXiv Detail & Related papers (2022-09-30T21:13:00Z) - STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization [74.1615979057429]
We investigate stochastic nonconvex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
arXiv Detail & Related papers (2021-11-01T15:43:36Z) - Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, albeit only up to an error neighborhood. Rather than fixing the minibatch size and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z) - AdamP: Slowing Down the Slowdown for Momentum Optimizers on Scale-invariant Weights [53.8489656709356]
Normalization techniques are a boon for modern deep learning.
It is often overlooked, however, that the additional introduction of momentum results in a rapid reduction in effective step sizes for scale-invariant weights.
In this paper, we verify that the widely adopted combination of the two ingredients leads to the premature decay of effective step sizes and sub-optimal model performance.
arXiv Detail & Related papers (2020-06-15T08:35:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.