Explainable Learning Rate Regimes for Stochastic Optimization
- URL: http://arxiv.org/abs/2508.13639v1
- Date: Tue, 19 Aug 2025 08:51:38 GMT
- Title: Explainable Learning Rate Regimes for Stochastic Optimization
- Authors: Zhuang Yang
- Abstract summary: Modern machine learning is trained by stochastic gradient descent (SGD), whose performance depends on how the learning rate (LR) is adjusted and decreased over time. Existing LR regimes may be intricate, or need one or more additional hyper-parameters to be tuned manually. This work clarifies how the LR should be updated automatically, based only on the intrinsic variation of stochastic gradients.
- Score: 12.836088204932842
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern machine learning is trained by stochastic gradient descent (SGD), whose performance critically depends on how the learning rate (LR) is adjusted and decreased over time. Yet existing LR regimes may be intricate, or may require manually tuning one or more additional hyper-parameters, which in practice costs substantial computation, time, and power. This work clarifies, in a natural and direct manner, how the LR should be updated automatically based only on the intrinsic variation of the stochastic gradients. An explainable LR regime is developed by leveraging stochastic second-order algorithms; it behaves like heuristic schedules but is simple to implement and requires no parameter tuning, following the automatic rule that the LR should increase (decrease) as the norm of the stochastic gradients decreases (increases). The resulting LR regime demonstrates its efficiency, robustness, and scalability across classical stochastic algorithms, including SGD, SGDM, and SIGNSGD, on machine learning tasks.
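To make the rule concrete, here is a minimal sketch (on a toy least-squares problem) of an LR that moves inversely with the stochastic gradient norm; the `base_lr` value and the `1/(1 + ||g||)` form are illustrative assumptions, not the paper's derived regime.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))
w_true = rng.normal(size=20)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(20)
base_lr = 0.5
for step in range(200):
    idx = rng.integers(0, len(X), size=32)            # minibatch
    g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # stochastic gradient
    lr = base_lr / (1.0 + np.linalg.norm(g))          # LR rises as ||g|| falls
    w -= lr * g
```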
Related papers
- Efficient quantum machine learning with inverse-probability algebraic corrections [2.7412662946127764]
Quantum neural networks (QNNs) provide expressive probabilistic models by leveraging quantum superposition and entanglement. Existing training approaches largely rely on gradient-based procedural optimization.
arXiv Detail & Related papers (2026-01-23T11:28:53Z) - A Simplified Analysis of SGD for Linear Regression with Weight Averaging [64.2393952273612]
Recent work by Zou et al. (2021) provides sharp rates for SGD optimization in linear regression with a constant learning rate. We provide a simplified analysis recovering the same bias and variance bounds as Zou et al. (2021) using simple linear algebra tools. We believe our work makes the analysis of SGD on linear regression very accessible and will be helpful in further analyzing mini-batching and learning rate scheduling.
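For concreteness, a toy version of the analyzed setting: constant-LR SGD on linear regression with tail (suffix) iterate averaging. The burn-in point and step size below are arbitrary choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 10))
w_star = rng.normal(size=10)
y = X @ w_star + 0.1 * rng.normal(size=2000)

w = np.zeros(10)
lr = 0.01                          # constant LR, as in the analyzed setting
avg, n_avg = np.zeros(10), 0
burn_in = 500                      # average only the tail of the trajectory
for t in range(2000):
    i = rng.integers(len(X))
    g = (X[i] @ w - y[i]) * X[i]   # single-sample stochastic gradient
    w -= lr * g
    if t >= burn_in:               # tail (suffix) iterate averaging
        n_avg += 1
        avg += (w - avg) / n_avg   # running mean of the iterates
```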
arXiv Detail & Related papers (2025-06-18T15:10:38Z) - AutoSGD: Automatic Learning Rate Selection for Stochastic Gradient Descent [58.05410015124021]
We introduce AutoSGD: an SGD method that automatically determines whether to increase or decrease the learning rate at a given iteration. Empirical results suggest strong performance of the method on a variety of traditional optimization problems and machine learning tasks.
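A hedged sketch of the increase/decrease idea: a generic loss-comparison heuristic, not AutoSGD's actual decision rule. The multipliers `up` and `down` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true

def batch_loss(w, idx):
    r = X[idx] @ w - y[idx]
    return 0.5 * float(np.mean(r * r))

w, lr = np.zeros(5), 0.1
up, down = 1.2, 0.5                    # illustrative multipliers
for step in range(300):
    idx = rng.integers(0, len(X), size=32)
    g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
    trial = w - lr * g
    if batch_loss(trial, idx) < batch_loss(w, idx):
        w, lr = trial, lr * up         # progress: accept step, raise LR
    else:
        lr *= down                     # step hurt: reject it, shrink LR
```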
arXiv Detail & Related papers (2025-05-27T18:25:21Z) - Efficient Differentiable Approximation of Generalized Low-rank Regularization [64.73416824444328]
Low-rank regularization (LRR) has been widely applied in various machine learning tasks. In this paper, we propose an efficient differentiable approximation of LRR.
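One classical differentiable surrogate for the nuclear norm (a common LRR instance) replaces each singular value sigma with sqrt(sigma^2 + eps^2). The sketch below is that textbook smoothing, not the paper's approximation.

```python
import numpy as np

def smoothed_nuclear_norm(W, eps=1e-3):
    """Differentiable surrogate sum_i sqrt(sigma_i^2 + eps^2) -> ||W||_* as eps -> 0.
    Returns the value and its gradient U diag(sigma / sqrt(sigma^2 + eps^2)) V^T."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    smooth = np.sqrt(s**2 + eps**2)
    grad = (U * (s / smooth)) @ Vt   # chain rule through the singular values
    return smooth.sum(), grad

W = np.random.default_rng(3).normal(size=(8, 5))
val, g = smoothed_nuclear_norm(W)
```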
arXiv Detail & Related papers (2025-05-21T11:49:17Z) - A Stochastic Approach to Bi-Level Optimization for Hyperparameter Optimization and Meta Learning [74.80956524812714]
We tackle the general differentiable meta-learning problem that is ubiquitous in modern deep learning. These problems are often formalized as Bi-Level Optimizations (BLO). We introduce a novel perspective by turning a given BLO problem into a stochastic optimization problem, where the inner loss function becomes a smooth probability distribution and the outer loss becomes an expected loss over the inner distribution.
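A heavily hedged sketch of the smoothing idea: replace the inner minimizer with a Gibbs-style distribution over the inner variable and evaluate the outer loss in expectation. The toy objectives, proposal, and temperature `tau` are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)

def inner_loss(x, theta):              # toy inner objective (hypothetical)
    return (theta - x) ** 2

def outer_loss(x, theta):              # toy outer objective (hypothetical)
    return (theta - 1.0) ** 2

def smoothed_outer(x, tau=0.1, n=4000):
    # Gibbs weights p(theta|x) ~ exp(-inner_loss/tau) under a N(x, 1) proposal,
    # combined via self-normalized importance sampling.
    theta = rng.normal(loc=x, scale=1.0, size=n)
    logw = -inner_loss(x, theta) / tau + 0.5 * (theta - x) ** 2
    w = np.exp(logw - logw.max())
    return float(np.sum(w * outer_loss(x, theta)) / np.sum(w))

print(smoothed_outer(0.0), smoothed_outer(1.0))  # outer loss falls as x -> 1
```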
arXiv Detail & Related papers (2024-10-14T12:10:06Z) - Enhancing Zeroth-order Fine-tuning for Language Models with Low-rank Structures [21.18741772731095]
Zeroth-order (ZO) algorithms offer a promising alternative by approximating gradients using finite differences of function values.
Existing ZO methods struggle to capture the low-rank gradient structure common in LLM fine-tuning, leading to suboptimal performance.
This paper proposes a low-rank ZO algorithm (LOZO) that effectively captures this structure in LLMs.
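A sketch of the low-rank zeroth-order idea: probe along a random rank-r direction and form a central finite-difference gradient estimate. The toy loss, rank, and smoothing radius `mu` are illustrative assumptions, not LOZO's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(4)

def loss(W):                                   # toy stand-in for a model loss
    return float(np.sum((W @ W.T - np.eye(W.shape[0])) ** 2))

W = rng.normal(size=(16, 16)) * 0.1
r, mu, lr = 2, 1e-3, 1e-3
for step in range(100):
    U = rng.normal(size=(16, r))
    V = rng.normal(size=(16, r))
    Z = (U @ V.T) / np.sqrt(r)                 # random rank-r probe direction
    d = (loss(W + mu * Z) - loss(W - mu * Z)) / (2 * mu)  # central difference
    W -= lr * d * Z                            # rank-r zeroth-order update
```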
arXiv Detail & Related papers (2024-10-10T08:10:53Z) - On the Weight Dynamics of Deep Normalized Networks [5.250288418639077]
High disparities in effective learning rates (ELRs) across layers in deep neural networks can negatively affect trainability.
We formalize how these disparities evolve over time by modeling weight dynamics of networks with normalization layers.
We prove that when training with any constant learning rate, ELR ratios converge to 1, despite initial gradient explosion.
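A common definition of a layer's effective learning rate is lr * ||g_l|| / ||w_l||; the sketch below computes ELRs and their ratios for toy layers. (Hedged: the paper's precise definition may differ.)

```python
import numpy as np

rng = np.random.default_rng(5)
lr = 0.1
# Toy layers whose gradient norms differ by orders of magnitude.
layers = {f"layer{i}": (rng.normal(size=(64, 64)),             # weights
                        rng.normal(size=(64, 64)) * 10.0**-i)  # gradients
          for i in range(3)}
elr = {name: lr * np.linalg.norm(g) / np.linalg.norm(w)
       for name, (w, g) in layers.items()}
worst = max(elr.values())
ratios = {name: e / worst for name, e in elr.items()}  # cross-layer disparity
```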
arXiv Detail & Related papers (2023-06-01T14:09:52Z) - Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules.
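The paper derives its step-sizes from diagonal Hessian blocks; as a cheap stand-in, the sketch below estimates per-layer curvature with a Barzilai-Borwein-style secant ratio (an assumed proxy, not the paper's formula).

```python
import numpy as np

def layer_step(w_prev, w_curr, g_prev, g_curr, lo=1e-4, hi=1.0):
    """Per-layer step from a secant curvature estimate:
    step = <dw, dw> / <dw, dg>, clipped to [lo, hi]."""
    dw = (w_curr - w_prev).ravel()
    dg = (g_curr - g_prev).ravel()
    denom = float(dw @ dg)
    if denom <= 0.0:   # non-positive curvature estimate: be conservative
        return lo
    return float(np.clip((dw @ dw) / denom, lo, hi))
```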
arXiv Detail & Related papers (2023-05-23T04:12:55Z) - Convergence of ease-controlled Random Reshuffling gradient Algorithms under Lipschitz smoothness [0.0]
We consider minimizing the average of a very large number of smooth, possibly non-convex functions, and we use two widely adopted minibatch frameworks, Incremental Gradient (IG) and Random Reshuffling (RR), to tackle this problem.
We define ease-controlled modifications of IG/RR schemes, which require a light additional computational effort.
We compare our implementation with both a full batch gradient method (i.e., L-BFGS) and standard implementations of IG/RR methods, showing that the algorithms require a similar computational effort.
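The two frameworks differ only in how the per-epoch sample order is drawn; a minimal sketch on a toy least-squares problem:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(512, 8))
w_true = rng.normal(size=8)
y = X @ w_true

w, lr, n = np.zeros(8), 0.01, len(X)
for epoch in range(20):
    order = rng.permutation(n)    # Random Reshuffling: fresh shuffle per epoch
    # order = np.arange(n)        # Incremental Gradient: fixed cyclic order
    for i in order:
        g = (X[i] @ w - y[i]) * X[i]   # single-sample gradient
        w -= lr * g
```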
arXiv Detail & Related papers (2022-12-04T15:26:36Z) - Fast Distributionally Robust Learning with Variance Reduced Min-Max Optimization [85.84019017587477]
Distributionally robust supervised learning (DRSL) is emerging as a key paradigm for building reliable machine learning systems for real-world applications. Existing algorithms for solving Wasserstein DRSL involve solving complex subproblems or fail to make use of stochastic gradients.
We revisit Wasserstein DRSL through the lens of min-max optimization and derive scalable and efficiently implementable extra-gradient algorithms.
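A minimal extra-gradient sketch on a toy bilinear saddle problem (not the paper's variance-reduced variant): take a lookahead half-step, then update using the lookahead gradients. Plain simultaneous gradient descent-ascent diverges on this problem; extra-gradient converges.

```python
import numpy as np

# min_x max_y x^T A y; saddle point at x = y = 0 for invertible A.
rng = np.random.default_rng(7)
A = rng.normal(size=(4, 4))
x, y, eta = np.ones(4), np.ones(4), 0.1
for _ in range(500):
    x_mid = x - eta * (A @ y)      # extrapolation (lookahead) step
    y_mid = y + eta * (A.T @ x)
    x = x - eta * (A @ y_mid)      # update with lookahead gradients
    y = y + eta * (A.T @ x_mid)
```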
arXiv Detail & Related papers (2021-04-27T16:56:09Z) - Learning Neural Network Quantum States with the Linear Method [0.0]
We show that the linear method can be used successfully for the optimization of complex valued neural network quantum states.
We compare the LM to the state-of-the-art SR algorithm and find that the LM requires up to an order of magnitude fewer iterations for convergence.
arXiv Detail & Related papers (2021-04-22T12:18:33Z) - MLR-SNet: Transferable LR Schedules for Heterogeneous Tasks [56.66010634895913]
The learning rate (LR) is one of the most important hyper-parameters in stochastic gradient descent (SGD) training of deep neural networks (DNNs).
In this paper, we propose to learn a proper LR schedule with a meta-learner, MLR-SNet.
We also transfer the learned MLR-SNet to query tasks that differ from the training tasks in noise, architecture, data modality, and size, and it achieves comparable or even better performance.
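A toy analogue of a learnable LR schedule: an LSTM cell that reads the current training loss and emits an LR in (0, max_lr). The architecture and sizes here are hypothetical, not MLR-SNet's actual design.

```python
import torch
import torch.nn as nn

class LRScheduler(nn.Module):
    def __init__(self, hidden=16, max_lr=0.1):
        super().__init__()
        self.cell = nn.LSTMCell(1, hidden)   # reads a scalar loss per step
        self.head = nn.Linear(hidden, 1)
        self.max_lr = max_lr
        self.state = None                    # (h, c), initialized lazily

    def forward(self, loss_value):
        x = torch.tensor([[loss_value]], dtype=torch.float32)
        self.state = self.cell(x, self.state)
        h, _ = self.state
        return self.max_lr * torch.sigmoid(self.head(h)).item()

sched = LRScheduler()
lr = sched(0.42)   # feed the current loss, get the next LR
```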
arXiv Detail & Related papers (2020-07-29T01:18:58Z)