Related papers: A Learn-to-Optimize Approach for Coordinate-Wise Step Sizes for Quasi-Newton Methods

A Learn-to-Optimize Approach for Coordinate-Wise Step Sizes for Quasi-Newton Methods

URL: http://arxiv.org/abs/2412.00059v2
Date: Mon, 19 May 2025 07:29:00 GMT
Title: A Learn-to-Optimize Approach for Coordinate-Wise Step Sizes for Quasi-Newton Methods
Authors: Wei Lin, Qingyu Song, Hong Xu,
Abstract summary: We introduce a learn-to-optimize (L2O) method that employs LSTM-based networks to learn optimal step sizes.<n>Our approach achieves substantial improvements over scalar step size methods and hypergradient descent-based method.
Score: 9.82454981262489
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Tuning step sizes is crucial for the stability and efficiency of optimization algorithms. While adaptive coordinate-wise step sizes have been shown to outperform scalar step size in first-order methods, their use in second-order methods is still under-explored and more challenging. Current approaches, including hypergradient descent and cutting plane methods, offer limited improvements or encounter difficulties in second-order contexts. To address these limitations, we first conduct a theoretical analysis within the Broyden-Fletcher-Goldfarb-Shanno (BFGS) framework, a prominent quasi-Newton method, and derive sufficient conditions for coordinate-wise step sizes that ensure convergence and stability. Building on this theoretical foundation, we introduce a novel learn-to-optimize (L2O) method that employs LSTM-based networks to learn optimal step sizes by leveraging insights from past optimization trajectories, while inherently respecting the derived theoretical guarantees. Extensive experiments demonstrate that our approach achieves substantial improvements over scalar step size methods and hypergradient descent-based method, offering up to 4$\times$ faster convergence across diverse optimization tasks.

Related papers

Neural Network Training via Stochastic Alternating Minimization with Trainable Step Sizes [3.246129789918632]
The training of deep neural networks is inherently a non- optimization problem.<n>Standard approaches such as gradient descent (SGD) require simultaneous updates to parameters.<n>We propose a novel method Alternating Train Miniization with tailored step sizes (SAMT)<n>SAMT achieves better performance with fewer parameter updates compared to state-of-the-art methods.
arXiv Detail & Related papers (2025-08-06T08:23:38Z)
Gradient Methods with Online Scaling Part I. Theoretical Foundations [19.218484733179356]
This paper establishes the theoretical foundations of the online scaled methods (OSGM)<n>OSGM quantifies the effectiveness of a stepsize by a feedback function motivated from a convergence measure and uses the feedback to adjust the stepsize through an online learning algorithm.<n>OSGM yields desirable convergence guarantees on smooth convex problems, including 1) trajectory-dependent global convergence on smooth convex objectives; 2) an improved complexity result on smooth strongly convex problems, and 3) local superlinear convergence.
arXiv Detail & Related papers (2025-05-29T04:35:21Z)
Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
Aligning Few-Step Diffusion Models with Dense Reward Difference Learning [81.85515625591884]
Stepwise Diffusion Policy Optimization (SDPO) is an alignment method tailored for few-step diffusion models. SDPO incorporates dense reward feedback at every intermediate step to ensure consistent alignment across all denoising steps. SDPO consistently outperforms prior methods in reward-based alignment across diverse step configurations.
arXiv Detail & Related papers (2024-11-18T16:57:41Z)
An Adaptive Dimension Reduction Estimation Method for High-dimensional Bayesian Optimization [6.79843988450982]
We propose a two-step optimization framework to extend BO to high-dimensional settings. Our algorithm offers the flexibility to operate these steps either concurrently or in sequence. Numerical experiments validate the efficacy of our method in challenging scenarios.
arXiv Detail & Related papers (2024-03-08T16:21:08Z)
Double Duality: Variational Primal-Dual Policy Optimization for Constrained Reinforcement Learning [132.7040981721302]
We study the Constrained Convex Decision Process (MDP), where the goal is to minimize a convex functional of the visitation measure. Design algorithms for a constrained convex MDP faces several challenges, including handling the large state space.
arXiv Detail & Related papers (2024-02-16T16:35:18Z)
SANIA: Polyak-type Optimization Framework Leads to Scale Invariant Stochastic Algorithms [1.21748738176366]
Techniques such as Adam, AdaGrad, and AdaHessian utilize a preconditioner that the search impacts by incorporating the curvature of the objective function. This paper proposes SANIA to tackle these challenges.
arXiv Detail & Related papers (2023-12-28T21:28:08Z)
Locally Optimal Descent for Dynamic Stepsize Scheduling [45.6809308002043]
We introduce a novel dynamic learning scheduling scheme grounded in theory with the goal of simplifying the manual and time-consuming tuning of schedules in phases. Our approach is based on estimating a locally-optimal practice tuning-rate in the direction of a smooth gradient. Our findings indicate that our method needs minimal tuning when compared to existing approaches.
arXiv Detail & Related papers (2023-11-23T09:57:35Z)
Adaptive Learning Rates for Faster Stochastic Gradient Methods [6.935471115003109]
We propose new adaptive step size strategies that improve several quadratic convex gradient methods. Our first method is based on the classical Polyak step size (Polyak, 1987) and is an extension of the recent development of this method. Our second method, denoted GraDS, rescales step size by "diversity of gradients"
arXiv Detail & Related papers (2022-08-10T11:36:00Z)
Formal guarantees for heuristic optimization algorithms used in machine learning [6.978625807687497]
Gradient Descent (SGD) and its variants have become the dominant methods in the large-scale optimization machine learning (ML) problems. We provide formal guarantees of a few convex optimization methods and proposing improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z)
Local Quadratic Convergence of Stochastic Gradient Descent with Adaptive Step Size [29.15132344744801]
We establish local convergence for gradient descent with adaptive step size for problems such as matrix inversion. We show that these first order optimization methods can achieve sub-linear or linear convergence.
arXiv Detail & Related papers (2021-12-30T00:50:30Z)
Bolstering Stochastic Gradient Descent with Model Building [0.0]
gradient descent method and its variants constitute the core optimization algorithms that achieve good convergence rates. We propose an alternative approach to line search by using a new algorithm based on forward step model building. We show that the proposed algorithm achieves faster convergence and better generalization in well-known test problems.
arXiv Detail & Related papers (2021-11-13T06:54:36Z)
On Constraints in First-Order Optimization: A View from Non-Smooth Dynamical Systems [99.59934203759754]
We introduce a class of first-order methods for smooth constrained optimization. Two distinctive features of our approach are that projections or optimizations over the entire feasible set are avoided. The resulting algorithmic procedure is simple to implement even when constraints are nonlinear.
arXiv Detail & Related papers (2021-07-17T11:45:13Z)
Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator [75.05106948314956]
We show that an increasing large momentum parameter for the first-order moment is sufficient for adaptive scaling. We also give insights for increasing the momentum in a stagewise manner in accordance with stagewise decreasing step size.
arXiv Detail & Related papers (2021-04-30T08:50:24Z)
Acceleration Methods [57.202881673406324]
We first use quadratic optimization problems to introduce two key families of acceleration methods. We discuss momentum methods in detail, starting with the seminal work of Nesterov. We conclude by discussing restart schemes, a set of simple techniques for reaching nearly optimal convergence rates.
arXiv Detail & Related papers (2021-01-23T17:58:25Z)
Cogradient Descent for Bilinear Optimization [124.45816011848096]
We introduce a Cogradient Descent algorithm (CoGD) to address the bilinear problem. We solve one variable by considering its coupling relationship with the other, leading to a synchronous gradient descent. Our algorithm is applied to solve problems with one variable under the sparsity constraint.
arXiv Detail & Related papers (2020-06-16T13:41:54Z)
Effective Dimension Adaptive Sketching Methods for Faster Regularized Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching. We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT)
arXiv Detail & Related papers (2020-06-10T15:00:09Z)

This list is automatically generated from the titles and abstracts of the papers in this site.