BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization
- URL: http://arxiv.org/abs/2305.18666v2
- Date: Thu, 2 Nov 2023 04:23:07 GMT
- Title: BiSLS/SPS: Auto-tune Step Sizes for Stable Bi-level Optimization
- Authors: Chen Fan, Gaspard Choné-Ducasse, Mark Schmidt, Christos Thrampoulidis
- Abstract summary: Existing algorithms involve two coupled learning rates that can be affected by approximation errors when computing hypergradients.
We investigate the use of adaptive step-size methods, namely line search (SLS) and Polyak step size (SPS), for computing both the upper and lower-level learning rates.
New algorithms, which are available in both SGD and Adam versions, can find large learning rates with minimal tuning and converge faster than corresponding vanilla BO algorithms.
- Score: 33.082961718280245
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The popularity of bi-level optimization (BO) in deep learning has spurred a
growing interest in studying gradient-based BO algorithms. However, existing
algorithms involve two coupled learning rates that can be affected by
approximation errors when computing hypergradients, making careful fine-tuning
necessary to ensure fast convergence. To alleviate this issue, we investigate
the use of recently proposed adaptive step-size methods, namely stochastic line
search (SLS) and stochastic Polyak step size (SPS), for computing both the
upper and lower-level learning rates. First, we revisit the use of SLS and SPS
in single-level optimization without the additional interpolation condition
that is typically assumed in prior works. For such settings, we investigate new
variants of SLS and SPS that improve upon existing suggestions in the
literature and are simpler to implement. Importantly, these two variants can be
seen as special instances of a general family of methods with an envelope-type
step-size. This unified envelope strategy allows for the extension of the
algorithms and their convergence guarantees to BO settings. Finally, our
extensive experiments demonstrate that the new algorithms, which are available
in both SGD and Adam versions, can find large learning rates with minimal
tuning and converge faster than corresponding vanilla SGD or Adam BO algorithms
that require fine-tuning.
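For context, the sketch below illustrates the two standard adaptive step-size rules the abstract builds on: a capped stochastic Polyak step size (SPS_max-style) and an Armijo-style stochastic line search. This is a minimal NumPy sketch of the generic single-level rules, not the paper's exact BiSLS/BiSPS algorithms; the function names and the constants `c`, `beta`, and `eta_max` are illustrative choices.

```python
import numpy as np

def sps_step_size(loss, grad, f_star=0.0, c=0.5, eta_max=1.0):
    """Capped stochastic Polyak step size (sketch).

    eta = min( (loss - f_star) / (c * ||grad||^2), eta_max )
    The cap eta_max acts as an envelope that keeps the step size
    bounded when the sampled gradient is small or noisy.
    """
    g_norm_sq = float(np.dot(grad, grad))
    if g_norm_sq == 0.0:
        return eta_max
    return min((loss - f_star) / (c * g_norm_sq), eta_max)

def sls_step_size(f, x, grad, eta_init=1.0, beta=0.7, c=0.1, max_backtracks=50):
    """Stochastic line search via Armijo backtracking (sketch).

    Shrinks eta geometrically until the sampled loss satisfies
    f(x - eta * grad) <= f(x) - c * eta * ||grad||^2.
    """
    fx = f(x)
    g_norm_sq = float(np.dot(grad, grad))
    eta = eta_init
    for _ in range(max_backtracks):
        if f(x - eta * grad) <= fx - c * eta * g_norm_sq:
            break
        eta *= beta
    return eta
```

In a bi-level loop, a rule of this kind would be queried separately for the lower-level update on the inner objective and for the upper-level update on the (approximate) hypergradient, which is what removes the need to hand-tune the two coupled learning rates.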
Related papers
- Provably Faster Algorithms for Bilevel Optimization via Without-Replacement Sampling [96.47086913559289]
Gradient-based algorithms are widely used in bilevel optimization.
We introduce a without-replacement sampling based algorithm which achieves a faster convergence rate.
We validate our algorithms over both synthetic and real-world applications.
arXiv Detail & Related papers (2024-11-07T17:05:31Z)
- Tuning-Free Bilevel Optimization: New Algorithms and Convergence Analysis [21.932550214810533]
We propose two novel tuning-free algorithms, D-TFBO and S-TFBO.
D-TFBO employs a double-loop structure with stepsizes adaptively adjusted by the "inverse of cumulative gradient norms" strategy.
S-TFBO features a simpler fully single-loop structure that updates three variables simultaneously with a theory-motivated joint design of adaptive stepsizes for all variables.
arXiv Detail & Related papers (2024-10-07T15:50:30Z)
- Layer-wise Adaptive Step-Sizes for Stochastic First-Order Methods for Deep Learning [8.173034693197351]
We propose a new per-layer adaptive step-size procedure for first-order optimization methods in deep learning.
The proposed approach exploits the layer-wise curvature information contained in the diagonal blocks of the Hessian in deep neural networks (DNNs) to compute adaptive step-sizes (i.e., LRs) for each layer.
Numerical experiments show that SGD with momentum and AdamW combined with the proposed per-layer step-sizes are able to choose effective LR schedules.
arXiv Detail & Related papers (2023-05-23T04:12:55Z)
- Formal guarantees for heuristic optimization algorithms used in machine learning [6.978625807687497]
Stochastic Gradient Descent (SGD) and its variants have become the dominant methods for large-scale optimization problems in machine learning (ML).
We provide formal guarantees for several convex optimization methods and propose improved algorithms.
arXiv Detail & Related papers (2022-07-31T19:41:22Z)
- Adaptive First- and Second-Order Algorithms for Large-Scale Machine Learning [3.0204520109309843]
We consider first- and second-order techniques to address continuous optimization problems in machine learning.
In the first-order case, we propose a framework of transition from semi-deterministic to quadratic regularization methods.
In the second-order case, we propose a novel first-order algorithm with adaptive sampling and adaptive step size.
arXiv Detail & Related papers (2021-11-29T18:10:00Z)
- ES-Based Jacobian Enables Faster Bilevel Optimization [53.675623215542515]
Bilevel optimization (BO) has arisen as a powerful tool for solving many modern machine learning problems.
Existing gradient-based methods require second-order derivative approximations via Jacobian- and/or Hessian-vector computations.
We propose a novel BO algorithm, which adopts Evolution Strategies (ES) based method to approximate the response Jacobian matrix in the hypergradient of BO.
arXiv Detail & Related papers (2021-10-13T19:36:50Z)
- Lower Bounds and Optimal Algorithms for Smooth and Strongly Convex Decentralized Optimization Over Time-Varying Networks [79.16773494166644]
We consider the task of minimizing the sum of smooth and strongly convex functions stored in a decentralized manner across the nodes of a communication network.
We design two optimal algorithms that attain the corresponding lower bounds.
We corroborate the theoretical efficiency of these algorithms by performing an experimental comparison with existing state-of-the-art methods.
arXiv Detail & Related papers (2021-06-08T15:54:44Z)
- Learning Sampling Policy for Faster Derivative Free Optimization [100.27518340593284]
We propose a new reinforcement learning based zeroth-order algorithm (ZO-RL) that learns the sampling policy for generating the perturbations in ZO optimization instead of using random sampling.
Our results show that ZO-RL can effectively reduce the variance of the ZO gradient estimates by learning a sampling policy, and converges faster than existing ZO algorithms in different scenarios.
arXiv Detail & Related papers (2021-04-09T14:50:59Z)
- Single-Timescale Stochastic Nonconvex-Concave Optimization for Smooth Nonlinear TD Learning [145.54544979467872]
We propose two single-timescale single-loop algorithms that require only one data point each step.
Our results are expressed in the form of simultaneous primal- and dual-side convergence.
arXiv Detail & Related papers (2020-08-23T20:36:49Z)
- Balancing Rates and Variance via Adaptive Batch-Size for Stochastic Optimization Problems [120.21685755278509]
In this work, we seek to balance the fact that an attenuating step-size is required for exact convergence against the fact that a constant step-size learns faster, but only up to an error neighborhood.
Rather than fixing the minibatch and the step-size at the outset, we propose to allow these parameters to evolve adaptively.
arXiv Detail & Related papers (2020-07-02T16:02:02Z)