SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation
- URL: http://arxiv.org/abs/2502.18153v1
- Date: Tue, 25 Feb 2025 12:35:05 GMT
- Title: SASSHA: Sharpness-aware Adaptive Second-order Optimization with Stable Hessian Approximation
- Authors: Dahun Shin, Dongyeop Lee, Jinseok Chung, Namhoon Lee
- Abstract summary: Sassha is a novel second-order method designed to enhance generalization by explicitly reducing sharpness of the solution. We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.
- Score: 5.523554757946985
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Approximate second-order optimization methods often exhibit poorer generalization compared to first-order approaches. In this work, we look into this issue through the lens of the loss landscape and find that existing second-order methods tend to converge to sharper minima compared to SGD. In response, we propose Sassha, a novel second-order method designed to enhance generalization by explicitly reducing sharpness of the solution, while stabilizing the computation of approximate Hessians along the optimization trajectory. In fact, this sharpness minimization scheme is crafted also to accommodate lazy Hessian updates, so as to secure efficiency besides flatness. To validate its effectiveness, we conduct a wide range of standard deep learning experiments where Sassha demonstrates its outstanding generalization performance that is comparable to, and mostly better than, other methods. We provide a comprehensive set of analyses including convergence, robustness, stability, efficiency, and cost.
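Based only on the abstract above, the following is a minimal sketch of how a SAM-style sharpness-reduction step can be combined with a lazily refreshed diagonal Hessian preconditioner. It is not the authors' released Sassha implementation: the Hutchinson estimator, the absolute-value-and-square-root preconditioning, and names such as `rho` and `hessian_update_freq` are assumptions standing in for details the abstract does not give.

```python
import torch

def hutchinson_diag_hessian(loss, params, n_samples=1):
    """Stochastic estimate of the Hessian diagonal via Hutchinson's estimator."""
    grads = torch.autograd.grad(loss, params, create_graph=True)
    diag = [torch.zeros_like(p) for p in params]
    for _ in range(n_samples):
        zs = [torch.randint_like(p, 0, 2) * 2.0 - 1.0 for p in params]  # Rademacher probes
        hvps = torch.autograd.grad(grads, params, grad_outputs=zs, retain_graph=True)
        for d, z, hvp in zip(diag, zs, hvps):
            d.add_(z * hvp / n_samples)
    return diag

def sassha_like_step(model, loss_fn, batch, state, lr=0.1, rho=0.05,
                     hessian_update_freq=10, eps=1e-8):
    params = [p for p in model.parameters() if p.requires_grad]

    # (1) SAM-style ascent: perturb the weights toward higher loss.
    loss = loss_fn(model, batch)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g ** 2).sum() for g in grads)) + eps
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(rho * g / grad_norm)

    # (2) Gradient, and only occasionally a fresh Hessian diagonal, at the perturbed point.
    loss_pert = loss_fn(model, batch)
    if state["step"] % hessian_update_freq == 0:  # lazy Hessian refresh
        state["hess_diag"] = [h.abs() for h in hutchinson_diag_hessian(loss_pert, params)]
    grads_pert = torch.autograd.grad(loss_pert, params)

    # (3) Undo the perturbation, then take a diagonally preconditioned descent step.
    with torch.no_grad():
        for p, g, gp, h in zip(params, grads, grads_pert, state["hess_diag"]):
            p.sub_(rho * g / grad_norm)            # restore the original weights
            p.sub_(lr * gp / (h.sqrt() + eps))     # second-order-style scaling
    state["step"] += 1
```

Here `state` would start as `{"step": 0}` so a Hessian diagonal is computed on the first call, and `loss_fn(model, batch)` stands for the user's forward pass.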
Related papers
- Bayesian Optimization for Non-Convex Two-Stage Stochastic Optimization Problems [2.9016548477524156]
We formulate a knowledge-gradient-based acquisition function to jointly optimize the first-stage variables, establish a guarantee of consistency, and provide an approximation. We demonstrate empirical results comparable to an alternative formulation, which alternates its focus between the two variable types.
arXiv Detail & Related papers (2024-08-30T16:26:31Z)
- Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning [6.383513606898132]
Second-order information is valuable for many applications but challenging to compute.
We introduce HesScale, an improvement over BL89, which adds negligible extra computation.
On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute.
arXiv Detail & Related papers (2024-06-05T13:53:20Z)
- One-Shot Safety Alignment for Large Language Models via Optimal Dualization [64.52223677468861]
This paper presents a perspective of dualization that reduces constrained alignment to an equivalent unconstrained alignment problem.
We do so by pre-optimizing a smooth and convex dual function that has a closed form.
Our strategy leads to two practical algorithms in model-based and preference-based settings.
arXiv Detail & Related papers (2024-05-29T22:12:52Z)
- Optimal Guarantees for Algorithmic Reproducibility and Gradient Complexity in Convex Optimization [55.115992622028685]
Previous work suggests that first-order methods would need to trade off convergence rate (gradient complexity) for better reproducibility.
We demonstrate that both optimal complexity and near-optimal convergence guarantees can be achieved for smooth convex minimization and smooth convex-concave minimax problems.
arXiv Detail & Related papers (2023-10-26T19:56:52Z)
- Gradient constrained sharpness-aware prompt learning for vision-language models [99.74832984957025]
This paper targets a novel trade-off problem in generalizable prompt learning for vision-language models (VLMs).
By analyzing the loss landscapes of the state-of-the-art method and vanilla Sharpness-aware Minimization (SAM) based method, we conclude that the trade-off performance correlates to both loss value and loss sharpness.
We propose a novel SAM-based method for prompt learning, denoted as Gradient Constrained Sharpness-aware Context Optimization (GCSCoOp).
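The summary contrasts GCSCoOp with vanilla SAM; below is a minimal sketch of that vanilla SAM baseline on a generic parameter vector (e.g., prompt/context vectors). The gradient constraint that distinguishes GCSCoOp is not reproduced because the summary does not specify it; `grad_fn`, `rho`, and `lr` are assumed placeholders.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.01, rho=0.05, eps=1e-12):
    """One vanilla SAM step: ascend to a nearby high-loss point, descend with its gradient."""
    g = grad_fn(w)
    w_adv = w + rho * g / (np.linalg.norm(g) + eps)  # worst-case perturbation within radius rho
    return w - lr * grad_fn(w_adv)                   # update with the perturbed-point gradient
```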
arXiv Detail & Related papers (2023-09-14T17:13:54Z)
- An Adaptive Incremental Gradient Method With Support for Non-Euclidean Norms [19.41328109094503]
We propose and analyze several novel adaptive variants of the popular SAGA algorithm.
We establish their convergence guarantees under general settings.
We improve the analysis of SAGA to support non-Euclidean norms.
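For reference, a bare-bones version of the underlying SAGA update (with a plain Euclidean step) is sketched below; the adaptive step sizes and non-Euclidean support described in the summary are not reproduced, and `grad_i` is an assumed per-component gradient oracle.

```python
import numpy as np

def saga(grad_i, x0, n, lr=0.01, iters=1000, rng=None):
    """grad_i(x, i) returns the gradient of the i-th component function at x."""
    rng = rng or np.random.default_rng(0)
    x = x0.copy()
    table = np.stack([grad_i(x, i) for i in range(n)])  # stored per-example gradients
    avg = table.mean(axis=0)
    for _ in range(iters):
        i = rng.integers(n)
        g_new = grad_i(x, i)
        x -= lr * (g_new - table[i] + avg)               # variance-reduced direction
        avg += (g_new - table[i]) / n                    # keep the running mean exact
        table[i] = g_new
    return x
```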
arXiv Detail & Related papers (2022-04-28T09:43:07Z)
- Outlier-Robust Sparse Estimation via Non-Convex Optimization [73.18654719887205]
We explore the connection between high-dimensional statistics and non-convex optimization in the presence of sparsity constraints.
We develop novel and simple optimization formulations for these problems.
As a corollary, we obtain that any first-order method that efficiently converges to stationary points yields an efficient algorithm for these tasks.
arXiv Detail & Related papers (2021-09-23T17:38:24Z)
- High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise [51.31435087414348]
It is essential to theoretically guarantee that algorithms provide small objective residual with high probability.
Existing methods for non-smooth convex optimization have complexity bounds with dependence on confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
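As a point of reference, the clipped-gradient step such methods build on looks roughly like the sketch below; the paper's actual stepsize rules are not given in the summary, so the constants `lr` and `clip_level` here are placeholders.

```python
import numpy as np

def clipped_sgd_step(w, g, lr=0.01, clip_level=1.0):
    """Scale the stochastic gradient down whenever its norm exceeds clip_level."""
    norm = np.linalg.norm(g)
    if norm > clip_level:
        g = g * (clip_level / norm)
    return w - lr * g
```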
arXiv Detail & Related papers (2021-06-10T17:54:21Z)
- Variance Regularization for Accelerating Stochastic Optimization [14.545770519120898]
We propose a universal principle which reduces the random error accumulation by exploiting statistical information hidden in mini-batch gradients.
This is achieved by regularizing the learning rate according to mini-batch variances.
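One simple way to realize "regularizing the learning rate according to mini-batch variances" is sketched below; the scaling rule and the constant `beta` are assumptions, not the paper's exact regularizer.

```python
import numpy as np

def variance_damped_step(w, per_sample_grads, lr=0.1, beta=1.0):
    """per_sample_grads: array of shape (batch, dim) holding individual gradients."""
    g_mean = per_sample_grads.mean(axis=0)
    g_var = per_sample_grads.var(axis=0).mean()   # scalar estimate of gradient noise
    effective_lr = lr / (1.0 + beta * g_var)      # shrink the step when noise is high
    return w - effective_lr * g_mean
```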
arXiv Detail & Related papers (2020-08-13T15:34:01Z)
- Effective Dimension Adaptive Sketching Methods for Faster Regularized Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching.
We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT).
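A generic sketch-and-solve baseline for L2-regularized least squares with a Gaussian embedding is shown below for orientation; the SRHT variant and the paper's effective-dimension-adaptive choice of sketch size `m` are not reproduced.

```python
import numpy as np

def sketched_ridge(A, b, lam, m, rng=None):
    """Approximately solve min_x ||Ax - b||^2 + lam * ||x||^2 via a Gaussian sketch."""
    rng = rng or np.random.default_rng(0)
    n, d = A.shape
    S = rng.standard_normal((m, n)) / np.sqrt(m)   # Gaussian embedding, m << n
    SA, Sb = S @ A, S @ b
    return np.linalg.solve(SA.T @ SA + lam * np.eye(d), SA.T @ Sb)
```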
arXiv Detail & Related papers (2020-06-10T15:00:09Z)
- The Strength of Nesterov's Extrapolation in the Individual Convergence of Nonsmooth Optimization [0.0]
We prove that Nesterov's extrapolation has the strength to make the individual convergence of gradient descent methods optimal for nonsmooth problems.
We give an extension of the derived algorithms to solve regularized learning tasks with nonsmooth losses in stochastic settings.
Our method is applicable as an efficient tool for solving large-scale $\ell_1$-regularized hinge-loss learning problems.
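For illustration, a plain subgradient method with Nesterov-style extrapolation on an $\ell_1$-regularized hinge loss might look like the sketch below; the step sizes that yield the paper's optimal individual convergence are not reproduced, and `lam`, `lr`, and `iters` are placeholders.

```python
import numpy as np

def nesterov_subgradient_hinge_l1(X, y, lam=0.01, lr=0.01, iters=500):
    """X: (n, d) features, y: (n,) labels in {-1, +1}."""
    n, d = X.shape
    w = np.zeros(d)
    w_prev = w.copy()
    for t in range(1, iters + 1):
        v = w + (t - 1) / (t + 2) * (w - w_prev)     # Nesterov extrapolation point
        margins = y * (X @ v)
        active = margins < 1.0                       # examples violating the margin
        subgrad = -(X[active].T @ y[active]) / n + lam * np.sign(v)
        w_prev, w = w, v - lr * subgrad              # subgradient step from the extrapolation
    return w
```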
arXiv Detail & Related papers (2020-06-08T03:35:41Z)