NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation
- URL: http://arxiv.org/abs/2506.08360v1
- Date: Tue, 10 Jun 2025 02:18:08 GMT
- Title: NysAct: A Scalable Preconditioned Gradient Descent using Nystrom Approximation
- Authors: Hyunseok Seung, Jaewoo Lee, Hyunsuk Ko
- Abstract summary: We introduce NysAct, a scalable first-order gradient preconditioning method. We show that NysAct achieves improved test accuracy compared to both first-order and second-order methods.
- Score: 7.512116180634991
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Adaptive gradient methods are computationally efficient and converge quickly, but they often suffer from poor generalization. In contrast, second-order methods enhance convergence and generalization but typically incur high computational and memory costs. In this work, we introduce NysAct, a scalable first-order gradient preconditioning method that strikes a balance between state-of-the-art first-order and second-order optimization methods. NysAct leverages an eigenvalue-shifted Nystrom method to approximate the activation covariance matrix, which is used as a preconditioning matrix, significantly reducing time and memory complexities with minimal impact on test accuracy. Our experiments show that NysAct not only achieves improved test accuracy compared to both first-order and second-order methods but also demands considerably less computational resources than existing second-order methods. Code is available at https://github.com/hseung88/nysact.
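As a rough illustration of the idea in the abstract, the sketch below builds a rank-m Nystrom approximation of an activation covariance matrix and applies the corresponding shifted inverse to a gradient via the Woodbury identity. The column-sampling scheme, the shift `rho`, and all function names are illustrative assumptions, not the released implementation (see the linked repository for the actual code).

```python
import numpy as np

def nystrom_inverse_apply(C, grad, m=32, rho=1e-3, rng=None):
    """Apply an approximate inverse of (C + rho*I) to `grad` using a
    rank-m Nystrom approximation of the d x d covariance matrix C.

    Woodbury identity: (U U^T + rho*I)^{-1}
        = (1/rho) * (I - U (rho*I_m + U^T U)^{-1} U^T),
    where C ~= U U^T is the Nystrom factorization.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = C.shape[0]
    idx = rng.choice(d, size=min(m, d), replace=False)    # sampled pivot columns
    C_S = C[:, idx]                                       # d x m
    W = C[np.ix_(idx, idx)]                               # m x m core block
    # Small eigenvalue shift on the core block for numerical stability.
    evals, evecs = np.linalg.eigh(W + 1e-8 * np.eye(len(idx)))
    W_inv_sqrt = evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-8))) @ evecs.T
    U = C_S @ W_inv_sqrt                                  # C ~= U @ U.T
    M = rho * np.eye(U.shape[1]) + U.T @ U
    return (grad - U @ np.linalg.solve(M, U.T @ grad)) / rho

# Toy usage: precondition a layer gradient with the activation covariance.
rng = np.random.default_rng(0)
acts = rng.standard_normal((256, 128))       # batch of layer inputs (n x d)
cov = acts.T @ acts / acts.shape[0]          # activation covariance (d x d)
grad = rng.standard_normal((128, 64))        # gradient of a d x k weight matrix
precond_grad = nystrom_inverse_apply(cov, grad, m=32, rho=1e-3, rng=rng)
print(precond_grad.shape)                    # (128, 64)
```

Because only an m x m core matrix is factorized, applying this preconditioner costs roughly O(dm^2 + dmk) rather than the O(d^3) needed to invert the full d x d covariance.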
Related papers
- Towards Practical Second-Order Optimizers in Deep Learning: Insights from Fisher Information Analysis [0.0]
We present AdaFisher, a novel adaptive second-order optimizer for deep neural networks (DNNs). AdaFisher aims to bridge the gap between the improved convergence and generalization of second-order methods and the computational efficiency needed for training. We demonstrate that AdaFisher outperforms state-of-the-art optimizers in both accuracy and convergence speed.
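For intuition only, the snippet below shows a generic diagonal empirical-Fisher preconditioner (an exponential moving average of squared gradients used to scale the update). It is a simplified stand-in for Fisher-based preconditioning in general; AdaFisher's actual Fisher approximation and update rule differ, and all names here are hypothetical.

```python
import numpy as np

def diag_fisher_step(param, grad, fisher_ema, lr=1e-2, beta=0.95, damping=1e-4):
    """One update preconditioned by a diagonal empirical-Fisher estimate.

    fisher_ema keeps an exponential moving average of squared gradients,
    a cheap diagonal stand-in for the Fisher information matrix.
    """
    fisher_ema = beta * fisher_ema + (1.0 - beta) * grad**2
    param = param - lr * grad / (fisher_ema + damping)
    return param, fisher_ema

# Single illustrative call with random values.
rng = np.random.default_rng(1)
w, g = rng.standard_normal(10), rng.standard_normal(10)
w, f = diag_fisher_step(w, g, fisher_ema=np.zeros(10))
```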
arXiv Detail & Related papers (2025-04-26T05:02:21Z) - FUSE: First-Order and Second-Order Unified SynthEsis in Stochastic Optimization [9.909119107223265]
First-order and second-order methods sit at very different points of the cost-versus-convergence trade-off. This paper presents a novel method that leverages both first-order and second-order methods in a unified algorithmic framework. FUSE-PV is a simple yet efficient optimization method involving a switch-over between first- and second-order updates.
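The switch-over pattern mentioned above can be sketched as follows; both the switch criterion (a fixed epoch) and the "second-order" phase (a damped diagonal-Newton step on a toy quadratic with a known Hessian) are placeholder assumptions, not the FUSE-PV rule.

```python
import numpy as np

# Toy problem: minimize 0.5 * x^T A x with a known diagonal Hessian.
A_diag = np.array([100.0, 10.0, 1.0])          # ill-conditioned quadratic
grad = lambda x: A_diag * x
hess_diag = lambda x: A_diag

def fuse_like_optimize(x, epochs=50, switch_epoch=25, lr=1e-3, damping=1e-2):
    """Cheap first-order (SGD) steps early, then switch to a damped
    diagonal-Newton step after `switch_epoch` -- the switch-over pattern."""
    for epoch in range(epochs):
        g = grad(x)
        if epoch < switch_epoch:
            x = x - lr * g                        # first-order phase
        else:
            x = x - g / (hess_diag(x) + damping)  # curvature-aware phase
    return x

x = fuse_like_optimize(np.array([1.0, 1.0, 1.0]))
print(x)   # near the optimum at the origin
```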
arXiv Detail & Related papers (2025-03-06T08:30:18Z) - Inverse-Free Fast Natural Gradient Descent Method for Deep Learning [52.0693420699086]
We present a fast natural gradient descent (FNGD) method that requires matrix inversion only during the first epoch.
FNGD's update resembles the average sum used in first-order methods, so the computational complexity of FNGD is comparable to that of first-order methods.
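A hedged sketch of the "invert only once" idea highlighted in the summary: second-moment statistics are accumulated during the first epoch, inverted a single time, and the fixed inverse is reused afterwards so later iterations cost roughly as much as a first-order step. The class below is illustrative and simplifies away FNGD's actual fixed-coefficient weighted-sum formulation.

```python
import numpy as np

class OneShotPreconditioner:
    """Accumulate second-moment statistics during epoch 0, invert once,
    then reuse the fixed inverse for every later update."""

    def __init__(self, dim, damping=1e-2):
        self.stats = np.zeros((dim, dim))
        self.count = 0
        self.P_inv = None
        self.damping = damping

    def observe(self, grad):
        # Called only during the first epoch.
        self.stats += np.outer(grad, grad)
        self.count += 1

    def finalize(self):
        # One matrix inversion for the whole of training.
        cov = self.stats / max(self.count, 1)
        self.P_inv = np.linalg.inv(cov + self.damping * np.eye(cov.shape[0]))

    def precondition(self, grad):
        # After epoch 0: a single matrix-vector product, no inversion.
        return grad if self.P_inv is None else self.P_inv @ grad

# Toy usage.
pre = OneShotPreconditioner(dim=8)
for g in np.random.default_rng(2).standard_normal((100, 8)):  # "epoch 0" gradients
    pre.observe(g)
pre.finalize()
step = pre.precondition(np.ones(8))   # later epochs reuse the fixed inverse
```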
arXiv Detail & Related papers (2024-03-06T05:13:28Z) - A Computationally Efficient Sparsified Online Newton Method [48.78646010774149]
Sparsified Online Newton (SONew) is a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner.
We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss.
arXiv Detail & Related papers (2023-11-16T18:44:22Z) - Rethinking SIGN Training: Provable Nonconvex Acceleration without First-
and Second-Order Gradient Lipschitz [66.22095739795068]
Sign-based methods have gained attention due to their ability to achieve robust performance despite using only the sign information for parameter updates.
The current convergence analysis of sign-based methods relies on strong first- and second-order gradient Lipschitz assumptions.
In this paper we analyze their convergence under more realistic assumptions that relax these Lipschitz requirements.
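For reference, the sign-based update the summary refers to is very simple; here is a minimal signSGD sketch (momentum and error feedback omitted):

```python
import numpy as np

def sign_sgd_step(param, grad, lr=1e-2):
    """signSGD update: move each coordinate by a fixed amount in the
    direction opposite to the sign of its gradient."""
    return param - lr * np.sign(grad)

# Toy usage on a separable quadratic.
w = np.array([3.0, -2.0, 0.5])
for _ in range(100):
    w = sign_sgd_step(w, grad=2.0 * w, lr=0.05)
print(w)   # each coordinate ends up oscillating within +/- lr of zero
```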
arXiv Detail & Related papers (2023-10-23T06:48:43Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - ELRA: Exponential learning rate adaption gradient descent optimization
method [83.88591755871734]
We present a novel, fast (exponential rate), ab initio (hyper-parameter-free) gradient-based adaptation method.
The main idea of the method is to adapt the learning rate $\alpha$ by situational awareness.
It can be applied to problems of any dimension $n$ and scales only linearly.
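The summary's "situational awareness" is abstract, so the sketch below substitutes a common multiplicative heuristic: grow the step size while consecutive gradients agree in direction, and shrink it when they disagree. It illustrates exponential learning-rate adaptation in general and is not ELRA's specific rule; all names are hypothetical.

```python
import numpy as np

def adaptive_lr_descent(grad_fn, x, lr=1e-3, grow=1.1, shrink=0.5, steps=200):
    """Gradient descent whose step size alpha is adapted multiplicatively:
    if the new gradient points roughly the same way as the previous one,
    increase alpha (exponentially fast); otherwise cut it back."""
    g_prev = grad_fn(x)
    for _ in range(steps):
        x = x - lr * g_prev
        g = grad_fn(x)
        cos = g @ g_prev / (np.linalg.norm(g) * np.linalg.norm(g_prev) + 1e-12)
        lr = lr * (grow if cos > 0 else shrink)
        g_prev = g
    return x, lr

# Toy usage: an ill-conditioned quadratic.
A = np.diag([50.0, 1.0])
x_final, final_lr = adaptive_lr_descent(lambda x: A @ x, np.array([1.0, 1.0]))
print(x_final, final_lr)
```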
arXiv Detail & Related papers (2023-09-12T14:36:13Z) - Nystrom Method for Accurate and Scalable Implicit Differentiation [25.29277451838466]
We show that the Nystrom method consistently achieves performance comparable or even superior to other approaches.
The proposed method avoids numerical instability and can be computed efficiently using matrix operations, without iterative solvers.
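Implicit differentiation (e.g., for hypergradients) requires an inverse-Hessian-vector product, and the summary's point is that a Nystrom approximation delivers it without iterative solvers. The sketch below reuses the same Nystrom-plus-Woodbury trick as the sketch near the top of this page, under the assumptions that the Hessian is symmetric PSD and that individual columns can be evaluated; the column sampling and damping `rho` are illustrative choices.

```python
import numpy as np

def nystrom_ihvp(hess_col_fn, dim, v, rank=20, rho=1e-3, rng=None):
    """Approximate (H + rho*I)^{-1} v from `rank` sampled Hessian columns.

    hess_col_fn(j) must return column j of the (symmetric PSD) Hessian.
    Uses the Nystrom factorization H ~= U U^T plus the Woodbury identity,
    so no iterative linear solve is required.
    """
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(dim, size=min(rank, dim), replace=False)
    C = np.stack([hess_col_fn(j) for j in idx], axis=1)     # dim x rank
    W = C[idx, :]                                           # rank x rank core
    evals, evecs = np.linalg.eigh(W + 1e-8 * np.eye(len(idx)))
    U = C @ evecs @ np.diag(1.0 / np.sqrt(np.maximum(evals, 1e-8))) @ evecs.T
    M = rho * np.eye(U.shape[1]) + U.T @ U
    return (v - U @ np.linalg.solve(M, U.T @ v)) / rho

# Toy check against an exact solve on a random low-rank PSD "Hessian".
rng = np.random.default_rng(3)
B = rng.standard_normal((50, 10))
H = B @ B.T                                   # rank-10 PSD matrix
v = rng.standard_normal(50)
approx = nystrom_ihvp(lambda j: H[:, j], 50, v, rank=20, rho=1e-3, rng=rng)
exact = np.linalg.solve(H + 1e-3 * np.eye(50), v)
print(np.linalg.norm(approx - exact) / np.linalg.norm(exact))   # small relative error
```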
arXiv Detail & Related papers (2023-02-20T02:37:26Z) - Explicit Second-Order Min-Max Optimization Methods with Optimal Convergence Guarantee [86.05440220344755]
We propose and analyze inexact regularized Newton-type methods for finding a global saddle point of convex-concave unconstrained min-max optimization problems.
We show that the proposed methods generate iterates that remain within a bounded set and converge to an $\epsilon$-saddle point within $O(\epsilon^{-2/3})$ iterations in terms of a restricted gap function.
arXiv Detail & Related papers (2022-10-23T21:24:37Z) - Stochastic Gradient Methods with Preconditioned Updates [47.23741709751474]
There are several algorithms for such problems, but existing methods often work poorly when badly scaled and/or ill-conditioned.
Here we include a preconditioner based on Hutchinson's approach for approximating the diagonal of the Hessian.
We prove convergence both when only smoothness is assumed and when the PL condition additionally holds.
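Hutchinson's estimator recovers the Hessian diagonal from Hessian-vector products alone: for Rademacher vectors z, E[z * (Hz)] = diag(H). Below is a small self-contained sketch; an explicit matrix stands in for the Hessian-vector product, which in practice would come from automatic differentiation.

```python
import numpy as np

def hutchinson_diag(hvp, dim, num_samples=200, rng=None):
    """Estimate the diagonal of a Hessian using only Hessian-vector products.

    For Rademacher vectors z (entries +/-1), E[z * (H z)] = diag(H), so an
    average over a few samples gives a cheap diagonal preconditioner.
    """
    rng = np.random.default_rng() if rng is None else rng
    est = np.zeros(dim)
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=dim)
        est += z * hvp(z)
    return est / num_samples

# Toy check with an explicit symmetric matrix standing in for the Hessian.
rng = np.random.default_rng(4)
A = rng.standard_normal((30, 30))
H = (A + A.T) / 2.0
d_est = hutchinson_diag(lambda v: H @ v, dim=30, num_samples=500, rng=rng)
print(np.max(np.abs(d_est - np.diag(H))))   # estimation error shrinks as samples grow
```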
arXiv Detail & Related papers (2022-06-01T07:38:08Z) - An Accelerated Variance-Reduced Conditional Gradient Sliding Algorithm
for First-order and Zeroth-order Optimization [111.24899593052851]
Conditional gradient algorithm (also known as the Frank-Wolfe algorithm) has recently regained popularity in the machine learning community.
ARCS is the first zeroth-order conditional gradient sliding-type algorithm for solving convex problems in zeroth-order optimization.
In first-order optimization, the convergence results of ARCS substantially outperform previous algorithms in terms of the number of gradient oracle queries.
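For context, the conditional gradient (Frank-Wolfe) method replaces projections with a linear minimization oracle over the feasible set. The basic loop over the probability simplex is sketched below; the paper's accelerated, variance-reduced, sliding, and zeroth-order variants build on this template.

```python
import numpy as np

def frank_wolfe_simplex(grad_fn, dim, iters=200):
    """Frank-Wolfe over the probability simplex.

    The linear minimization oracle over the simplex is just the vertex
    (standard basis vector) with the smallest gradient coordinate.
    """
    x = np.full(dim, 1.0 / dim)                 # start at the simplex center
    for t in range(iters):
        g = grad_fn(x)
        s = np.zeros(dim)
        s[np.argmin(g)] = 1.0                   # LMO: best vertex
        gamma = 2.0 / (t + 2.0)                 # standard step-size schedule
        x = (1.0 - gamma) * x + gamma * s
    return x

# Toy usage: minimize ||x - target||^2 over the simplex.
target = np.array([0.7, 0.2, 0.1, 0.0])
x_star = frank_wolfe_simplex(lambda x: 2.0 * (x - target), dim=4)
print(x_star)   # approaches the target, which already lies on the simplex
```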
arXiv Detail & Related papers (2021-09-18T07:08:11Z) - Second-order Neural Network Training Using Complex-step Directional
Derivative [41.4333906662624]
We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference.
We believe our method will inspire a wide range of new algorithms for deep learning and numerical optimization.
arXiv Detail & Related papers (2020-09-15T13:46:57Z)
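The complex-step finite difference used in the entry above evaluates f at x + ih and reads the derivative off the imaginary part, which avoids the subtractive cancellation of ordinary finite differences; the paper applies the same idea to obtain curvature information. A minimal first-derivative sketch:

```python
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    """f'(x) ~= Im(f(x + i*h)) / h, accurate to near machine precision
    because no subtraction of nearly equal numbers is involved."""
    return np.imag(f(x + 1j * h)) / h

# Toy check: the function must be implemented with complex-safe operations.
f = lambda x: np.exp(x) * np.sin(x)
x0 = 0.7
exact = np.exp(x0) * (np.sin(x0) + np.cos(x0))
print(complex_step_derivative(f, x0), exact)   # agree to roughly machine precision
```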