HesScale: Scalable Computation of Hessian Diagonals
- URL: http://arxiv.org/abs/2210.11639v1
- Date: Thu, 20 Oct 2022 23:50:56 GMT
- Title: HesScale: Scalable Computation of Hessian Diagonals
- Authors: Mohamed Elsayed, A. Rupam Mahmood
- Abstract summary: HesScale is a scalable approach to approximating the diagonal of the Hessian matrix.
We show that HesScale has the same computational complexity as backpropagation.
- Score: 2.398608007786179
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Second-order optimization uses curvature information about the objective
function, which can help in faster convergence. However, such methods typically
require expensive computation of the Hessian matrix, preventing their usage in
a scalable way. The absence of efficient ways of computation drove the most
widely used methods to focus on first-order approximations that do not capture
the curvature information. In this paper, we develop HesScale, a scalable
approach to approximating the diagonal of the Hessian matrix, to incorporate
second-order information in a computationally efficient manner. We show that
HesScale has the same computational complexity as backpropagation. Our results
on supervised classification show that HesScale achieves high approximation
accuracy, allowing for scalable and efficient second-order optimization.
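To make the idea concrete, below is a minimal NumPy sketch of a BL89/HesScale-style diagonal recursion for a two-layer ReLU network with softmax cross-entropy loss: the exact diagonal Hessian of the loss with respect to the output logits is propagated backward with the off-diagonal terms dropped, at roughly the cost of one extra backward pass. The two-layer setup and function names are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a HesScale/BL89-style diagonal Hessian recursion for a
# small two-layer ReLU MLP with softmax cross-entropy. Illustrative only; the
# actual HesScale update rules may differ in detail.
import numpy as np

def forward(params, x):
    W1, b1, W2, b2 = params
    a1 = W1 @ x + b1                          # hidden pre-activations
    z1 = np.maximum(a1, 0.0)                  # ReLU activations
    a2 = W2 @ z1 + b2                         # output logits
    p = np.exp(a2 - a2.max()); p /= p.sum()   # softmax probabilities
    return a1, z1, a2, p

def hesscale_like_diagonals(params, x, y_onehot):
    """Return diagonal Hessian estimates for W1 and W2 in one backward pass."""
    W1, b1, W2, b2 = params
    a1, z1, a2, p = forward(params, x)

    # Output layer: softmax cross-entropy has exact diagonal Hessian p * (1 - p).
    g2 = p - y_onehot            # dL/da2
    h2 = p * (1.0 - p)           # d2L/da2^2 (exact diagonal)

    # Backpropagate second-order info with off-diagonal terms dropped.
    g_z1 = W2.T @ g2             # would multiply sigma''(a1); zero for ReLU
    h_z1 = (W2 ** 2).T @ h2      # d2L/dz1^2 under the diagonal approximation
    relu_grad = (a1 > 0).astype(float)
    h1 = (relu_grad ** 2) * h_z1

    # Diagonal Hessian w.r.t. weights: squared inputs times pre-activation curvature.
    H_W2 = np.outer(h2, z1 ** 2)
    H_W1 = np.outer(h1, x ** 2)
    return H_W1, H_W2
```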
Related papers
- Improving Adaptive Moment Optimization via Preconditioner Diagonalization [11.01832755213396]
We show that our approach can substantially enhance the convergence speed of modern adaptive optimizers.
For large language models like LLaMA, we can achieve a speedup of 2x compared to the baseline Adam.
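For reference, the "diagonal preconditioner" in adaptive moment methods is the second-moment estimate of standard Adam; the sketch below shows that baseline update with the preconditioner made explicit. It is not the diagonalization method proposed in the paper.

```python
# Standard Adam step with the diagonal preconditioner made explicit; this is
# the baseline referenced above, not the paper's preconditioner-diagonalization
# method.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first moment (momentum)
    v = b2 * v + (1 - b2) * grad ** 2     # second moment = diagonal preconditioner
    m_hat = m / (1 - b1 ** t)             # bias corrections
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```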
arXiv Detail & Related papers (2025-02-11T11:48:04Z) - Approximating Metric Magnitude of Point Sets [4.522729058300309]
Metric magnitude is a measure of the "size" of point clouds with many desirable geometric properties.
It has been adapted to various mathematical contexts and recent work suggests that it can enhance machine learning and optimization algorithms.
In this paper, we study the problem of computing magnitude and show efficient ways of approximating it. We show that it can be cast as a convex optimization problem, but not as a submodular optimization problem.
The paper describes two new algorithms - an iterative approximation algorithm that converges fast and is accurate, and a subset selection method that makes the computation even faster.
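As background, the magnitude of a finite point set is defined through the similarity matrix Z with entries exp(-d(x_i, x_j)): solve Z w = 1 and sum the weights. The sketch below shows this naive O(n^3) baseline that the paper's algorithms approximate; the function name and scale parameter are illustrative.

```python
# Minimal sketch of the standard definition of metric magnitude for a finite
# point set: solve Z w = 1 with Z_ij = exp(-d(x_i, x_j)) and sum the weights.
# This direct solve is the naive baseline, not the paper's approximation.
import numpy as np
from scipy.spatial.distance import cdist

def magnitude(points, scale=1.0):
    D = cdist(points, points)                      # pairwise Euclidean distances
    Z = np.exp(-scale * D)                         # similarity matrix
    w = np.linalg.solve(Z, np.ones(len(points)))   # weighting vector
    return w.sum()                                 # magnitude of the scaled point set

print(magnitude(np.random.rand(100, 3)))
```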
arXiv Detail & Related papers (2024-09-06T17:15:28Z) - Revisiting Scalable Hessian Diagonal Approximations for Applications in Reinforcement Learning [6.383513606898132]
Second-order information is valuable for many applications but challenging to compute.
We introduce HesScale, an improvement over BL89, which adds negligible extra computation.
On small networks, we find that this improvement is of higher quality than all alternatives, even those with theoretical guarantees, such as unbiasedness, while being much cheaper to compute.
arXiv Detail & Related papers (2024-06-05T13:53:20Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) methods and adaptive regularization using cubics (ARC) have proven to have very appealing theoretical properties.
We show that TR and ARC methods can simultaneously allow inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Learning the Positions in CountSketch [49.57951567374372]
We consider sketching algorithms which first compress data by multiplication with a random sketch matrix, and then apply the sketch to quickly solve an optimization problem.
In this work, we propose the first learning-based algorithms that also optimize the locations of the non-zero entries.
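For background, a classical CountSketch matrix places a single random ±1 entry in each column; the paper learns where those non-zero entries go. The sketch below shows the classical random construction and a sketch-and-solve least-squares use as an illustrative baseline, not the learned variant.

```python
# Minimal sketch of a classical CountSketch matrix with random positions and
# signs; the paper above learns the positions of the non-zero entries instead
# of sampling them uniformly.
import numpy as np

def countsketch(n, m, rng=np.random.default_rng(0)):
    rows = rng.integers(0, m, size=n)         # one nonzero row per column
    signs = rng.choice([-1.0, 1.0], size=n)   # random +/-1 sign per column
    S = np.zeros((m, n))
    S[rows, np.arange(n)] = signs
    return S

# Sketch-and-solve least squares: minimize ||S(Ax - b)|| instead of ||Ax - b||.
A = np.random.randn(10000, 50)
b = np.random.randn(10000)
S = countsketch(A.shape[0], 500)
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
```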
arXiv Detail & Related papers (2023-06-11T07:28:35Z) - Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods [75.34939761152587]
Efficient computation of the optimal transport distance between two distributions is an algorithmic subroutine that empowers various applications.
This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy.
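As a point of reference, the entropy-regularized problem this entry targets is commonly solved with Sinkhorn iterations; the sketch below shows that classical baseline, not the extragradient method developed in the paper.

```python
# Classical Sinkhorn iterations for entropy-regularized optimal transport,
# shown only to illustrate the problem being solved; the paper above uses an
# extragradient method instead.
import numpy as np

def sinkhorn(mu, nu, C, eps=0.05, n_iters=500):
    K = np.exp(-C / eps)             # Gibbs kernel from cost matrix C
    u = np.ones_like(mu)
    for _ in range(n_iters):
        v = nu / (K.T @ u)           # alternate scaling updates
        u = mu / (K @ v)
    P = u[:, None] * K * v[None, :]  # transport plan with marginals mu, nu
    return (P * C).sum()             # regularized OT cost estimate
```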
arXiv Detail & Related papers (2023-01-30T15:46:39Z) - Doubly Adaptive Scaled Algorithm for Machine Learning Using Second-Order
Information [37.70729542263343]
We present a novel adaptive optimization algorithm for large-scale machine learning problems.
Our method dynamically adapts the direction and step-size.
Our methodology does not require tedious learning-rate tuning.
arXiv Detail & Related papers (2021-09-11T06:39:50Z) - Dual Optimization for Kolmogorov Model Learning Using Enhanced Gradient
Descent [8.714458129632158]
The Kolmogorov model (KM) is an interpretable and predictable representation approach to learning the underlying probabilistic structure of a set of random variables.
We propose a computationally scalable KM learning algorithm, based on the regularized dual optimization combined with enhanced gradient descent (GD) method.
It is shown that the accuracy of logical relation mining for interpretability using the proposed KM learning algorithm exceeds 80%.
arXiv Detail & Related papers (2021-07-11T10:33:02Z) - Second-order Neural Network Training Using Complex-step Directional
Derivative [41.4333906662624]
We introduce a numerical algorithm for second-order neural network training.
We tackle the practical obstacle of Hessian calculation by using the complex-step finite difference.
We believe our method will inspire a wide-range of new algorithms for deep learning and numerical optimization.
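The numerical primitive behind this entry is the complex-step derivative, which avoids the subtractive cancellation of ordinary finite differences: f'(x) ≈ Im f(x + ih) / h. A minimal sketch of the first-order version is below; the paper combines this idea with network training to obtain second-order information.

```python
# Complex-step derivative: accurate to near machine precision because no
# subtraction of nearly equal quantities occurs. First-order illustration only.
import numpy as np

def complex_step_derivative(f, x, h=1e-20):
    return np.imag(f(x + 1j * h)) / h

# Example: derivative of exp(sin(x)) at x = 1.0
f = lambda x: np.exp(np.sin(x))
print(complex_step_derivative(f, 1.0))      # complex-step estimate
print(np.cos(1.0) * np.exp(np.sin(1.0)))    # analytic value for comparison
```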
arXiv Detail & Related papers (2020-09-15T13:46:57Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far they could hardly be used in large-scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - Effective Dimension Adaptive Sketching Methods for Faster Regularized
Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching.
We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT).
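For context, a basic sketch-and-solve baseline for L2-regularized least squares with a Gaussian embedding looks like the sketch below; the paper's contribution is choosing the sketch size adaptively via the effective dimension and analyzing SRHT as well. The function and parameter names are illustrative.

```python
# Sketch-and-solve baseline for ridge regression with a Gaussian embedding;
# illustrative only, not the adaptive method of the paper above.
import numpy as np

def sketched_ridge(A, b, lam, sketch_size, rng=np.random.default_rng(0)):
    S = rng.standard_normal((sketch_size, A.shape[0])) / np.sqrt(sketch_size)
    SA, Sb = S @ A, S @ b
    d = A.shape[1]
    # Solve (SA^T SA + lam I) x = SA^T Sb on the compressed problem.
    return np.linalg.solve(SA.T @ SA + lam * np.eye(d), SA.T @ Sb)

A = np.random.randn(20000, 100)
b = np.random.randn(20000)
x = sketched_ridge(A, b, lam=1.0, sketch_size=1000)
```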
arXiv Detail & Related papers (2020-06-10T15:00:09Z) - Distributed Averaging Methods for Randomized Second Order Optimization [54.51566432934556]
We consider distributed optimization problems where forming the Hessian is computationally challenging and communication is a bottleneck.
We develop unbiased parameter averaging methods for randomized second order optimization that employ sampling and sketching of the Hessian.
We also extend the framework of second order averaging methods to introduce an unbiased distributed optimization framework for heterogeneous computing systems.
arXiv Detail & Related papers (2020-02-16T09:01:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.