Efficient Matrix-Free Approximations of Second-Order Information, with
Applications to Pruning and Optimization
- URL: http://arxiv.org/abs/2107.03356v3
- Date: Fri, 9 Jul 2021 09:38:58 GMT
- Title: Efficient Matrix-Free Approximations of Second-Order Information, with
Applications to Pruning and Optimization
- Authors: Elias Frantar, Eldar Kurtic, Dan Alistarh
- Abstract summary: We investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs)
These algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.
- Score: 16.96639526117016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently approximating local curvature information of the loss function is
a key tool for optimization and compression of deep neural networks. Yet, most
existing methods to approximate second-order information have high
computational or storage costs, which can limit their practicality. In this
work, we investigate matrix-free, linear-time approaches for estimating
Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be
approximated as a sum of rank-one matrices, as in the classic approximation of
the Hessian by the empirical Fisher matrix. We propose two new algorithms as
part of a framework called M-FAC: the first algorithm is tailored towards
network compression and can compute the IHVP for dimension $d$, if the Hessian
is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation,
$O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single
element of the inverse Hessian. The second algorithm targets an optimization
setting, where we wish to compute the product between the inverse Hessian,
estimated over a sliding window of optimization steps, and a given gradient
direction, as required for preconditioned SGD. We give an algorithm with cost
$O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing
any gradient from the sliding window. These two algorithms yield
state-of-the-art results for network pruning and optimization with lower
computational overhead relative to existing second-order methods.
Implementations are available at [10] and [18].
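To make the first (pruning-oriented) algorithm concrete, below is a minimal NumPy sketch of the underlying idea, assuming the Hessian is approximated by the dampened empirical Fisher $F = \lambda I + \frac{1}{m}\sum_{i=1}^{m} g_i g_i^T$ and applying Sherman-Morrison rank-one updates one gradient at a time. The function names, the dampening value, and the dense array layout are illustrative assumptions, not the authors' M-FAC implementation.

```python
import numpy as np

def mfac_precompute(G, lam=1e-4):
    """Sherman-Morrison precomputation for the dampened empirical Fisher.
    G is an (m, d) array of gradients; F_i = lam*I + (1/m) * sum_{j<i} g_j g_j^T.
    Returns U with U[i] = F_i^{-1} g_i and denom[i] = m + g_i^T U[i].
    Cost: O(d m^2)."""
    m, _ = G.shape
    U = np.zeros_like(G, dtype=float)
    denom = np.zeros(m)
    for i in range(m):
        u = G[i] / lam                       # F_0^{-1} g_i, since F_0 = lam * I
        for j in range(i):                   # fold in the first i rank-one updates
            u -= U[j] * (G[j] @ u) / denom[j]
        U[i] = u
        denom[i] = m + G[i] @ u              # m + g_i^T F_i^{-1} g_i
    return U, denom

def mfac_ihvp(v, G, U, denom, lam=1e-4):
    """Matrix-free IHVP F_m^{-1} v in O(d m); F is never formed explicitly."""
    p = v / lam
    for i in range(len(denom)):
        p -= U[i] * (G[i] @ p) / denom[i]
    return p
```

Usage would look like `U, denom = mfac_precompute(G)` followed by repeated `mfac_ihvp(grad, G, U, denom)` calls. Unrolling the same recursion gives $F_m^{-1} = \frac{1}{\lambda} I - \sum_{i=1}^{m} \frac{u_i u_i^T}{m + g_i^T u_i}$, so any single entry of the inverse Hessian can be read off from the precomputed $u_i$ in $O(m)$, matching the per-element query cost quoted above. The second, dynamic algorithm for preconditioned SGD achieves the sliding-window updates with a different bookkeeping scheme that is not shown here.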
Related papers
- Obtaining Lower Query Complexities through Lightweight Zeroth-Order Proximal Gradient Algorithms [65.42376001308064]
We propose two variance-reduced zeroth-order (ZO) estimators for complex gradient problems.
We improve the state-of-the-art function complexities from $\mathcal{O}\left(\min\left\{\frac{d n^{1/2}}{\epsilon^2}, \frac{d}{\epsilon^3}\right\}\right)$ to $\tilde{\mathcal{O}}\left(\frac{d}{\epsilon^2}\right)$.
arXiv Detail & Related papers (2024-10-03T15:04:01Z) - AdaSub: Stochastic Optimization Using Second-Order Information in
Low-Dimensional Subspaces [0.0]
We introduce AdaSub, a search algorithm that computes a search direction based on second-order information in a low-dimensional subspace.
Compared to first-order methods, second-order methods exhibit better convergence characteristics, but the need to compute the Hessian matrix at each iteration results in excessive computational expenses.
Our preliminary numerical results demonstrate that AdaSub surpasses popular optimizers in terms of time and number of iterations required to reach a given accuracy.
arXiv Detail & Related papers (2023-10-30T22:24:23Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization with cubics (ARC) methods have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Representing Additive Gaussian Processes by Sparse Matrices [18.618437338490487]
Additive Matérn Gaussian Processes (GPs) are among the most popular models for scalable high-dimensional problems.
Back-fitting algorithms can reduce the time complexity of computing the posterior mean from $O(n^3)$ to $O(n \log n)$.
Generalizing these algorithms to efficiently compute the posterior variance and maximum log-likelihood remains an open problem.
arXiv Detail & Related papers (2023-04-29T18:53:42Z) - Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods [75.34939761152587]
Efficient computation of the optimal transport distance between two distributions serves as an algorithmic subroutine that empowers various applications.
This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy.
arXiv Detail & Related papers (2023-01-30T15:46:39Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Improved Rate of First Order Algorithms for Entropic Optimal Transport [2.1485350418225244]
This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport.
We propose an accelerated primal-dual mirror descent algorithm with variance reduction.
Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{O}(n^2/\epsilon)$ for solving OT.
arXiv Detail & Related papers (2023-01-23T19:13:25Z) - Alternating Mahalanobis Distance Minimization for Stable and Accurate CP
Decomposition [4.847980206213335]
We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from what is used in the previous work.
We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank.
We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors.
arXiv Detail & Related papers (2022-04-14T19:56:36Z) - 2nd-order Updates with 1st-order Complexity [0.0]
It has long been a goal to efficiently compute and use second-order information on a function ($f$) to assist in numerical approximations.
Here it is shown how, using only basic physics and a numerical approximation, such information can be accurately obtained at a cost of $\mathcal{O}(N)$ complexity.
arXiv Detail & Related papers (2021-05-24T17:47:51Z) - Hutch++: Optimal Stochastic Trace Estimation [75.45968495410048]
We introduce a new randomized algorithm, Hutch++, which computes a $(1 \pm \epsilon)$ approximation to $\mathrm{tr}(A)$ for any positive semidefinite (PSD) $A$.
We show that it significantly outperforms Hutchinson's method in experiments; a short sketch of the estimator is given after this list.
arXiv Detail & Related papers (2020-10-19T16:45:37Z) - Private Stochastic Convex Optimization: Optimal Rates in Linear Time [74.47681868973598]
We study the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions.
A recent work of Bassily et al. has established the optimal bound on the excess population loss achievable given $n$ samples.
We describe two new techniques for deriving convex optimization algorithms both achieving the optimal bound on excess loss and using $O(\min\{n, n^2/d\})$ gradient computations.
arXiv Detail & Related papers (2020-05-10T19:52:03Z)
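As a point of reference for the Hutch++ entry above, here is a compact sketch of that trace estimator, assuming the PSD matrix is only reachable through a black-box matrix-vector product and splitting the query budget into thirds as in the standard presentation; the function name, the default budget, and the `matvec` interface are illustrative assumptions.

```python
import numpy as np

def hutchpp(matvec, n, num_queries=60, seed=0):
    """Hutch++ estimate of tr(A) for a PSD matrix A accessed only through
    `matvec(X)`, which returns A @ X for an (n, k) block of vectors."""
    rng = np.random.default_rng(seed)
    k = num_queries // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))   # Rademacher sketch vectors
    G = rng.choice([-1.0, 1.0], size=(n, k))   # fresh vectors for the residual
    Q, _ = np.linalg.qr(matvec(S))             # orthonormal basis for range(A S)
    G_perp = G - Q @ (Q.T @ G)                 # project G off the captured range
    # Exact trace on the low-rank part + Hutchinson estimate of the remainder.
    return np.trace(Q.T @ matvec(Q)) + np.trace(G_perp.T @ matvec(G_perp)) / k

# Example: for a PSD matrix A, hutchpp(lambda X: A @ X, A.shape[0]) approximates np.trace(A).
```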