Efficient Matrix-Free Approximations of Second-Order Information, with
Applications to Pruning and Optimization
- URL: http://arxiv.org/abs/2107.03356v3
- Date: Fri, 9 Jul 2021 09:38:58 GMT
- Title: Efficient Matrix-Free Approximations of Second-Order Information, with
Applications to Pruning and Optimization
- Authors: Elias Frantar, Eldar Kurtic, Dan Alistarh
- Abstract summary: We investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs)
These algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.
- Score: 16.96639526117016
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently approximating local curvature information of the loss function is
a key tool for optimization and compression of deep neural networks. Yet, most
existing methods to approximate second-order information have high
computational or storage costs, which can limit their practicality. In this
work, we investigate matrix-free, linear-time approaches for estimating
Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be
approximated as a sum of rank-one matrices, as in the classic approximation of
the Hessian by the empirical Fisher matrix. We propose two new algorithms as
part of a framework called M-FAC: the first algorithm is tailored towards
network compression and can compute the IHVP for dimension $d$, if the Hessian
is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation,
$O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single
element of the inverse Hessian. The second algorithm targets an optimization
setting, where we wish to compute the product between the inverse Hessian,
estimated over a sliding window of optimization steps, and a given gradient
direction, as required for preconditioned SGD. We give an algorithm with cost
$O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing
any gradient from the sliding window. These two algorithms yield
state-of-the-art results for network pruning and optimization with lower
computational overhead relative to existing second-order methods.
Implementations are available at [10] and [18].
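To make the first (pruning-oriented) algorithm concrete, below is a minimal NumPy sketch of the underlying idea, assuming the Hessian is approximated by the dampened empirical Fisher $F = \lambda I + \frac{1}{m}\sum_{i=1}^{m} g_i g_i^T$ and applying Sherman-Morrison rank-one updates one gradient at a time. The function names, the dampening value, and the dense array layout are illustrative assumptions, not the authors' M-FAC implementation.

```python
import numpy as np

def mfac_precompute(G, lam=1e-4):
    """Sherman-Morrison precomputation for the dampened empirical Fisher.
    G is an (m, d) array of gradients; F_i = lam*I + (1/m) * sum_{j<i} g_j g_j^T.
    Returns U with U[i] = F_i^{-1} g_i and denom[i] = m + g_i^T U[i].
    Cost: O(d m^2)."""
    m, _ = G.shape
    U = np.zeros_like(G, dtype=float)
    denom = np.zeros(m)
    for i in range(m):
        u = G[i] / lam                       # F_0^{-1} g_i, since F_0 = lam * I
        for j in range(i):                   # fold in the first i rank-one updates
            u -= U[j] * (G[j] @ u) / denom[j]
        U[i] = u
        denom[i] = m + G[i] @ u              # m + g_i^T F_i^{-1} g_i
    return U, denom

def mfac_ihvp(v, G, U, denom, lam=1e-4):
    """Matrix-free IHVP F_m^{-1} v in O(d m); F is never formed explicitly."""
    p = v / lam
    for i in range(len(denom)):
        p -= U[i] * (G[i] @ p) / denom[i]
    return p
```

Usage would look like `U, denom = mfac_precompute(G)` followed by repeated `mfac_ihvp(grad, G, U, denom)` calls. Unrolling the same recursion gives $F_m^{-1} = \frac{1}{\lambda} I - \sum_{i=1}^{m} \frac{u_i u_i^T}{m + g_i^T u_i}$, so any single entry of the inverse Hessian can be read off from the precomputed $u_i$ in $O(m)$, matching the per-element query cost quoted above. The second, dynamic algorithm for preconditioned SGD achieves the sliding-window updates with a different bookkeeping scheme that is not shown here.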
Related papers
- Obtaining Lower Query Complexities through Lightweight Zeroth-Order Proximal Gradient Algorithms [65.42376001308064]
We propose two variance-reduced zeroth-order (ZO) estimators for complex gradient problems.
We improve the state-of-the-art function complexities from $\mathcal{O}\left(\min\left\{\frac{d n^{1/2}}{\epsilon^2}, \frac{d}{\epsilon^3}\right\}\right)$ to $\tilde{\mathcal{O}}\left(\frac{d}{\epsilon^2}\right)$.
arXiv Detail & Related papers (2024-10-03T15:04:01Z) - AdaSub: Stochastic Optimization Using Second-Order Information in
Low-Dimensional Subspaces [0.0]
We introduce AdaSub, a search algorithm that computes a search direction based on second-order information in a low-dimensional subspace.
Compared to first-order methods, second-order methods exhibit better convergence characteristics, but the need to compute the Hessian matrix at each iteration results in excessive computational expenses.
Our preliminary numerical results demonstrate that AdaSub surpasses popular optimizers in terms of time and number of iterations required to reach a given accuracy.
arXiv Detail & Related papers (2023-10-30T22:24:23Z) - Stochastic Optimization for Non-convex Problem with Inexact Hessian
Matrix, Gradient, and Function [99.31457740916815]
Trust-region (TR) and adaptive regularization with cubics (ARC) methods have proven to have some very appealing theoretical properties.
We show that TR and ARC methods can simultaneously accommodate inexact computations of the Hessian, gradient, and function values.
arXiv Detail & Related papers (2023-10-18T10:29:58Z) - Representing Additive Gaussian Processes by Sparse Matrices [18.618437338490487]
Additive Matérn Gaussian Processes (GPs) are among the most popular models for scalable high-dimensional problems.
Back-fitting algorithms can reduce the time complexity of computing the posterior mean from $O(n^3)$ to $O(n \log n)$.
Generalizing these algorithms to efficiently compute the posterior variance and maximum log-likelihood remains an open problem.
arXiv Detail & Related papers (2023-04-29T18:53:42Z) - Fast Computation of Optimal Transport via Entropy-Regularized Extragradient Methods [75.34939761152587]
Efficient computation of the optimal transport distance between two distributions serves as an algorithmic subroutine that empowers various applications.
This paper develops a scalable first-order optimization-based method that computes optimal transport to within $\varepsilon$ additive accuracy.
arXiv Detail & Related papers (2023-01-30T15:46:39Z) - Refined Regret for Adversarial MDPs with Linear Function Approximation [50.00022394876222]
We consider learning in an adversarial Markov Decision Process (MDP) where the loss functions can change arbitrarily over $K$ episodes.
This paper provides two algorithms that improve the regret to $\tilde{\mathcal{O}}(K^{2/3})$ in the same setting.
arXiv Detail & Related papers (2023-01-30T14:37:21Z) - Improved Rate of First Order Algorithms for Entropic Optimal Transport [2.1485350418225244]
This paper improves the state-of-the-art rate of a first-order algorithm for solving entropy regularized optimal transport.
We propose an accelerated primal-dual mirror descent algorithm with variance reduction.
Our algorithm may inspire more research to develop accelerated primal-dual algorithms that have rate $\widetilde{O}(n^2/\epsilon)$ for solving OT.
arXiv Detail & Related papers (2023-01-23T19:13:25Z) - Alternating Mahalanobis Distance Minimization for Stable and Accurate CP
Decomposition [4.847980206213335]
We introduce a new formulation for deriving singular values and vectors of a tensor by considering the critical points of a function different from what is used in the previous work.
We show that a subsweep of this algorithm can achieve a superlinear convergence rate for exact CPD with known rank.
We then view the algorithm as optimizing a Mahalanobis distance with respect to each factor with ground metric dependent on the other factors.
arXiv Detail & Related papers (2022-04-14T19:56:36Z) - 2nd-order Updates with 1st-order Complexity [0.0]
It has long been a goal to efficiently compute and use second-order information on a function ($f$) to assist in numerical approximations.
Here it is shown how, using only basic physics and a numerical approximation, such information can be accurately obtained at a cost of $\mathcal{O}(N)$ complexity.
arXiv Detail & Related papers (2021-05-24T17:47:51Z) - Hutch++: Optimal Stochastic Trace Estimation [75.45968495410048]
We introduce a new randomized algorithm, Hutch++, which computes a $(1 \pm \epsilon)$ approximation to $\mathrm{tr}(A)$ for any positive semidefinite (PSD) $A$.
We show that it significantly outperforms Hutchinson's method in experiments; a short sketch of the estimator is given after this list.
arXiv Detail & Related papers (2020-10-19T16:45:37Z) - Private Stochastic Convex Optimization: Optimal Rates in Linear Time [74.47681868973598]
We study the problem of minimizing the population loss given i.i.d. samples from a distribution over convex loss functions.
A recent work of Bassily et al. has established the optimal bound on the excess population loss achievable given $n$ samples.
We describe two new techniques for deriving convex optimization algorithms both achieving the optimal bound on excess loss and using $O(\min\{n, n^2/d\})$ gradient computations.
arXiv Detail & Related papers (2020-05-10T19:52:03Z)
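As a point of reference for the Hutch++ entry above, here is a compact sketch of that trace estimator, assuming the PSD matrix is only reachable through a black-box matrix-vector product and splitting the query budget into thirds as in the standard presentation; the function name, the default budget, and the `matvec` interface are illustrative assumptions.

```python
import numpy as np

def hutchpp(matvec, n, num_queries=60, seed=0):
    """Hutch++ estimate of tr(A) for a PSD matrix A accessed only through
    `matvec(X)`, which returns A @ X for an (n, k) block of vectors."""
    rng = np.random.default_rng(seed)
    k = num_queries // 3
    S = rng.choice([-1.0, 1.0], size=(n, k))   # Rademacher sketch vectors
    G = rng.choice([-1.0, 1.0], size=(n, k))   # fresh vectors for the residual
    Q, _ = np.linalg.qr(matvec(S))             # orthonormal basis for range(A S)
    G_perp = G - Q @ (Q.T @ G)                 # project G off the captured range
    # Exact trace on the low-rank part + Hutchinson estimate of the remainder.
    return np.trace(Q.T @ matvec(Q)) + np.trace(G_perp.T @ matvec(G_perp)) / k

# Example: for a PSD matrix A, hutchpp(lambda X: A @ X, A.shape[0]) approximates np.trace(A).
```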