KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization
- URL: http://arxiv.org/abs/2305.19416v1
- Date: Tue, 30 May 2023 21:15:45 GMT
- Title: KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization
- Authors: Jonathan Mei, Alexander Moreno, Luke Walters
- Abstract summary: Second order stochastic optimizers allow parameter update step size and direction to adapt to loss curvature.
Recently, Shampoo introduced a Kronecker factored preconditioner to reduce these requirements.
It takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints.
- Score: 69.47358238222586
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Second order stochastic optimizers allow parameter update step size and
direction to adapt to loss curvature, but have traditionally required too much
memory and compute for deep learning. Recently, Shampoo [Gupta et al., 2018]
introduced a Kronecker factored preconditioner to reduce these requirements: it
is used for large deep models [Anil et al., 2020] and in production [Anil et
al., 2022]. However, it takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints. In this
paper, we propose a novel factorization, Kronecker Approximation-Domination
(KrAD). Using KrAD, we update a matrix that directly approximates the inverse
empirical Fisher matrix (like full matrix AdaGrad), avoiding inversion and
hence 64-bit precision. We then propose KrADagrad$^\star$, with similar
computational costs to Shampoo and the same regret. Synthetic ill-conditioned
experiments show improved performance over Shampoo for 32-bit precision, while
for several real datasets we have comparable or better generalization.
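To make the abstract's contrast concrete, here is a minimal NumPy sketch (not code from the paper; function names and hyperparameters are illustrative). The first function performs a Shampoo-style step, whose inverse 4th roots of the two Kronecker factors are the ill-conditioned operations that push Shampoo to 64-bit precision; the second maintains a full-matrix inverse accumulator directly via the Sherman-Morrison identity, illustrating the "approximate the inverse, avoid inversion" idea that, per the abstract, KrAD realizes with a Kronecker factorization.

```python
import numpy as np

def shampoo_step(G, L, R, lr=1e-2, eps=1e-6):
    """One Shampoo-style update for a matrix-shaped gradient G (m x n).

    L and R accumulate G G^T and G^T G; the step needs inverse 4th roots of
    both factors, the numerically sensitive operation the abstract refers to.
    """
    L = L + G @ G.T
    R = R + G.T @ G

    def inv_root(M, p):
        # Symmetric PSD M^{-1/p} via eigendecomposition, with eigenvalue clamping.
        w, V = np.linalg.eigh(M)
        w = np.maximum(w, eps)
        return (V * w ** (-1.0 / p)) @ V.T

    return -lr * inv_root(L, 4) @ G @ inv_root(R, 4), L, R


def direct_inverse_update(H_inv, g):
    """Rank-one update of an inverse accumulator with no explicit inversion.

    Given H_inv = H^{-1}, returns (H + g g^T)^{-1} via the Sherman-Morrison
    identity. This is a generic full-matrix-AdaGrad-style illustration, not
    the paper's KrAD update.
    """
    Hg = H_inv @ g
    return H_inv - np.outer(Hg, Hg) / (1.0 + g @ Hg)
```

Maintaining the inverse itself, rather than the accumulator, is what removes inverse-root computations from the update path; as the abstract describes, KrAD's contribution is doing this while keeping a Kronecker-factored, memory-efficient form.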
Related papers
- Highly Adaptive Ridge [84.38107748875144]
We propose a regression method that achieves an $n^{-2/3}$ dimension-free $L^2$ convergence rate in the class of right-continuous functions with square-integrable sectional derivatives.
HAR is exactly kernel ridge regression with a specific data-adaptive kernel based on a saturated zero-order tensor-product spline basis expansion.
We demonstrate empirical performance better than state-of-the-art algorithms for small datasets in particular.
arXiv Detail & Related papers (2024-10-03T17:06:06Z) - Curvature-Informed SGD via General Purpose Lie-Group Preconditioners [6.760212042305871]
We present a novel approach to accelerate stochastic gradient descent (SGD) by utilizing curvature information.
Our approach involves two preconditioners: a matrix-free preconditioner and a low-rank approximation preconditioner.
We demonstrate that Preconditioned SGD (PSGD) outperforms SoTA on Vision, NLP, and RL tasks.
arXiv Detail & Related papers (2024-02-07T03:18:00Z) - Ginger: An Efficient Curvature Approximation with Linear Complexity for
General Neural Networks [33.96967104979137]
Second-order optimization approaches like the Gauss-Newton method are considered more powerful than first-order methods because they utilize the curvature information of the objective function.
We propose Ginger, an eigendecomposition-based approximation of the generalized Gauss-Newton matrix.
arXiv Detail & Related papers (2024-02-05T18:51:17Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance (a generic column-row sampling estimator of this kind is sketched after this list).
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Monarch: Expressive Structured Matrices for Efficient and Accurate
Training [64.6871423399431]
Large neural networks excel in many domains, but they are expensive to train and fine-tune.
A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones.
We propose a class of matrices (Monarch) that is hardware-efficient.
arXiv Detail & Related papers (2022-04-01T17:37:29Z) - Sparse Factorization of Large Square Matrices [10.94053598642913]
In this paper, we propose to approximate a large square matrix with a product of sparse full-rank matrices.
In the approximation, our method needs only $N(\log N)^2$ non-zero numbers for an $N \times N$ full matrix.
We show that our method gives a better approximation when the approximated matrix is sparse and high-rank.
arXiv Detail & Related papers (2021-09-16T18:42:21Z) - Fast Low-Rank Tensor Decomposition by Ridge Leverage Score Sampling [5.740578698172382]
We study Tucker decompositions and use tools from randomized numerical linear algebra called ridge leverage scores.
We show how to use approximate ridge leverage scores to construct a sketched instance for any ridge regression problem.
We demonstrate the effectiveness of our approximate ridge regression algorithm for large, low-rank Tucker decompositions on both synthetic and real-world data.
arXiv Detail & Related papers (2021-07-22T13:32:47Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization
with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z) - The flare Package for High Dimensional Linear Regression and Precision
Matrix Estimation in R [45.24529956312764]
This paper describes an R package named flare, which implements a family of new high dimensional regression methods.
The package flare is coded in double precision C, and called from R by a user-friendly interface.
Experiments show that flare is efficient and can scale up to large problems.
arXiv Detail & Related papers (2020-06-27T18:01:56Z)
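For context on the WTA-CRS entry above, the following is a minimal sketch of the classic column-row sampling (CRS) estimator of a matrix product that this line of work builds on; it is not the winner-take-all variant proposed in that paper, and the function name and sampling distribution are illustrative assumptions.

```python
import numpy as np

def crs_matmul(A, B, k, rng=None):
    """Unbiased estimate of A @ B from k sampled column-row pairs.

    Pair i is drawn with probability p_i proportional to ||A[:, i]|| * ||B[i, :]||,
    and its rank-one contribution is rescaled by 1 / (k * p_i), so the estimator's
    expectation over the sampling equals the exact product.
    """
    rng = np.random.default_rng() if rng is None else rng
    norms = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = norms / norms.sum()
    idx = rng.choice(A.shape[1], size=k, replace=True, p=p)
    # Sum of rescaled rank-one terms A[:, i] B[i, :] / (k * p_i).
    return sum(np.outer(A[:, i], B[i, :]) / (k * p[i]) for i in idx)
```

The rescaling by 1 / (k * p_i) is what makes the estimator unbiased; variance-reduction schemes such as WTA-CRS refine how the column-row pairs are selected.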