Sketchy: Memory-efficient Adaptive Regularization with Frequent
Directions
- URL: http://arxiv.org/abs/2302.03764v2
- Date: Mon, 16 Oct 2023 23:51:00 GMT
- Title: Sketchy: Memory-efficient Adaptive Regularization with Frequent
Directions
- Authors: Vladimir Feinberg, Xinyi Chen, Y. Jennifer Sun, Rohan Anil, Elad Hazan
- Abstract summary: We find the spectra of the Kronecker-factored gradient covariance matrix in deep learning (DL) training tasks are concentrated on a small leading eigenspace.
We describe a generic method for reducing memory and compute requirements of maintaining a matrix preconditioner.
We show extensions of our work to Shampoo, resulting in a method competitive in quality with Shampoo and Adam, yet requiring only sub-linear memory for tracking second moments.
- Score: 22.09320263962004
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adaptive regularization methods that exploit more than the diagonal entries
exhibit state-of-the-art performance for many tasks, but can be prohibitive in
terms of memory and running time. We find the spectra of the Kronecker-factored
gradient covariance matrix in deep learning (DL) training tasks are
concentrated on a small leading eigenspace that changes throughout training,
motivating a low-rank sketching approach. We describe a generic method for
reducing memory and compute requirements of maintaining a matrix preconditioner
using the Frequent Directions (FD) sketch. While previous approaches have
explored applying FD for second-order optimization, we present a novel analysis
which allows efficient interpolation between resource requirements and the
degradation in regret guarantees with rank $k$: in the online convex
optimization (OCO) setting over dimension $d$, we match full-matrix $d^2$
memory regret using only $dk$ memory up to additive error in the bottom $d-k$
eigenvalues of the gradient covariance. Further, we show extensions of our work
to Shampoo, resulting in a method competitive in quality with Shampoo and Adam,
yet requiring only sub-linear memory for tracking second moments.
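As an illustrative aside (not code from the paper), the sketch below shows how a Frequent Directions (FD) buffer can track a gradient covariance in $O(\ell d)$ memory and then act as a full-matrix-style preconditioner. The class name, the $\epsilon$-regularized inverse-root step, and the toy usage are our own choices for exposition; the paper's actual method applies the idea to Shampoo's Kronecker-factored statistics rather than a single flat covariance.

```python
import numpy as np


class FrequentDirections:
    """Maintain B (ell x d) with B^T B approximating sum_t g_t g_t^T.

    Memory is O(ell * d) instead of the O(d^2) needed for the full covariance.
    """

    def __init__(self, dim, sketch_size):
        self.B = np.zeros((sketch_size, dim))

    def update(self, g):
        # Insert the new gradient into an all-zero row, shrinking first if full.
        empty = np.flatnonzero(~self.B.any(axis=1))
        if empty.size == 0:
            self._shrink()
            empty = np.flatnonzero(~self.B.any(axis=1))
        self.B[empty[0]] = g

    def _shrink(self):
        # Standard FD step: subtract the smallest retained squared singular
        # value from all of them, which zeroes out at least one row.
        _, s, Vt = np.linalg.svd(self.B, full_matrices=False)
        s2 = np.maximum(s ** 2 - s[-1] ** 2, 0.0)
        self.B = np.sqrt(s2)[:, None] * Vt

    def precondition(self, g, eps=1e-8):
        # Apply (B^T B + eps * I)^{-1/2} to g without forming the d x d matrix:
        # scale the component inside the sketch's row space by 1/sqrt(s_i^2 + eps)
        # and the orthogonal remainder by 1/sqrt(eps).
        _, s, Vt = np.linalg.svd(self.B, full_matrices=False)
        coords = Vt @ g
        in_space = Vt.T @ (coords / np.sqrt(s ** 2 + eps))
        out_of_space = (g - Vt.T @ coords) / np.sqrt(eps)
        return in_space + out_of_space


# Toy usage: gradients concentrated on a low-dimensional subspace, as the
# paper observes for Kronecker-factored covariances in DL training.
rng = np.random.default_rng(0)
d, k = 256, 16
basis = rng.standard_normal((k, d))
fd = FrequentDirections(dim=d, sketch_size=2 * k)  # extra rows tighten the FD bound
for _ in range(500):
    g = rng.standard_normal(k) @ basis + 0.01 * rng.standard_normal(d)
    fd.update(g)
    direction = fd.precondition(g)  # adaptive step direction with O(kd) memory
```

Roughly speaking, the standard FD guarantee bounds the sketch's operator-norm error by the sum of the discarded (bottom) eigenvalues of the true covariance divided by the slack in the sketch size, which is why the regret guarantee above degrades only additively in the bottom $d-k$ eigenvalues.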
Related papers
- Efficient Adaptive Optimization via Subset-Norm and Subspace-Momentum: Fast, Memory-Reduced Training with Convergence Guarantees [5.399838579600896]
We introduce two complementary techniques for memory optimization.
One technique, Subset-Norm, reduces the memory footprint of the adaptive step-size (second-moment) state by sharing a single step size across subsets of parameters.
The other technique, Subspace-Momentum, reduces the momentum state's memory footprint by tracking momentum in a low-dimensional subspace.
arXiv Detail & Related papers (2024-11-11T16:48:07Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative METL strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, relying on a minimal number of late pre-trained layers alleviates the peak memory demand.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - MGDA Converges under Generalized Smoothness, Provably [27.87166415148172]
Multi-objective optimization (MOO) is receiving more attention in various fields such as multi-task learning.
Recent works provide some effective algorithms with theoretical analysis but they are limited by the standard $L$-smooth or bounded-gradient assumptions.
We study a more general and realistic class of generalized $\ell$-smooth loss functions, where $\ell$ is a general non-decreasing function of the gradient norm.
arXiv Detail & Related papers (2024-05-29T18:36:59Z) - Implicit Bias and Fast Convergence Rates for Self-attention [30.08303212679308]
Self-attention, the core mechanism of transformers, distinguishes them from traditional neural networks and drives their outstanding performance.
We investigate the implicit bias of gradient descent (GD) in training a self-attention layer with a fixed linear decoder in binary classification.
We provide the first finite-time convergence rate for $W_t$ to $W_{mm}$, along with the rate of sparsification in the attention map.
arXiv Detail & Related papers (2024-02-08T15:15:09Z) - Iterative Reweighted Least Squares Networks With Convergence Guarantees
for Solving Inverse Imaging Problems [12.487990897680422]
We present a novel optimization strategy for image reconstruction tasks under analysis-based image regularization.
We parameterize such regularizers using potential functions that correspond to weighted extensions of the $\ell_p^p$-vector and $\mathcal{S}_p^p$ Schatten-matrix quasi-norms.
We show that thanks to the convergence guarantees of our proposed minimization strategy, such optimization can be successfully performed with a memory-efficient implicit back-propagation scheme.
arXiv Detail & Related papers (2023-08-10T17:59:46Z) - Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222]
We propose a new family of unbiased estimators called WTA-CRS for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
arXiv Detail & Related papers (2023-05-24T15:52:08Z) - Smoothed Online Convex Optimization Based on Discounted-Normal-Predictor [68.17855675511602]
We investigate an online prediction strategy named Discounted-Normal-Predictor (Kapralov and Panigrahy, 2010) for smoothed online convex optimization (SOCO).
We show that the proposed algorithm can minimize the adaptive regret with switching cost in every interval.
arXiv Detail & Related papers (2022-05-02T08:48:22Z) - Continuous-Time Meta-Learning with Forward Mode Differentiation [65.26189016950343]
We introduce Continuous-Time Meta-Learning (COMLN), a meta-learning algorithm where adaptation follows the dynamics of a gradient vector field.
Treating the learning process as an ODE offers the notable advantage that the length of the trajectory becomes a continuous quantity rather than a fixed, discrete number of gradient steps.
We show empirically its efficiency in terms of runtime and memory usage, and we illustrate its effectiveness on a range of few-shot image classification problems.
arXiv Detail & Related papers (2022-03-02T22:35:58Z) - Large Scale Private Learning via Low-rank Reparametrization [77.38947817228656]
We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks.
We are the first to apply differential privacy to the BERT model, achieving an average accuracy of $83.9\%$ on four downstream tasks.
arXiv Detail & Related papers (2021-06-17T10:14:43Z) - Effective Dimension Adaptive Sketching Methods for Faster Regularized
Least-Squares Optimization [56.05635751529922]
We propose a new randomized algorithm for solving L2-regularized least-squares problems based on sketching.
We consider two of the most popular random embeddings, namely, Gaussian embeddings and the Subsampled Randomized Hadamard Transform (SRHT).
arXiv Detail & Related papers (2020-06-10T15:00:09Z)
This list is automatically generated from the titles and abstracts of the papers on this site.