InRank: Incremental Low-Rank Learning
- URL: http://arxiv.org/abs/2306.11250v2
- Date: Mon, 1 Jan 2024 03:43:41 GMT
- Title: InRank: Incremental Low-Rank Learning
- Authors: Jiawei Zhao, Yifei Zhang, Beidi Chen, Florian Schäfer, Anima Anandkumar
- Abstract summary: Gradient-based training implicitly regularizes neural networks towards low-rank solutions through a gradual increase of the rank during training.
Existing training algorithms do not exploit the low-rank property to improve computational efficiency.
We design a new training algorithm Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative weight updates as low-rank matrices.
- Score: 85.6380047359139
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The theory of greedy low-rank learning (GLRL) aims to explain the impressive
generalization capabilities of deep learning. It proves that stochastic
gradient-based training implicitly regularizes neural networks towards low-rank
solutions through a gradual increase of the rank during training. However,
there is a gap between theory and practice since GLRL requires an infinitesimal
initialization of the weights, which is not practical due to the fact that it
is a saddle point. In this work, we remove the assumption of infinitesimal
initialization by focusing on cumulative weight updates. We prove the
cumulative weight updates follow an incremental low-rank trajectory for
arbitrary orthogonal initialization of weights in a three-layer linear network.
Empirically, we demonstrate that our theory holds on a broad range of neural
networks (e.g., transformers) and standard training algorithms (e.g., SGD,
Adam). However, existing training algorithms do not exploit the low-rank
property to improve computational efficiency as the networks are not
parameterized in low-rank. To remedy this, we design a new training algorithm
Incremental Low-Rank Learning (InRank), which explicitly expresses cumulative
weight updates as low-rank matrices while incrementally augmenting their ranks
during training. We evaluate InRank on GPT-2, and our results indicate that
InRank achieves prediction performance comparable to its full-rank counterpart
while requiring at most 33% of the total ranks throughout training. We also
propose an efficient version of InRank that achieves a reduction of 37% in
total training time and 36% in model size when training GPT-medium on
WikiText-103 from scratch.
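
The core idea in the abstract, writing the cumulative weight update as a low-rank product and enlarging its rank during training, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class `LowRankUpdate`, the helper `effective_rank`, the 0.9 threshold, the rule "grow when the current factors look saturated", and the toy least-squares objective are all assumptions chosen for illustration.

```python
import numpy as np


def effective_rank(matrix, threshold=0.9):
    """Smallest r whose top-r singular values hold `threshold` of the total mass.

    Hypothetical helper: the threshold rule is an illustrative assumption.
    """
    s = np.linalg.svd(matrix, compute_uv=False)
    total = s.sum()
    if total == 0.0:
        return 0
    ratio = np.cumsum(s) / total
    return min(int(np.searchsorted(ratio, threshold)) + 1, len(s))


class LowRankUpdate:
    """Weight W = W0 + U @ V, where U @ V is the cumulative update."""

    def __init__(self, w0, rank, rng, scale=1e-3):
        m, n = w0.shape
        self.w0 = w0
        self.u = np.zeros((m, rank))                     # zero, so W starts at W0
        self.v = scale * rng.standard_normal((rank, n))  # small random factor

    @property
    def rank(self):
        return self.u.shape[1]

    def weight(self):
        return self.w0 + self.u @ self.v

    def grow(self, extra, rng, scale=1e-3):
        """Add `extra` rank-one slots; zero columns in U leave W unchanged."""
        m, n = self.w0.shape
        self.u = np.hstack([self.u, np.zeros((m, extra))])
        self.v = np.vstack([self.v, scale * rng.standard_normal((extra, n))])


def run_demo(steps=300, check_every=50, lr=0.2, max_rank=4, seed=0):
    """Fit a target whose distance from W0 has rank 2 (toy objective)."""
    rng = np.random.default_rng(seed)
    w0 = np.eye(6)
    target = w0.copy()
    target[0, 0] += 1.0
    target[1, 1] += 0.5
    layer = LowRankUpdate(w0, rank=1, rng=rng)
    initial_error = 0.5 * np.sum((layer.weight() - target) ** 2)
    for step in range(1, steps + 1):
        residual = layer.weight() - target   # gradient of the loss w.r.t. W
        grad_u = residual @ layer.v.T
        grad_v = layer.u.T @ residual
        layer.u -= lr * grad_u
        layer.v -= lr * grad_v
        # Assumed growth rule: add a slot when every current slot is in use.
        if step % check_every == 0 and layer.rank < max_rank:
            if effective_rank(layer.u @ layer.v) >= layer.rank:
                layer.grow(1, rng)
    final_error = 0.5 * np.sum((layer.weight() - target) ** 2)
    return layer.rank, initial_error, final_error


if __name__ == "__main__":
    rank, err0, err1 = run_demo()
    print(f"final rank={rank}, error {err0:.4f} -> {err1:.6f}")
```

Only the factors `u` and `v` are trained, so the number of trainable parameters scales with the current rank rather than with the full matrix size. New slots are initialized with zero columns in `u` so that growing the rank does not perturb the weight.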
Related papers
- MARS: Unleashing the Power of Variance Reduction for Training Large Models [56.47014540413659]
Adaptive gradient algorithms like Adam, AdamW, and their variants have been central to this type of training.
We propose a framework that reconciles preconditioned gradient optimization methods with variance reduction via a scaled momentum technique.
arXiv Detail & Related papers (2024-11-15T18:57:39Z) - Exact, Tractable Gauss-Newton Optimization in Deep Reversible Architectures Reveal Poor Generalization [52.16435732772263]
Second-order optimization has been shown to accelerate the training of deep neural networks in many applications.
However, generalization properties of second-order methods are still being debated.
We show for the first time that exact Gauss-Newton (GN) updates take on a tractable form in a class of deep architectures.
arXiv Detail & Related papers (2024-11-12T17:58:40Z) - No Train No Gain: Revisiting Efficient Training Algorithms For
Transformer-based Language Models [31.080446886440757]
In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia).
We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate.
We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine, which we call reference system time.
arXiv Detail & Related papers (2023-07-12T20:10:14Z) - A Framework for Provably Stable and Consistent Training of Deep
Feedforward Networks [4.21061712600981]
We present a novel algorithm for training deep neural networks in supervised (classification and regression) and unsupervised (reinforcement learning) scenarios.
This algorithm combines standard gradient descent with the gradient clipping method.
We show, in theory and through experiments, that our algorithm updates have low variance, and the training loss reduces in a smooth manner.
arXiv Detail & Related papers (2023-05-20T07:18:06Z) - Are Straight-Through gradients and Soft-Thresholding all you need for
Sparse Training? [21.889275006087875]
Turning weights to zero when training a neural network helps in reducing the computational complexity at inference.
To progressively increase the sparsity ratio in the network without causing sharp weight discontinuities during training, our work combines soft-thresholding and straight-through gradient estimation.
Our method, named ST-3 for straight-through/soft-thresholding/sparse-training, obtains SoA results, both in terms of accuracy/sparsity and accuracy/FLOPS trade-offs.
arXiv Detail & Related papers (2022-12-02T10:32:44Z) - Low-rank lottery tickets: finding efficient low-rank neural networks via
matrix differential equations [2.3488056916440856]
We propose a novel algorithm to find efficient low-rank subnetworks.
These subnetworks are determined and adapted already during the training phase.
Our method automatically and dynamically adapts the ranks during training to achieve a desired approximation accuracy.
arXiv Detail & Related papers (2022-05-26T18:18:12Z) - Effective Model Sparsification by Scheduled Grow-and-Prune Methods [73.03533268740605]
We propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models.
Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks.
arXiv Detail & Related papers (2021-06-18T01:03:13Z) - Weight Update Skipping: Reducing Training Time for Artificial Neural
Networks [0.30458514384586394]
We propose a new training methodology for ANNs that exploits the observation that the improvement in accuracy shows temporal variations.
During such time windows, we keep updating bias which ensures the network still trains and avoids overfitting.
Such a training approach virtually achieves the same accuracy with considerably less computational cost, thus lower training time.
arXiv Detail & Related papers (2020-12-05T15:12:10Z) - TRP: Trained Rank Pruning for Efficient Deep Neural Networks [69.06699632822514]
We propose Trained Rank Pruning (TRP), which alternates between low rank approximation and training.
A nuclear regularization optimized by sub-gradient descent is utilized to further promote low rank in TRP.
The TRP trained network inherently has a low-rank structure, and is approximated with negligible performance loss.
arXiv Detail & Related papers (2020-04-30T03:37:36Z) - Learning Low-rank Deep Neural Networks via Singular Vector Orthogonality
Regularization and Singular Value Sparsification [53.50708351813565]
We propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step.
We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve higher reduction on computation load under the same accuracy.
arXiv Detail & Related papers (2020-04-20T02:40:43Z)