4-bit Shampoo for Memory-Efficient Network Training
- URL: http://arxiv.org/abs/2405.18144v2
- Date: Sun, 27 Oct 2024 15:38:02 GMT
- Title: 4-bit Shampoo for Memory-Efficient Network Training
- Authors: Sike Wang, Pan Zhou, Jia Li, Hua Huang
- Abstract summary: Second-order optimizers are superior to first-order optimizers in both theory and practice.
Compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage.
We propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones.
- Score: 69.08646370812065
- License:
- Abstract: Second-order optimizers, maintaining a matrix termed a preconditioner, are superior to first-order optimizers in both theory and practice. The states forming the preconditioner and its inverse root restrict the maximum size of models trained by second-order optimizers. To address this, compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage. However, current approaches only pertain to first-order optimizers. In this paper, we propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones. We show that quantizing the eigenvector matrix of the preconditioner in 4-bit Shampoo is remarkably better than quantizing the preconditioner itself both theoretically and experimentally. By rectifying the orthogonality of the quantized eigenvector matrix, we enhance the approximation of the preconditioner's eigenvector matrix, which also benefits the computation of its inverse 4-th root. Besides, we find that linear square quantization slightly outperforms dynamic tree quantization when quantizing second-order optimizer states. Evaluation on various networks for image classification and natural language modeling demonstrates that our 4-bit Shampoo achieves comparable performance to its 32-bit counterpart while being more memory-efficient.
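To make the workflow in the abstract concrete, below is a minimal NumPy sketch of the idea: store the preconditioner's eigenvector matrix in 4-bit form, rectify its orthogonality after dequantization, and only then form the inverse 4-th root. The block-wise linear quantizer, the Björck-style rectification, the block size, and all function names are illustrative assumptions, not the authors' implementation (the paper uses linear square quantization and its own rectification step).

```python
# Hedged sketch: 4-bit quantization of a preconditioner's eigenvector matrix,
# followed by an orthogonality rectification step (Björck iteration here)
# before forming the inverse 4-th root. The plain block-wise linear quantizer
# is a simplified stand-in for the paper's linear square quantization.
import numpy as np

def quantize_4bit(x, block=64):
    """Block-wise absmax linear quantization to 16 levels (4 bits)."""
    flat = x.reshape(-1)
    pad = (-flat.size) % block
    flat = np.pad(flat, (0, pad))
    blocks = flat.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.clip(np.round(blocks / scale * 7), -8, 7).astype(np.int8)
    return q, scale, x.shape, pad

def dequantize_4bit(q, scale, shape, pad):
    flat = (q.astype(np.float32) / 7 * scale).reshape(-1)
    if pad:
        flat = flat[:-pad]
    return flat.reshape(shape)

def rectify_orthogonality(u, iters=3):
    """Björck iteration: pushes a nearly orthogonal matrix back toward the orthogonal group."""
    for _ in range(iters):
        u = 1.5 * u - 0.5 * u @ (u.T @ u)
    return u

# Toy preconditioner: quantize its eigenvector matrix instead of the matrix itself.
rng = np.random.default_rng(0)
g = rng.standard_normal((128, 128)).astype(np.float32)
precond = g @ g.T + 1e-3 * np.eye(128, dtype=np.float32)
eigvals, eigvecs = np.linalg.eigh(precond)

state = quantize_4bit(eigvecs)                     # what is stored between steps
u = rectify_orthogonality(dequantize_4bit(*state))
inv_4th_root = (u * eigvals.clip(min=1e-6) ** -0.25) @ u.T   # U diag(lambda^{-1/4}) U^T
```

The point of the sketch is what persists between steps: only the 4-bit eigenvector payload, its per-block scales, and the eigenvalues are kept, while the orthogonality fix and the inverse 4-th root are recomputed from the dequantized copy.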
Related papers
- State-Free Inference of State-Space Models: The Transfer Function Approach [132.83348321603205]
State-free inference does not incur any significant memory or computational cost with an increase in state size.
We achieve this using properties of the proposed frequency domain transfer function parametrization.
We report improved perplexity in language modeling over a long convolutional Hyena baseline.
arXiv Detail & Related papers (2024-05-10T00:06:02Z) - A Computationally Efficient Sparsified Online Newton Method [48.78646010774149]
Sparsified Online Newton (SONew) is a memory-efficient second-order algorithm that yields a sparsified yet effective preconditioner.
We achieve up to 30% faster convergence, 3.4% relative improvement in validation performance, and 80% relative improvement in training loss.
arXiv Detail & Related papers (2023-11-16T18:44:22Z) - Memory Efficient Optimizers with 4-bit States [22.605392665667136]
We push optimizer state bitwidth down to 4 bits through a detailed empirical analysis of the first and second moments.
We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization.
Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning.
arXiv Detail & Related papers (2023-09-04T10:27:17Z) - KrADagrad: Kronecker Approximation-Domination Gradient Preconditioned
Stochastic Optimization [69.47358238222586]
Second-order approximations allow the parameter update step size and direction to adapt to the loss curvature, but have traditionally required too much memory and compute for deep learning.
Recently, Shampoo introduced a Kronecker-factored preconditioner to reduce these requirements.
However, it takes inverse matrix roots of ill-conditioned matrices.
This requires 64-bit precision, imposing strong hardware constraints.
arXiv Detail & Related papers (2023-05-30T21:15:45Z) - Size optimization of CNOT circuits on NISQ [13.391818915679796]
We study the size optimization of CNOT circuits on noisy intermediate-scale quantum (NISQ) devices.
We implement our algorithm on IBM20 and some other NISQ devices; in our experiments, the results are better than those of most other methods.
arXiv Detail & Related papers (2022-10-11T06:44:04Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states (a minimal sketch of this block-wise idea follows the list below).
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - Reducing the Variance of Gaussian Process Hyperparameter Optimization
with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
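As a companion to the 8-bit optimizers entry above, here is a minimal sketch, assuming a plain block-wise absmax quantizer, of how quantized optimizer states slot into an Adam-style update: the moments live as int8 payloads plus per-block scales between steps and are dequantized, updated, and requantized inside each step. The real 8-bit optimizers use non-uniform (dynamic tree) quantization; every helper name below is illustrative.

```python
# Hedged sketch: Adam-style step with 8-bit block-wise quantized moments.
# Only the int8 payload and the per-block scales persist between steps.
import numpy as np

def quantize_8bit(x, block=256):
    """Block-wise absmax quantization to int8; assumes x.size % block == 0."""
    blocks = x.reshape(-1, block)
    scale = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    q = np.round(blocks / scale * 127).astype(np.int8)
    return q, scale.astype(np.float32), x.shape

def dequantize_8bit(q, scale, shape):
    return (q.astype(np.float32) / 127 * scale).reshape(shape)

def adam_step_8bit(param, grad, state, step, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # Materialize 32-bit copies of the moments only for the duration of the step.
    m = dequantize_8bit(*state["m"])
    v = dequantize_8bit(*state["v"])
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** step)
    v_hat = v / (1 - b2 ** step)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    # Requantize before returning: this is all that is stored between steps.
    state["m"] = quantize_8bit(m)
    state["v"] = quantize_8bit(v)
    return param, state

# Toy usage on a single 1024-element parameter vector.
rng = np.random.default_rng(0)
param = rng.standard_normal(1024).astype(np.float32)
state = {"m": quantize_8bit(np.zeros(1024, dtype=np.float32)),
         "v": quantize_8bit(np.zeros(1024, dtype=np.float32))}
for step in range(1, 11):
    grad = rng.standard_normal(1024).astype(np.float32)
    param, state = adam_step_8bit(param, grad, state, step)
```

Block-wise scaling confines the effect of an outlier to its own block, which is what lets a low-bit code cover the full dynamic range of the moments.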