Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- URL: http://arxiv.org/abs/2412.10663v1
- Date: Sat, 14 Dec 2024 03:32:54 GMT
- Title: Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- Authors: Jingyang Li, Kuangyu Ding, Kim-Chuan Toh, Pan Zhou
- Abstract summary: We introduce 4-bit quantization for Shampoo's preconditioners.
To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners.
- Score: 53.422307389223626
- License:
- Abstract: Preconditioned stochastic optimization algorithms, exemplified by Shampoo, have demonstrated superior performance over first-order optimizers, providing both theoretical advantages in convergence rates and practical improvements in large-scale neural network training. However, they incur substantial memory overhead due to the storage demands of non-diagonal preconditioning matrices. To address this, we introduce 4-bit quantization for Shampoo's preconditioners. We propose two key methods: First, we apply Cholesky decomposition followed by quantization of the Cholesky factors, reducing memory usage by leveraging their lower triangular structure while preserving symmetry and positive definiteness to minimize information loss. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. Second, we incorporate error feedback in the quantization process, efficiently storing Cholesky factors and error states in the lower and upper triangular parts of the same matrix. Through extensive experiments, we demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance in large-scale deep-learning tasks. Theoretically, we also provide convergence proofs for quantized Shampoo under both smooth and non-smooth stochastic optimization settings.
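The abstract describes two mechanisms: quantizing each preconditioner's Cholesky factor to 4 bits, and carrying the quantization error (error feedback) in the otherwise unused triangle of the same matrix. The NumPy sketch below illustrates both ideas under simplifying assumptions; it is not the authors' implementation, and the linear 4-bit grid, the function names, and the choice to keep the diagonal error in a separate vector are all illustrative.

```python
# Minimal sketch (not the paper's code): 4-bit quantization of a Cholesky
# factor with error feedback packed into the same square matrix.
import numpy as np

BITS = 4
NLEVELS = 2 ** BITS - 1  # 15 usable codes on a simple uniform grid


def quantize(values, scale):
    """Map values to integer codes in [0, 15] on a uniform grid (illustrative scheme)."""
    codes = np.round(values / scale) + NLEVELS // 2
    return np.clip(codes, 0, NLEVELS).astype(np.uint8)


def dequantize(codes, scale):
    return (codes.astype(np.float64) - NLEVELS // 2) * scale


def quantize_cholesky_with_feedback(preconditioner, error):
    """Quantize the lower-triangular Cholesky factor of an SPD preconditioner,
    first adding back the previously accumulated quantization error.

    Returns one matrix whose lower triangle holds the (de)quantized factor and
    whose strictly upper triangle holds the new off-diagonal error state; the
    diagonal error is returned separately here for simplicity.
    """
    n = preconditioner.shape[0]
    L = np.linalg.cholesky(preconditioner)            # lower-triangular factor
    lower = np.tril_indices(n)

    corrected = L[lower] + error[lower]               # error feedback step
    scale = 2 * max(np.abs(corrected).max(), 1e-12) / NLEVELS

    codes = quantize(corrected, scale)
    recon = dequantize(codes, scale)
    new_err = corrected - recon                       # error to feed back next step

    packed = np.zeros((n, n))
    packed[lower] = recon          # a real implementation would store the 4-bit codes
    err_mat = np.zeros((n, n))
    err_mat[lower] = new_err
    packed += np.triu(err_mat.T, k=1)                 # off-diagonal error, transposed
    return packed, np.diag(err_mat).copy(), scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((8, 8))
    P = G @ G.T + 1e-3 * np.eye(8)                    # synthetic SPD preconditioner
    packed, diag_err, scale = quantize_cholesky_with_feedback(P, np.zeros((8, 8)))
    L_hat = np.tril(packed)                           # recovered quantized factor
    print(np.linalg.norm(P - L_hat @ L_hat.T) / np.linalg.norm(P))
```

In a production optimizer the lower triangle would hold packed 4-bit integer codes rather than dequantized floats, and the scale would typically be chosen block-wise; both simplifications here are only to keep the sketch short.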
Related papers
- ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization [58.84018707089315]
We present a unified framework for rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings.
We show that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off.
Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
arXiv Detail & Related papers (2025-02-04T18:59:26Z)
- 4-bit Shampoo for Memory-Efficient Network Training [69.08646370812065]
Second-order optimizers are superior to first-order optimizers in both theory and practice.
Compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage.
We propose the first 4-bit second-order optimizer, exemplified by 4-bit Shampoo, which maintains performance similar to that of its 32-bit counterpart.
arXiv Detail & Related papers (2024-05-28T13:02:56Z)
- Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z)
- Towards Mixed-Precision Quantization of Neural Networks via Constrained Optimization [28.76708310896311]
We present a principled framework to solve the mixed-precision quantization problem.
We show that our method is derived in a principled way and is much more computationally efficient.
arXiv Detail & Related papers (2021-10-13T08:09:26Z)
- Beyond Neighbourhood-Preserving Transformations for Quantization-Based Unsupervised Hashing [0.0]
An effective unsupervised hashing algorithm leads to compact binary codes preserving the neighborhood structure of data as much as possible.
Although rigid transformations are effective, they may not reduce quantization loss to its lowest possible level.
Motivated by these shortcomings, we propose to employ both rigid and non-rigid transformations to reduce quantization error and dimensionality simultaneously.
arXiv Detail & Related papers (2021-10-01T05:13:01Z)
- Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication.
We prove that preconditioning has an additional benefit that has been previously unexplored.
It can simultaneously reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
- SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks.
The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix.
We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
- Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator [75.05106948314956]
We show that an increasingly large momentum parameter for the first-order moment is sufficient for adaptive scaling.
We also give insights for increasing the momentum in a stagewise manner in accordance with a stagewise decreasing step size.
arXiv Detail & Related papers (2021-04-30T08:50:24Z)
- Improving the Quantum Approximate Optimization Algorithm with postselection [0.0]
Combinatorial optimization is among the main applications envisioned for near-term and fault-tolerant quantum computers.
We consider a well-studied quantum algorithm for optimization: the Quantum Approximate Optimization Algorithm (QAOA) applied to the MaxCut problem on 3-regular graphs.
We derive theoretical upper and lower bounds showing that a constant (though small) increase of the fraction of satisfied edges is indeed achievable.
arXiv Detail & Related papers (2020-11-10T22:17:50Z)
This list is automatically generated from the titles and abstracts of the papers on this site.