Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- URL: http://arxiv.org/abs/2412.10663v2
- Date: Wed, 12 Mar 2025 09:19:31 GMT
- Title: Memory-Efficient 4-bit Preconditioned Stochastic Optimization
- Authors: Jingyang Li, Kuangyu Ding, Kim-Chuan Toh, Pan Zhou
- Abstract summary: We introduce 4-bit quantization for Shampoo's preconditioners. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. We demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance.
- Score: 53.422307389223626
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preconditioned stochastic optimization algorithms, exemplified by Shampoo, outperform first-order optimizers by offering theoretical convergence benefits and practical gains in large-scale neural network training. However, they incur substantial memory overhead due to the storage demands of non-diagonal preconditioning matrices. To address this, we introduce 4-bit quantization for Shampoo's preconditioners. We introduce two key methods: First, we apply Cholesky decomposition followed by quantization of the Cholesky factors, reducing memory usage by leveraging their lower triangular structure while better preserving spectral properties to minimize information loss. To our knowledge, this is the first quantization approach applied to Cholesky factors of preconditioners. Second, we incorporate error feedback in the quantization process, efficiently storing Cholesky factor and error state in the lower and upper triangular parts of the same matrix. Through extensive experiments, we demonstrate that combining Cholesky quantization with error feedback enhances memory efficiency and algorithm performance in large-scale deep-learning tasks. Theoretically, we also provide convergence proofs for quantized Shampoo under both smooth and non-smooth stochastic optimization settings.
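Below is a minimal, illustrative sketch (in PyTorch) of the two ideas described in the abstract: quantizing the lower-triangular Cholesky factor of a preconditioner to 4 bits, and keeping the quantization error as feedback in the strictly upper-triangular part of the same matrix. The quantizer here is a simple per-matrix absmax scheme, and the helper names (`quantize_4bit`, `pack_cholesky_with_error`) are hypothetical; the paper's actual quantizer, block layout, and handling of the diagonal error may differ.

```python
import torch

def quantize_4bit(x):
    """Symmetric absmax quantization to 4-bit integer codes in [-7, 7] (illustrative)."""
    scale = x.abs().max().clamp_min(1e-12)
    codes = torch.clamp(torch.round(x / scale * 7.0), -7, 7).to(torch.int8)
    return codes, scale

def dequantize_4bit(codes, scale):
    return codes.float() * scale / 7.0

def pack_cholesky_with_error(P, prev_state=None):
    """Sketch of Cholesky quantization with error feedback (names are hypothetical).

    The lower triangle of the returned matrix holds the dequantized 4-bit
    Cholesky factor (a real implementation would keep the packed 4-bit codes);
    the strictly upper triangle holds the quantization-error state. Diagonal
    error is dropped in this simplified version.
    """
    L = torch.linalg.cholesky(P)                              # lower-triangular factor
    if prev_state is not None:
        # recover the error state from the upper triangle and feed it back
        err_prev = torch.triu(prev_state, diagonal=1).transpose(-1, -2)
        L = L + err_prev
    codes, scale = quantize_4bit(L)
    L_hat = torch.tril(dequantize_4bit(codes, scale))         # quantized factor
    err = torch.tril(L - L_hat, diagonal=-1)                  # error for the next step
    state = L_hat + err.transpose(-1, -2)                     # lower: factor, upper: error
    return state, codes, scale

# usage on a small symmetric positive-definite preconditioner
P = torch.eye(4) + 0.1 * torch.ones(4, 4)
state, codes, scale = pack_cholesky_with_error(P)
state, codes, scale = pack_cholesky_with_error(P, prev_state=state)
```

Storing the error state in the otherwise unused upper triangle means error feedback costs no extra memory beyond the square matrix already allocated for the factor, which matches the storage layout described in the abstract.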
Related papers
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [65.37942405146232]
We present a novel type of optimizer that carries extremely lightweight state elements, achieved through ultra-low-precision quantization.
The proposed SOLO achieves substantial memory savings (approximately 45 GB when training a 7B model) with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z) - Preparation Circuits for Matrix Product States by Classical Variational Disentanglement [0.0]
We study the classical compilation of quantum circuits for the preparation of matrix product states (MPS).
Our algorithm represents a near-term alternative to previous sequential approaches by reverse application of a disentangler.
We show numerical results for ground states of one-dimensional, local Hamiltonians as well as artificially spread out entanglement among multiple qubits.
arXiv Detail & Related papers (2025-04-30T04:13:01Z) - Gaussian boson sampling for binary optimization [0.0]
We propose to use a parametrized Gaussian Boson Sampler (GBS) with threshold detectors to address binary optimization problems.
Numerical experiments on 3-SAT and graph-based optimization problems show significant performance gains over random guessing.
arXiv Detail & Related papers (2024-12-19T12:12:22Z) - 4-bit Shampoo for Memory-Efficient Network Training [69.08646370812065]
Second-order optimizers are superior to first-order optimizers in both theory and practice.
Compressing their 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage.
We propose the first 4-bit second-order optimizer, exemplified by 4-bit Shampoo, maintaining performance similar to that of its 32-bit counterpart.
arXiv Detail & Related papers (2024-05-28T13:02:56Z) - Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models.
Existing methods often compromise precision or require extra data for calibration.
We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z) - Quantization of Large Language Models with an Overdetermined Basis [73.79368761182998]
We introduce an algorithm for data quantization based on the principles of Kashin representation.
Our findings demonstrate that Kashin Quantization achieves competitive or superior quality in model performance.
arXiv Detail & Related papers (2024-04-15T12:38:46Z) - Quantum Approximation Optimization Algorithm for the Trellis based Viterbi Decoding of Classical Error Correcting Codes [0.0]
We construct a hybrid quantum-classical Viterbi decoder for the classical error-correcting codes.
We show that the quantum approximate optimization algorithm can find any path on the trellis with the minimum distance relative to the received erroneous vector.
arXiv Detail & Related papers (2023-04-05T08:25:14Z) - Automatic and effective discovery of quantum kernels [43.702574335089736]
Quantum computing can empower machine learning models by enabling kernel machines to leverage quantum kernels for representing similarity measures between data points.
We present a different approach that employs optimization techniques similar to those used in neural architecture search and AutoML.
The results obtained by testing our approach on a high-energy physics problem demonstrate that, in the best-case scenario, we can either match or improve testing accuracy with respect to the manual design approach.
arXiv Detail & Related papers (2022-09-22T16:42:14Z) - Towards Mixed-Precision Quantization of Neural Networks via Constrained
Optimization [28.76708310896311]
We present a principled framework to solve the mixed-precision quantization problem.
We show that our method is derived in a principled way and is much more computationally efficient.
arXiv Detail & Related papers (2021-10-13T08:09:26Z) - Beyond Neighbourhood-Preserving Transformations for Quantization-Based
Unsupervised Hashing [0.0]
An effective unsupervised hashing algorithm leads to compact binary codes preserving the neighborhood structure of data as much as possible.
Although employing rigid transformations is effective, it may not reduce quantization loss to its ultimate limit.
Motivated by these shortcomings, we propose to employ both rigid and non-rigid transformations to reduce quantization error and dimensionality simultaneously.
arXiv Detail & Related papers (2021-10-01T05:13:01Z) - Unified Convergence Analysis for Adaptive Optimization with Moving Average Estimator [75.05106948314956]
We show that an increasingly large momentum parameter for the first-order moment is sufficient for adaptive scaling.
We also give insights into increasing the momentum in a stagewise manner, in accordance with a stagewise decreasing step size.
arXiv Detail & Related papers (2021-04-30T08:50:24Z) - Improving the Quantum Approximate Optimization Algorithm with
postselection [0.0]
Combinatorial optimization is among the main applications envisioned for near-term and fault-tolerant quantum computers.
We consider a well-studied quantum algorithm for optimization: the Quantum Approximate Optimization Algorithm (QAOA) applied to the MaxCut problem on 3-regular graphs.
We derive theoretical upper and lower bounds showing that a constant (though small) increase of the fraction of satisfied edges is indeed achievable.
arXiv Detail & Related papers (2020-11-10T22:17:50Z)