Memory Efficient Optimizers with 4-bit States
- URL: http://arxiv.org/abs/2309.01507v3
- Date: Fri, 27 Oct 2023 06:24:08 GMT
- Title: Memory Efficient Optimizers with 4-bit States
- Authors: Bingrui Li, Jianfei Chen, Jun Zhu
- Abstract summary: We push optimizer state bitwidth down to 4-bit through a detailed empirical analysis of first and second moments.
We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization.
Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning.
- Score: 22.605392665667136
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimizer states are a major source of memory consumption for training neural
networks, limiting the maximum trainable model within a given memory budget.
Compressing the optimizer states from 32-bit floating points to lower bitwidth
is promising to reduce the training memory footprint, while the current lowest
achievable bitwidth is 8-bit. In this work, we push optimizer states bitwidth
down to 4-bit through a detailed empirical analysis of first and second
moments. Specifically, we find that moments have complicated outlier patterns,
that current block-wise quantization cannot accurately approximate. We use a
smaller block size and propose to utilize both row-wise and column-wise
information for better quantization. We further identify a zero point problem
of quantizing the second moment, and solve this problem with a linear quantizer
that excludes the zero point. Our 4-bit optimizers are evaluated on a wide
variety of benchmarks including natural language understanding, machine
translation, image classification, and instruction tuning. On all the tasks our
optimizers can achieve comparable accuracy with their full-precision
counterparts, while enjoying better memory efficiency.
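The zero-point fix described in the abstract can be illustrated with a minimal sketch. It assumes non-negative second-moment values, a per-block scale equal to the block maximum, and a 16-level linear code whose levels are 1/16 through 16/16 of that scale, so that a block with a positive maximum never decodes a value to exactly zero. The function names, the block size of 4, and the pure-Python layout are illustrative assumptions, not the authors' implementation; the row-wise and column-wise normalization mentioned in the abstract is not shown.

```python
from typing import List, Tuple

BITS = 4
LEVELS = 2 ** BITS  # 16 codes per 4-bit value


def quantize_block(values: List[float]) -> Tuple[float, List[int]]:
    """Quantize one block of non-negative values to 4-bit codes.

    Code c stands for the level (c + 1) / 16 of the block maximum,
    so the smallest representable value is scale / 16, not zero.
    """
    scale = max(values)
    if scale == 0.0:
        # All-zero block: keep scale 0 so decoding returns zeros.
        return 0.0, [0] * len(values)
    codes = []
    for v in values:
        k = round(v / scale * LEVELS)   # nearest level index
        k = min(LEVELS, max(1, k))      # clamp to 1..16, skipping zero
        codes.append(k - 1)             # store as a code in 0..15
    return scale, codes


def dequantize_block(scale: float, codes: List[int]) -> List[float]:
    """Map 4-bit codes back to floats using the block scale."""
    return [scale * (c + 1) / LEVELS for c in codes]


def quantize(values: List[float], block_size: int) -> List[Tuple[float, List[int]]]:
    """Split values into consecutive blocks and quantize each one."""
    return [
        quantize_block(values[i:i + block_size])
        for i in range(0, len(values), block_size)
    ]


def dequantize(blocks: List[Tuple[float, List[int]]]) -> List[float]:
    """Concatenate the decoded values of all blocks."""
    out: List[float] = []
    for scale, codes in blocks:
        out.extend(dequantize_block(scale, codes))
    return out


# Hypothetical second-moment values with small entries next to large ones.
second_moment = [0.0, 1e-6, 0.5, 1.0, 0.0, 0.02, 0.04, 0.08]
blocks = quantize(second_moment, block_size=4)
restored = dequantize(blocks)
print(restored)
```

In this sketch each value costs 4 bits plus its share of one scale per block. A smaller block size lets the scale follow local outliers more closely, at the cost of storing more scales.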
Related papers
- SageAttention2 Technical Report: Accurate 4 Bit Attention for Plug-and-play Inference Acceleration [22.551095978580147]
We propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside precision-enhancing techniques.
We analyze the quantization accuracy across timesteps and layers, then propose an adaptive quantization method to ensure the end-to-end metrics.
Experiments confirm that our approach incurs negligible end-to-end metrics loss across diverse models.
arXiv Detail & Related papers (2024-11-17T04:35:49Z)
- 4-bit Shampoo for Memory-Efficient Network Training [69.08646370812065]
Second-order optimizers are superior to first-order optimizers in both theory and practice.
Compressing 32-bit optimizer states to lower bitwidths has shown promise in reducing memory usage.
We propose the first 4-bit second-order optimizers, exemplified by 4-bit Shampoo, maintaining performance similar to that of 32-bit ones.
arXiv Detail & Related papers (2024-05-28T13:02:56Z)
- AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z)
- Optimizing data-flow in Binary Neural Networks [0.0]
We propose a novel training scheme that can increase data flow and parallelism in the BNN pipeline.
We also present an optimized implementation of the Binary Direct Convolution for ARM instruction sets.
Our experiments show a consistent improvement in inference speed (up to 1.91x and 2.73x compared to two state-of-the-art BNN frameworks) with no drop in accuracy for at least one full-precision model.
arXiv Detail & Related papers (2023-04-03T13:16:33Z)
- The case for 4-bit precision: k-bit Inference Scaling Laws [75.4335600212427]
Quantization methods reduce the number of bits required to represent each parameter in a model.
The final model size depends on both the number of parameters of the original model and the rate of compression.
We run more than 35,000 zero-shot experiments with 16-bit inputs and k-bit parameters to examine which quantization methods improve scaling for 3 to 8-bit precision.
arXiv Detail & Related papers (2022-12-19T18:48:33Z)
- Sub 8-Bit Quantization of Streaming Keyword Spotting Models for Embedded Chipsets [7.5195830365852085]
We propose a novel sub 8-bit quantization aware training algorithm for all components of a 250K parameter feedforward, streaming, state-free keyword spotting model.
We conduct large scale experiments, training on 26,000 hours of de-identified production, far-field and near-field audio data.
arXiv Detail & Related papers (2022-07-13T17:46:08Z)
- MCUNetV2: Memory-Efficient Patch-based Inference for Tiny Deep Learning [72.80896338009579]
We find that the memory bottleneck is due to the imbalanced memory distribution in convolutional neural network (CNN) designs.
We propose a generic patch-by-patch inference scheduling, which significantly cuts down the peak memory.
We automate the process with neural architecture search to jointly optimize the neural architecture and inference scheduling, leading to MCUNetV2.
arXiv Detail & Related papers (2021-10-28T17:58:45Z)
- 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
arXiv Detail & Related papers (2021-10-06T15:43:20Z)
- BitPruning: Learning Bitlengths for Aggressive and Accurate Quantization [57.14179747713731]
We introduce a training method for minimizing inference bitlength at any granularity while maintaining accuracy.
With ImageNet, the method produces an average per layer bitlength of 4.13, 3.76 and 4.36 bits.
arXiv Detail & Related papers (2020-02-08T04:58:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.