Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics
- URL: http://arxiv.org/abs/2505.00347v2
- Date: Mon, 09 Jun 2025 13:49:51 GMT
- Title: Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics
- Authors: Cong Xu, Wenbin Liang, Mo Yu, Anan Liu, Ke-Yue Zhang, Shunli Wang, Lizhuang Ma, Jianyong Wang, Jun Wang, Wei Zhang
- Abstract summary: Stateful optimizers (e.g., Adam) maintain auxiliary information of even 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
- Score: 64.62231094774211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapid scaling of models has led to prohibitively high training and fine-tuning costs. A major factor accounting for memory consumption is the widespread use of stateful optimizers (e.g., Adam), which maintain auxiliary information of even 2x the model size in order to achieve optimal convergence. We therefore present SOLO in this work to spawn a novel type of optimizer that requires an extremely light memory footprint. While previous efforts have achieved certain success in 8-bit or 4-bit cases, SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. This immense progress is due to the identification and resolution of two key challenges: the signal swamping problem in unsigned quantization that results in unchanged state dynamics, and the increased gradient variance in signed quantization that leads to incorrect descent directions. The theoretical analysis suggests a tailored logarithmic quantization for the former and a precision-specific momentum hyperparameter for the latter. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
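The abstract names two remedies: a logarithmic quantization tailored to the EMA state (to avoid signal swamping) and a precision-specific momentum hyperparameter (to control gradient variance under signed quantization). The sketch below is not the paper's implementation; the codebook range, bit-width, and beta value are illustrative assumptions meant only to show why a log-spaced codebook keeps tiny EMA increments from being rounded away at 2-3 bits.

```python
import numpy as np

def log_codebook(bits=3, v_min=1e-8, v_max=1.0):
    """Log-spaced codebook for a non-negative EMA state (range is an assumption)."""
    return np.logspace(np.log10(v_min), np.log10(v_max), num=2 ** bits)

def quantize(state, codebook):
    """Round each state entry to the nearest codebook level in log space."""
    idx = np.abs(np.log(state[:, None] + 1e-12) - np.log(codebook[None, :])).argmin(axis=1)
    return idx.astype(np.uint8), codebook[idx]

# Toy EMA update of Adam's second moment, stored as 3-bit indices between steps.
# A uniform codebook would round the tiny (1 - beta2) * grad**2 increment back to
# zero every step (signal swamping); log spacing keeps resolution near zero.
beta2 = 0.99                     # a precision-specific value would be chosen per bit-width
codebook = log_codebook(bits=3)
v_hat = np.zeros(4)              # dequantized state carried between steps
for step in range(100):
    grad = 0.01 * np.random.randn(4)
    v = beta2 * v_hat + (1 - beta2) * grad ** 2
    idx, v_hat = quantize(v, codebook)   # only the 3-bit codes need to be stored
```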
Related papers
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - Dual Precision Quantization for Efficient and Accurate Deep Neural Networks Inference [3.7687375904925484]
We propose a novel hardware-efficient quantization and inference scheme that exploits hardware advantages with minimal accuracy degradation. We develop a novel quantization algorithm, dubbed Dual Precision Quantization (DPQ), that leverages the unique structure of our scheme without introducing additional inference overhead.
arXiv Detail & Related papers (2025-05-20T17:26:12Z) - Automatic mixed precision for optimizing gained time with constrained loss mean-squared-error based on model partition to sequential sub-graphs [0.8999666725996975]
Mixed Precision (MP) mitigates the tradeoff by varying numerical precision across network layers. This study focuses on automatically selecting an optimal MP configuration within Post-Training Quantization (PTQ) for inference.
arXiv Detail & Related papers (2025-05-19T12:51:02Z) - FineQ: Software-Hardware Co-Design for Low-Bit Fine-Grained Mixed-Precision Quantization of LLMs [13.951330786310262]
FineQ is a software-hardware co-design for low-bit fine-grained mixed-precision quantization of large language models. It partitions the weights into finer-grained clusters and considers the distribution of outliers within these clusters. It achieves higher model accuracy compared to the SOTA mixed-precision quantization algorithm at a close average bit-width.
arXiv Detail & Related papers (2025-04-28T12:47:23Z) - Gradient Multi-Normalization for Stateless and Scalable LLM Training [16.037614012166063]
Training large language models (LLMs) typically relies on adaptive optimizers like Adam, which store additional state information to accelerate convergence but incur significant memory overhead. Recent efforts, such as SWAN (Ma et al., 2024), address this by eliminating the need for optimizer states while achieving performance comparable to Adam via a multi-step preprocessing procedure applied to instantaneous gradients. We introduce a novel framework for designing stateless optimizers that normalize gradients according to multiple norms. Experiments on pre-training LLaMA models with up to 1 billion parameters demonstrate a 3x speedup over Adam with significantly reduced memory requirements, outperforming other memory-efficient baselines.
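To make "normalizing gradients according to multiple norms" concrete, here is a purely illustrative stateless update. The two normalizations chosen (row-wise standardization, then a global Frobenius-norm rescaling) are stand-ins, not the specific steps used by this paper or by SWAN.

```python
import numpy as np

def multi_normalize(grad, eps=1e-8):
    """Apply two normalizations in sequence to an instantaneous gradient."""
    # First norm: standardize each row (zero mean, unit variance).
    g = grad - grad.mean(axis=1, keepdims=True)
    g = g / (g.std(axis=1, keepdims=True) + eps)
    # Second norm: rescale the whole matrix to unit Frobenius norm.
    return g / (np.linalg.norm(g) + eps)

# Stateless update: no momentum or second-moment buffers are kept between steps.
W = np.random.randn(8, 4)
grad = np.random.randn(8, 4)
W -= 1e-2 * multi_normalize(grad)
```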
arXiv Detail & Related papers (2025-02-10T18:09:53Z) - ParetoQ: Scaling Laws in Extremely Low-bit LLM Quantization [58.84018707089315]
We present a unified framework for rigorous comparisons across 1-bit, 1.58-bit, 2-bit, 3-bit, and 4-bit quantization settings. We show that ternary, 2-bit, and 3-bit quantization maintains comparable performance in the size-accuracy trade-off. Considering hardware constraints, 2-bit quantization offers promising potential for memory reduction and speedup.
arXiv Detail & Related papers (2025-02-04T18:59:26Z) - Oscillations Make Neural Networks Robust to Quantization [0.16385815610837165]
We challenge the common view that oscillations in Quantization-Aware Training (QAT) are merely undesirable artifacts caused by the Straight-Through Estimator (STE).
We propose a novel regularization method that induces oscillations to improve quantization.
arXiv Detail & Related papers (2025-02-01T16:39:58Z) - GAQAT: gradient-adaptive quantization-aware training for domain generalization [54.31450550793485]
We propose a novel Gradient-Adaptive Quantization-Aware Training (GAQAT) framework for DG.<n>Our approach begins by identifying the scale-gradient conflict problem in low-precision quantization.<n>Extensive experiments validate the effectiveness of the proposed GAQAT framework.
arXiv Detail & Related papers (2024-12-07T06:07:21Z) - Joint Pruning and Channel-wise Mixed-Precision Quantization for Efficient Deep Neural Networks [10.229120811024162]
Deep neural networks (DNNs) pose significant challenges for deployment on edge devices.
Common approaches to address this issue are pruning and mixed-precision quantization.
We propose a novel methodology to apply them jointly via a lightweight gradient-based search.
arXiv Detail & Related papers (2024-07-01T08:07:02Z) - Gradient-based Automatic Mixed Precision Quantization for Neural Networks On-Chip [0.9187138676564589]
We present High Granularity Quantization (HGQ), an innovative quantization-aware training method.
HGQ fine-tunes the per-weight and per-activation precision by making them optimizable through gradient descent.
This approach enables ultra-low latency and low power neural networks on hardware capable of performing arithmetic operations.
arXiv Detail & Related papers (2024-05-01T17:18:46Z) - DB-LLM: Accurate Dual-Binarization for Efficient LLMs [83.70686728471547]
Large language models (LLMs) have significantly advanced the field of natural language processing.
Existing ultra-low-bit quantization always causes severe accuracy drops.
We propose a novel Dual-Binarization method for LLMs, namely DB-LLM.
arXiv Detail & Related papers (2024-02-19T09:04:30Z) - AdaLomo: Low-memory Optimization with Adaptive Learning Rate [59.64965955386855]
We introduce low-memory optimization with adaptive learning rate (AdaLomo) for large language models.
AdaLomo achieves results on par with AdamW, while significantly reducing memory requirements, thereby lowering the hardware barrier to training large language models.
arXiv Detail & Related papers (2023-10-16T09:04:28Z) - On-Chip Hardware-Aware Quantization for Mixed Precision Neural Networks [52.97107229149988]
We propose an On-Chip Hardware-Aware Quantization framework, performing hardware-aware mixed-precision quantization on deployed edge devices.
For efficiency metrics, we built an On-Chip Quantization Aware pipeline, which allows the quantization process to perceive the actual hardware efficiency of the quantization operator.
For accuracy metrics, we propose Mask-Guided Quantization Estimation technology to effectively estimate the accuracy impact of operators in the on-chip scenario.
arXiv Detail & Related papers (2023-09-05T04:39:34Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precision, as low as 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
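The second idea can be sketched as a simple split of a weight matrix into a low-bit dense part plus a full-precision sparse outlier part. The threshold, shapes, and the uniform codebook below are illustrative assumptions rather than the SqueezeLLM implementation (which uses a sensitivity-based non-uniform codebook).

```python
import numpy as np

def dense_and_sparse_split(W, outlier_frac=0.005):
    """Separate a weight matrix into a small sparse outlier part and a dense remainder."""
    threshold = np.quantile(np.abs(W), 1.0 - outlier_frac)
    outlier_mask = np.abs(W) >= threshold
    sparse_part = np.where(outlier_mask, W, 0.0)   # kept in full precision (e.g., CSR)
    dense_part = np.where(outlier_mask, 0.0, W)    # quantized to a low bit-width
    return dense_part, sparse_part

def uniform_quantize(W, bits=3):
    """Plain uniform quantization of the dense remainder (stand-in for the paper's codebook)."""
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(W / scale).astype(np.int8), scale

W = np.random.randn(16, 16)
dense, sparse = dense_and_sparse_split(W)
q, scale = uniform_quantize(dense, bits=3)
W_hat = q.astype(np.float32) * scale + sparse      # dequantize, then add the outliers back
```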
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
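For intuition about why accumulator precision matters, here is a back-of-the-envelope worst-case bound on the accumulator width needed for a dot product of quantized operands. It is not the paper's algorithm, which instead trains the network so that a chosen, lower-precision accumulator is guaranteed not to overflow.

```python
import math

def min_accumulator_bits(k, input_bits, weight_bits):
    """Worst-case signed accumulator width for a k-term dot product of signed operands."""
    in_max = 2 ** (input_bits - 1)      # largest operand magnitude (e.g., 128 for int8)
    w_max = 2 ** (weight_bits - 1)
    max_sum = k * in_max * w_max        # largest possible magnitude of the accumulated sum
    return math.floor(math.log2(max_sum)) + 2   # sign bit included

print(min_accumulator_bits(k=512, input_bits=8, weight_bits=8))  # -> 25
```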
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - AMED: Automatic Mixed-Precision Quantization for Edge Devices [3.5223695602582614]
Quantized neural networks are well known for reducing the latency, power consumption, and model size without significant harm to the performance.
Mixed-precision quantization offers better utilization of customized hardware that supports arithmetic operations at different bitwidths.
arXiv Detail & Related papers (2022-05-30T21:23:22Z) - 8-bit Optimizers via Block-wise Quantization [57.25800395197516]
Stateful optimizers maintain statistics over time, e.g., the exponentially smoothed sum (SGD with momentum) or squared sum (Adam) of past gradient values.
This state can be used to accelerate optimization compared to plain gradient descent but uses memory that might otherwise be allocated to model parameters.
In this paper, we develop the first optimizers that use 8-bit statistics while maintaining the performance levels of using 32-bit optimizer states.
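As a rough illustration of block-wise quantization of an optimizer state (not the paper's code; the block size and the simple absmax linear 8-bit mapping here are simplifying assumptions):

```python
import numpy as np

def blockwise_quantize(state, block_size=2048):
    """Quantize a flat optimizer-state tensor in independent blocks.

    Each block stores its own absolute-maximum scale, so a single outlier only
    degrades the precision of its own block rather than of the whole tensor.
    """
    pad = (-len(state)) % block_size
    blocks = np.pad(state, (0, pad)).reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True) + 1e-12
    codes = np.round(blocks / scales * 127).astype(np.int8)   # 8-bit codes per block
    return codes, scales

def blockwise_dequantize(codes, scales, length):
    return (codes.astype(np.float32) / 127 * scales).reshape(-1)[:length]

m = 0.01 * np.random.randn(10_000)          # e.g., Adam's first-moment EMA
codes, scales = blockwise_quantize(m)
m_hat = blockwise_dequantize(codes, scales, len(m))
```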
arXiv Detail & Related papers (2021-10-06T15:43:20Z) - VS-Quant: Per-vector Scaled Quantization for Accurate Low-Precision Neural Network Inference [7.886868529510128]
Quantization maps floating-point weights and activations in a trained model to low-bitwidth integer values using scale factors.
Excessive quantization, reducing precision too aggressively, results in accuracy degradation.
Per-vector scale factors can be implemented with low-bitwidth integers when using a two-level quantization scheme.
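A minimal sketch of the two-level idea: each short vector of weights gets a low-bit integer scale, and a single coarse floating-point scale per tensor absorbs the overall magnitude. The vector length, bit-widths, and calibration rule below are illustrative assumptions, not the VS-Quant design.

```python
import numpy as np

def per_vector_quantize(W, vec_len=16, w_bits=4, s_bits=4):
    """Two-level scaled quantization: low-bit integer scales per vector, one float scale per tensor."""
    vecs = W.reshape(-1, vec_len)
    fine = np.abs(vecs).max(axis=1, keepdims=True) + 1e-12        # ideal per-vector scale
    coarse = fine.max() / (2 ** s_bits - 1)                       # per-tensor float scale
    s_int = np.clip(np.round(fine / coarse), 1, 2 ** s_bits - 1)  # low-bit integer scales
    qmax = 2 ** (w_bits - 1) - 1
    codes = np.clip(np.round(vecs / (s_int * coarse) * qmax), -qmax, qmax).astype(np.int8)
    return codes, s_int.astype(np.uint8), coarse, qmax

W = np.random.randn(64, 64)
codes, s_int, coarse, qmax = per_vector_quantize(W)
W_hat = (codes.astype(np.float32) / qmax * (s_int * coarse)).reshape(W.shape)
```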
arXiv Detail & Related papers (2021-02-08T19:56:04Z)