Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost
- URL: http://arxiv.org/abs/2602.03120v1
- Date: Tue, 03 Feb 2026 05:24:31 GMT
- Title: Quantized Evolution Strategies: High-precision Fine-tuning of Quantized LLMs at Low-precision Cost
- Authors: Yinggan Xu, Risto Miikkulainen, Xin Qiu
- Abstract summary: Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning method on arithmetic reasoning tasks.
- Score: 12.23633538816503
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Post-Training Quantization (PTQ) is essential for deploying Large Language Models (LLMs) on memory-constrained devices, yet it renders models static and difficult to fine-tune. Standard fine-tuning paradigms, including Reinforcement Learning (RL), fundamentally rely on backpropagation and high-precision weights to compute gradients. Thus they cannot be used on quantized models, where the parameter space is discrete and non-differentiable. While Evolution Strategies (ES) offer a backpropagation-free alternative, optimization of the quantized parameters can still fail due to vanishing or inaccurate gradients. This paper introduces Quantized Evolution Strategies (QES), an optimization paradigm that performs full-parameter fine-tuning directly in the quantized space. QES is based on two innovations: (1) it integrates accumulated error feedback to preserve high-precision gradient signals, and (2) it utilizes a stateless seed replay to reduce memory usage to low-precision inference levels. QES significantly outperforms the state-of-the-art zeroth-order fine-tuning method on arithmetic reasoning tasks, making direct fine-tuning for quantized models possible. It therefore opens up the possibility of scaling up LLMs entirely in the quantized space. The source code is available at https://github.com/dibbla/Quantized-Evolution-Strategies.
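The abstract names two mechanisms: accumulated error feedback, which keeps high-precision gradient signal from being rounded away, and stateless seed replay, which avoids storing perturbations. Below is a minimal, illustrative sketch of how an ES loop over int8 weights could combine the two. It is not the authors' implementation (see the linked repository for that); the quantization scheme, hyperparameters, and the `fitness` interface are assumptions made for illustration.

```python
import numpy as np

# Minimal, illustrative sketch of ES-style fine-tuning directly in an int8
# weight space, combining (1) accumulated error feedback and (2) stateless
# seed replay as described in the abstract. NOT the QES implementation: the
# quantization scheme, hyperparameters, and `fitness` interface are assumed.

def quantize(w_fp, scale):
    """Symmetric int8 quantization (assumed scheme)."""
    return np.clip(np.round(w_fp / scale), -127, 127).astype(np.int8)

def es_step(q_w, scale, err, fitness, step, pop_size=8, sigma=0.01, lr=0.1):
    """One ES update on quantized weights q_w; err is the carried-over error."""

    def perturbation(member):
        # Stateless seed replay: each perturbation is regenerated on demand
        # from a deterministic (step, member) seed, so it is never stored.
        rng = np.random.default_rng((step, member))
        return rng.standard_normal(q_w.shape).astype(np.float32)

    w_fp = q_w.astype(np.float32) * scale            # dequantized view
    rewards = np.empty(pop_size, dtype=np.float32)
    for i in range(pop_size):
        # Evaluate each perturbed candidate directly as an int8 model.
        candidate = quantize(w_fp + sigma * perturbation(i), scale)
        rewards[i] = fitness(candidate)

    # ES gradient estimate: reward-weighted sum of the replayed perturbations.
    rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    grad = sum(rewards[i] * perturbation(i) for i in range(pop_size))
    grad /= pop_size * sigma

    # Accumulated error feedback: the part of the high-precision update that
    # re-quantization rounds away is carried into the next step instead of
    # being discarded, so small gradient signals are not lost.
    update = lr * grad + err
    new_q = quantize(w_fp + update, scale)
    realized = (new_q.astype(np.float32) - q_w.astype(np.float32)) * scale
    return new_q, update - realized

# Toy usage with a dummy fitness on a 4x4 weight matrix.
scale = 0.05
q = quantize(np.random.randn(4, 4).astype(np.float32), scale)
err = np.zeros((4, 4), dtype=np.float32)
for t in range(3):
    q, err = es_step(q, scale, err,
                     fitness=lambda w: -float(np.mean((w.astype(np.float32) * scale) ** 2)),
                     step=t)
```

Seed replay is what keeps memory near inference levels in this sketch: only scalar seeds and rewards need to be retained, and each perturbation is regenerated when needed. The dense full-precision error buffer above is a simplification kept for clarity.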
Related papers
- End-to-End On-Device Quantization-Aware Training for LLMs at Inference Cost [53.25965863436039]
Quantization-aware training (QAT) provides a more principled solution, but its reliance on backpropagation incurs prohibitive memory costs. We propose ZeroQAT, a zeroth-order optimization-based QAT framework that supports both weight and activation quantization. Experiments show that ZeroQAT consistently outperforms representative PTQ and QAT baselines while requiring significantly less memory.
arXiv Detail & Related papers (2025-08-21T01:18:27Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- Boost Post-Training Quantization via Null Space Optimization for Large Language Models [66.73751310500656]
Existing post-training quantization methods for large language models (LLMs) have achieved remarkable success. The increasingly marginal performance gains suggest that existing quantization strategies are insufficient to support the development of more compressed models. We argue that the quantization error can be effectively alleviated by constraining the post-quantization weight to lie within the null space of input activations (a toy linear-algebra illustration of this constraint is sketched after this list).
arXiv Detail & Related papers (2025-05-21T14:07:07Z)
- Fine-tuning Quantized Neural Networks with Zeroth-order Optimization [21.0540879091664]
We propose Quantized Zeroth-order Optimization (QZO), a simple yet effective approach that perturbs the continuous quantization scale for gradient estimation (a generic sketch of this kind of zeroth-order estimate appears after this list). QZO can reduce the total memory cost by more than 18× for 4-bit LLMs, and enables fine-tuning Llama-2-13B within a single 24GB GPU.
arXiv Detail & Related papers (2025-05-19T17:55:15Z)
- Pushing the Limits of Low-Bit Optimizers: A Focus on EMA Dynamics [64.62231094774211]
Stateful optimizers (e.g., Adam) maintain auxiliary information that can be as large as 2x the model size in order to achieve optimal convergence. SOLO enables Adam-style optimizers to maintain quantized states with precision as low as 3 bits, or even 2 bits. SOLO can thus be seamlessly applied to Adam-style optimizers, leading to substantial memory savings with minimal accuracy loss.
arXiv Detail & Related papers (2025-05-01T06:47:45Z)
- Training Deep Learning Models with Norm-Constrained LMOs [56.00317694850397]
We propose a new family of algorithms that uses the linear minimization oracle (LMO) to adapt to the geometry of the problem. We demonstrate significant speedups on nanoGPT training using our algorithm, Scion, without any reliance on Adam.
arXiv Detail & Related papers (2025-02-11T13:10:34Z)
- QSpec: Speculative Decoding with Complementary Quantization Schemes [53.960146187821685]
Quantization is widely adopted to accelerate inference and reduce memory consumption in large language models (LLMs). We propose QSpec, a novel quantization paradigm that decouples efficiency from quality. QSpec reuses both weights and KV cache across stages, enabling near-zero-cost switching without retraining or auxiliary models.
arXiv Detail & Related papers (2024-10-15T05:57:51Z)
- SLiM: One-shot Quantization and Sparsity with Low-rank Approximation for LLM Weight Compression [7.6131620435684875]
SLiM is a new one-shot compression framework that holistically integrates hardware-friendly quantization, sparsity, and low-rank approximation. SLiM improves model accuracy by up to 5.66% (LLaMA-2-7B) for 2:4 sparsity with 4-bit weight quantization, outperforming prior methods.
arXiv Detail & Related papers (2024-10-12T18:36:07Z)
- OAC: Output-adaptive Calibration for Accurate Post-training Quantization [28.67781845829386]
Post-training Quantization (PTQ) techniques have been developed to compress Large Language Models (LLMs). Most PTQ approaches formulate the quantization error based on a layer-wise Euclidean loss, ignoring the model output. We propose Output-adaptive Calibration (OAC) to incorporate the model output in the calibration process.
arXiv Detail & Related papers (2024-05-23T20:01:17Z)
- QFT: Quantized Full-parameter Tuning of LLMs with Affordable Resources [35.16907522675046]
Large Language Models (LLMs) have showcased remarkable impacts across a wide spectrum of natural language processing tasks. Fine-tuning these pretrained models on downstream datasets provides significant performance gains. However, this process typically requires a large number of expensive, high-end GPUs. We propose QFT, a Quantized Full-parameter Tuning framework that quantizes and stores all training states.
arXiv Detail & Related papers (2023-10-11T02:47:40Z)
- NUPES: Non-Uniform Post-Training Quantization via Power Exponent Search [7.971065005161565]
Quantization is a technique to convert floating point representations to low bit-width fixed point representations.
We show how to learn new quantized weights over the entire quantized space.
We show the ability of the method to achieve state-of-the-art compression rates in both, data-free and data-driven configurations.
arXiv Detail & Related papers (2023-08-10T14:19:58Z)
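To make the null-space constraint from the "Boost Post-Training Quantization via Null Space Optimization" entry above concrete, the toy sketch below shows that any weight adjustment built from a basis of the null space of the calibration activations leaves the layer's outputs on those activations unchanged. It is a generic linear-algebra illustration with assumed shapes, not the paper's algorithm.

```python
import numpy as np

# Toy illustration of the null-space constraint: a weight adjustment D whose
# rows lie in the null space of the calibration activations X satisfies
# X @ D.T == 0, so the layer output X @ W.T is unchanged on that data.
# Generic construction with assumed shapes; not the paper's algorithm.

rng = np.random.default_rng(0)
X = rng.standard_normal((64, 256))    # calibration activations (tokens x in_features)
W = rng.standard_normal((512, 256))   # layer weight (out_features x in_features)

# Orthonormal basis of the (right) null space of X via SVD.
_, s, Vt = np.linalg.svd(X, full_matrices=True)
rank = int(np.sum(s > 1e-10))
N = Vt[rank:].T                       # (in_features, in_features - rank)

# Any adjustment built from this basis cannot change the calibration outputs.
D = (N @ rng.standard_normal((N.shape[1], 512))).T   # (out_features, in_features)
print(np.max(np.abs(X @ (W + D).T - X @ W.T)))        # numerically ~0
```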
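Similarly, the "Fine-tuning Quantized Neural Networks with Zeroth-order Optimization" (QZO) entry describes estimating gradients by perturbing the continuous quantization scale rather than the discrete weights. A two-point zeroth-order estimate along those lines might look like the sketch below; the `loss` interface, step sizes, and symmetric int8 scheme are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Sketch of a two-point zeroth-order gradient estimate on the continuous
# per-tensor quantization scale: the discrete int weights stay fixed, only
# the scale is perturbed, and no backpropagation is needed. The `loss`
# interface and hyperparameters are assumptions, not the paper's code.

def dequant(q, scale):
    return q.astype(np.float32) * scale

def zo_scale_step(q_w, scale, loss, mu=1e-3, lr=1e-2, seed=0):
    u = np.random.default_rng(seed).standard_normal()   # random direction
    # Two forward evaluations give a finite-difference estimate along u.
    l_plus = loss(dequant(q_w, scale + mu * u))
    l_minus = loss(dequant(q_w, scale - mu * u))
    grad_scale = (l_plus - l_minus) / (2.0 * mu) * u
    return scale - lr * grad_scale                       # update the scale only

# Toy usage: int8 weights with a quadratic surrogate loss on dequantized values.
q = np.array([[100, -50], [25, -75]], dtype=np.int8)
new_scale = zo_scale_step(q, scale=0.02, loss=lambda w: float(np.mean(w ** 2)))
```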
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides (including all of the above) and is not responsible for any consequences of its use.