LOTION: Smoothing the Optimization Landscape for Quantized Training
- URL: http://arxiv.org/abs/2510.08757v1
- Date: Thu, 09 Oct 2025 19:16:46 GMT
- Title: LOTION: Smoothing the Optimization Landscape for Quantized Training
- Authors: Mujin Kwun, Depen Morwani, Chloe Huangyuan Su, Stephanie Gil, Nikhil Anand, Sham Kakade
- Abstract summary: We introduce LOTION, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimizing neural networks for quantized objectives is fundamentally challenging because the quantizer is piecewise constant, yielding zero gradients everywhere except at quantization thresholds, where the derivative is undefined. Most existing methods sidestep this issue by relaxing gradient computations with techniques like the Straight-Through Estimator (STE) and provide no guarantees of convergence. In this work, taking inspiration from Nesterov smoothing, we approximate the quantized loss surface with a continuous loss surface. In particular, we introduce LOTION, Low-precision Optimization via sTochastic-noIse smOothiNg, a principled smoothing framework that replaces the raw quantized loss with its expectation under unbiased randomized-rounding noise. In this framework, standard optimizers are guaranteed to converge to a local minimum of the loss surface. Moreover, when using noise derived from stochastic rounding, we show that the global minima of the original quantized loss are preserved. We empirically demonstrate that this method outperforms standard QAT on synthetic testbeds and on 150M- and 300M-parameter language models.
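To make the smoothing concrete, here is a minimal NumPy sketch of the idea (not the authors' implementation; the grid spacing `delta`, the toy `loss_fn`, and `n_samples` are illustrative assumptions). Stochastic rounding is unbiased, and averaging the loss over its rounding randomness turns the piecewise-constant quantized loss into a continuous surrogate that ordinary optimizers can follow.

```python
import numpy as np

def stochastic_round(w, delta, rng):
    """Unbiased stochastic rounding onto a uniform grid with spacing delta.

    Each weight rounds up with probability equal to its fractional
    position inside its grid cell, so E[stochastic_round(w)] = w.
    """
    lo = np.floor(w / delta) * delta
    p = (w - lo) / delta
    return lo + delta * (rng.random(w.shape) < p)

def smoothed_loss(loss_fn, w, delta, rng, n_samples=256):
    """Monte Carlo estimate of E[loss(SR(w))], the smoothed surrogate.

    loss(round(w)) is piecewise constant in w, but this expectation
    varies continuously with w, giving optimizers a nonzero signal.
    """
    return np.mean([loss_fn(stochastic_round(w, delta, rng))
                    for _ in range(n_samples)])

# Toy usage: a quadratic loss over a coarse grid.
rng = np.random.default_rng(0)
w = np.array([0.37, -0.81])
loss_fn = lambda v: float(np.sum((v - 0.5) ** 2))
print(smoothed_loss(loss_fn, w, delta=0.25, rng=rng))
```

For a single scalar weight, the expectation reduces to linear interpolation of the loss between the two neighboring grid points, which makes the continuity of the surrogate easy to see.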
Related papers
- Derivative-Free Sequential Quadratic Programming for Equality-Constrained Stochastic Optimization
We consider solving nonlinear optimization problems with a stochastic objective and deterministic equality constraints. We propose a Derivative-Free Sequential Quadratic Programming (DF-SSQP) method. Under standard assumptions, we establish global almost-sure convergence of the proposed DF-SSQP method.
arXiv Detail & Related papers (2025-10-25T23:51:20Z)
- On the Optimal Construction of Unbiased Gradient Estimators for Zeroth-Order Optimization
A potential limitation of existing methods is the bias inherent in most perturbation-based estimators unless the perturbation stepsize is driven to zero. We propose a novel family of unbiased gradient estimators that eliminates this bias while maintaining favorable variance.
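For context, the standard randomized two-point estimator that such methods build on can be sketched as below; note it is unbiased only for a Gaussian-smoothed version of the objective, which is exactly the residual bias unbiased constructions aim to remove (the paper's own estimator is not reproduced here).

```python
import numpy as np

def two_point_gradient(f, x, mu, rng):
    """Randomized two-point (SPSA-style) gradient estimator.

    Returns (f(x + mu*u) - f(x - mu*u)) / (2*mu) * u for a Gaussian
    direction u. This is an unbiased estimate of the gradient of the
    smoothed objective E_u[f(x + mu*u)], not of f itself; the O(mu^2)
    gap to the true gradient is the bias discussed above.
    """
    u = rng.standard_normal(x.shape)
    return (f(x + mu * u) - f(x - mu * u)) / (2.0 * mu) * u
```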
arXiv Detail & Related papers (2025-10-22T18:25:43Z)
- Closed-Form Last Layer Optimization
Under a squared loss, the optimal solution for the linear last-layer weights is known in closed form. We show this is equivalent to alternating between gradient descent steps on the backbone and closed-form updates on the last layer.
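A minimal sketch of that alternation under an assumed squared loss (the ridge term and the commented training loop are illustrative, not from the paper):

```python
import numpy as np

def closed_form_last_layer(H, Y, ridge=1e-6):
    """Solve min_W ||H @ W - Y||^2 + ridge * ||W||^2 in closed form.

    H: (n, d) backbone features; Y: (n, k) regression targets.
    The small ridge term keeps the normal equations well conditioned.
    """
    d = H.shape[1]
    return np.linalg.solve(H.T @ H + ridge * np.eye(d), H.T @ Y)

# Schematic alternation (backbone(), backbone_grad_step() hypothetical):
# for step in range(num_steps):
#     H = backbone(X)                   # forward pass through backbone
#     W = closed_form_last_layer(H, Y)  # exact last-layer update
#     backbone_grad_step(X, Y, W)       # gradient step on backbone only
```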
arXiv Detail & Related papers (2025-10-06T09:14:39Z)
- Progressive Element-wise Gradient Estimation for Neural Network Quantization
Quantization-Aware Training (QAT) methods rely on the Straight-Through Estimator (STE) to address the non-differentiability of discretization functions. We propose Progressive Element-wise Gradient Estimation (PEGE) to address discretization errors between continuous and quantized values. PEGE consistently outperforms existing backpropagation methods and enables low-precision models to match or even outperform the accuracy of their full-precision counterparts.
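For reference, the plain STE baseline that element-wise schemes such as PEGE refine fits in a few lines of PyTorch; this is the generic uniform-quantizer trick, not PEGE itself.

```python
import torch

def quantize_ste(w: torch.Tensor, delta: float) -> torch.Tensor:
    """Uniform quantization with a straight-through gradient.

    Forward: round w to the nearest multiple of delta.
    Backward: the detach() trick hides the rounding from autograd,
    so gradients flow through as if quantization were the identity.
    """
    w_q = torch.round(w / delta) * delta
    return w + (w_q - w).detach()
```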
arXiv Detail & Related papers (2025-08-27T15:59:36Z)
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z)
- FIMA-Q: Post-Training Quantization for Vision Transformers by Fisher Information Matrix Approximation
Post-training quantization (PTQ) has stood out as a cost-effective and promising model compression paradigm in recent years. Current PTQ methods for Vision Transformers (ViTs) still suffer from significant accuracy degradation, especially under low-bit quantization.
arXiv Detail & Related papers (2025-06-13T07:57:38Z)
- Gradient Normalization Provably Benefits Nonconvex SGD under Heavy-Tailed Noise
We investigate the roles of gradient normalization and clipping in ensuring the convergence of Stochastic Gradient Descent (SGD) under heavy-tailed noise.
Our work provides the first theoretical evidence demonstrating the benefits of gradient normalization in SGD under heavy-tailed noise.
We introduce an accelerated SGD variant incorporating gradient normalization and clipping, further enhancing convergence rates under heavy-tailed noise.
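Schematically, the update under analysis looks like the sketch below (hyperparameters are illustrative, and the accelerated variant is not reproduced):

```python
import numpy as np

def clipped_sgd_step(w, grad, lr, clip):
    """One SGD step with gradient norm clipping.

    The stochastic gradient is rescaled whenever its norm exceeds
    `clip`, bounding the damage any single heavy-tailed sample can do;
    full normalization (grad / ||grad||) is the limiting case.
    """
    norm = np.linalg.norm(grad) + 1e-12
    return w - lr * grad * min(1.0, clip / norm)
```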
arXiv Detail & Related papers (2024-10-21T22:40:42Z)
- Robust Stochastic Optimization via Gradient Quantile Clipping
We introduce a quantile clipping strategy for Stochastic Gradient Descent (SGD) in which the clipping threshold is set as a quantile of the gradient norms, making the iterates robust to heavy-tailed gradient outliers. We propose an implementation of the algorithm using Huberiles.
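A rough sketch of the idea, with a fixed-size rolling window standing in for the paper's quantile tracking (the window size and quantile level are assumptions):

```python
import numpy as np
from collections import deque

class QuantileClipper:
    """Clip gradients at a running quantile of recent gradient norms."""

    def __init__(self, q=0.9, window=256):
        self.q = q
        self.norms = deque(maxlen=window)  # rolling history of norms

    def clip(self, grad):
        norm = float(np.linalg.norm(grad))
        self.norms.append(norm)
        threshold = float(np.quantile(self.norms, self.q))
        return grad * min(1.0, threshold / (norm + 1e-12))
```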
arXiv Detail & Related papers (2023-09-29T15:24:48Z)
- Randomized semi-quantum matrix processing
We present a hybrid quantum-classical framework for simulating generic matrix functions.
The method is based on randomization over the Chebyshev approximation of the target function.
We prove advantages in average circuit depths, including quadratic speed-ups on costly simulation parameters.
arXiv Detail & Related papers (2023-07-21T18:00:28Z)
- Error-aware Quantization through Noise Tempering
Quantization-aware training (QAT) optimizes model parameters with respect to the end task while simulating quantization error.
In this work, we incorporate exponentially decaying quantization-error-aware noise together with a learnable scale of task loss gradient to approximate the effect of a quantization operator.
Our method obtains state-of-the-art top-1 classification accuracy for uniform (non-mixed-precision) quantization, outperforming previous methods by 0.5-1.2% absolute.
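A loose sketch of the noise-tempering idea (the exponential decay schedule and uniform noise model are assumptions here, and the paper's learnable scale on the task-loss gradient is omitted): perturb the quantized weights with noise shaped like the quantization error and anneal it away over training.

```python
import numpy as np

def tempered_quantize(w, delta, step, decay=1e-3, rng=None):
    """Quantize w, then add exponentially decaying error-shaped noise.

    Early in training the weights see noise comparable in scale to the
    quantization error; as `step` grows the noise is annealed away and
    the result approaches the hard-quantized weights.
    """
    rng = rng or np.random.default_rng()
    w_q = np.round(w / delta) * delta              # hard quantization
    scale = np.exp(-decay * step)                  # tempering schedule
    noise = rng.uniform(-delta / 2, delta / 2, size=w.shape)
    return w_q + scale * noise
```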
arXiv Detail & Related papers (2022-12-11T20:37:50Z)
- STORM+: Fully Adaptive SGD with Momentum for Nonconvex Optimization
We investigate stochastic non-convex optimization problems where the objective is an expectation over smooth loss functions.
Our work builds on the STORM algorithm, in conjunction with a novel approach to adaptively set the learning rate and momentum parameters.
arXiv Detail & Related papers (2021-11-01T15:43:36Z)
- High Probability Complexity Bounds for Non-Smooth Stochastic Optimization with Heavy-Tailed Noise
It is essential to theoretically guarantee that algorithms provide a small objective residual with high probability.
Existing methods for non-smooth stochastic convex optimization have complexity bounds with undesirable dependence on the confidence level.
We propose novel stepsize rules for two methods with gradient clipping.
arXiv Detail & Related papers (2021-06-10T17:54:21Z)