ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
- URL: http://arxiv.org/abs/2509.09679v2
- Date: Thu, 25 Sep 2025 15:12:18 GMT
- Title: ButterflyQuant: Ultra-low-bit LLM Quantization through Learnable Orthogonal Butterfly Transforms
- Authors: Bingxin Xu, Zhen Dong, Oussama Elachqar, Yuzhang Shang
- Abstract summary: Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. We propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles.
- Score: 21.010238822100135
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models require massive memory footprints, severely limiting deployment on consumer hardware. Quantization reduces memory through lower numerical precision, but extreme 2-bit quantization suffers from catastrophic performance loss due to outliers in activations. Rotation-based methods such as QuIP and QuaRot apply orthogonal transforms to eliminate outliers before quantization, using computational invariance: $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ for orthogonal $\mathbf{Q}$. However, these methods use fixed transforms--Hadamard matrices achieving optimal worst-case coherence $\mu = 1/\sqrt{n}$--that cannot adapt to specific weight distributions. We identify that different transformer layers exhibit distinct outlier patterns, motivating layer-adaptive rotations rather than one-size-fits-all approaches. In this work, we propose ButterflyQuant, which replaces Hadamard rotations with learnable butterfly transforms parameterized by continuous Givens rotation angles. Unlike Hadamard's discrete $\{+1, -1\}$ entries that are non-differentiable and thus prohibit gradient-based learning, butterfly transforms' continuous parameterization enables smooth optimization while guaranteeing orthogonality by construction. This orthogonal constraint ensures theoretical guarantees in outlier suppression while achieving $O(n \log n)$ computational complexity with only $\frac{n \log n}{2}$ learnable parameters. We further introduce a uniformity regularization on post-transformation activations to promote smoother distributions amenable to quantization. Learning requires only 128 calibration samples and converges in minutes on a single GPU--a negligible one-time cost. For LLaMA-2-7B with 2-bit quantization, ButterflyQuant achieves 15.4 perplexity versus 37.3 for QuIP. \href{https://github.com/42Shawn/Butterflyquant-llm}{Codes} are available.
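The two ingredients of the abstract -- the computational invariance $\mathbf{y} = \mathbf{Wx} = (\mathbf{WQ}^T)(\mathbf{Qx})$ and a butterfly transform built from $\frac{n \log n}{2}$ Givens rotation angles that is orthogonal by construction and costs $O(n \log n)$ -- can be sketched in a few lines of NumPy. This is an illustrative reconstruction from the abstract's description, not the authors' released code; the function and variable names are our own.

```python
import numpy as np

def butterfly_transform(x, thetas):
    """Apply an orthogonal butterfly transform to a vector x of length n
    (n a power of two).

    thetas: list of log2(n) arrays, each of shape (n // 2,), holding one
    Givens rotation angle per butterfly pair per stage -- (n log2 n) / 2
    angles in total. Each stage touches every element exactly once, so
    the total cost is O(n log n).
    """
    x = x.astype(float).copy()
    n = x.shape[0]
    stages = int(np.log2(n))
    for s in range(stages):
        stride = 1 << s                      # pair distance doubles each stage
        angles = thetas[s]
        k = 0
        for start in range(0, n, 2 * stride):
            for i in range(start, start + stride):
                j = i + stride
                c, t = np.cos(angles[k]), np.sin(angles[k])
                xi, xj = x[i], x[j]
                x[i] = c * xi - t * xj       # 2x2 Givens rotation on (i, j)
                x[j] = t * xi + c * xj
                k += 1
    return x

rng = np.random.default_rng(0)
n = 8
thetas = [rng.uniform(-np.pi, np.pi, n // 2) for _ in range(int(np.log2(n)))]

# Materialize Q column by column to check orthogonality; in practice the
# transform is only ever applied implicitly, never formed as a dense matrix.
Q = np.stack([butterfly_transform(e, thetas) for e in np.eye(n)], axis=1)
assert np.allclose(Q @ Q.T, np.eye(n))       # orthogonal by construction

# Computational invariance: Wx = (W Q^T)(Q x) for any orthogonal Q.
W = rng.standard_normal((n, n))
x = rng.standard_normal(n)
assert np.allclose(W @ x, (W @ Q.T) @ butterfly_transform(x, thetas))
```

Because each stage is a block-diagonal matrix of $2\times 2$ rotations, every choice of angles yields an exactly orthogonal $\mathbf{Q}$, which is what allows the angles to be learned by gradient descent (e.g., on the 128 calibration samples mentioned above) without any orthogonality penalty.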
Related papers
- Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization [14.136955342888987]
We reformulate the AdaGrad update and decompose it into a variance adaptation term and a scale-invariant term. This produces $\textbf{DeVA}$ ($\textbf{V}$ariance $\textbf{A}$daptation), a framework that bridges between vector-based variance adaptation and matrix spectral optimization.
arXiv Detail & Related papers (2026-02-06T17:06:42Z) - ButterflyMoE: Sub-Linear Ternary Experts via Structured Butterfly Orbits [0.0]
Current compression methods like quantization, pruning and low-rank factorization reduce constant factors but leave the scaling bottleneck unresolved. We introduce ButterflyMoE, a method that treats experts not as independent weight matrices but as geometric reorientations of a unified quantized substrate. Across language modeling benchmarks, ButterflyMoE achieves 150$\times$ memory reduction at 256 experts with negligible accuracy loss.
arXiv Detail & Related papers (2026-01-20T03:39:33Z) - Robust Layerwise Scaling Rules by Proper Weight Decay Tuning [50.11170157029911]
In modern scale-invariant architectures, training quickly enters a decay-governed steady state. We introduce a weight-decay scaling rule for AdamW that preserves sublayer gain across widths. Our results extend $\mu$P beyond the near-init regime by explicitly controlling the steady-state scales set by parameters.
arXiv Detail & Related papers (2025-10-17T02:58:35Z) - FFT-Accelerated Auxiliary Variable MCMC for Fermionic Lattice Models: A Determinant-Free Approach with $O(N\log N)$ Complexity [52.3171766248012]
We introduce a Markov Chain Monte Carlo (MCMC) algorithm that dramatically accelerates the simulation of quantum many-body systems. We validate our algorithm on benchmark quantum physics problems, accurately reproducing known theoretical results. Our work provides a powerful tool for large-scale probabilistic inference and opens avenues for physics-inspired generative models.
arXiv Detail & Related papers (2025-10-13T07:57:21Z) - FlatQuant: Flatness Matters for LLM Quantization [58.28221892035609]
We propose FlatQuant, a new post-training quantization approach that enhances the flatness of weights and activations. Our approach identifies optimal affine transformations for each linear layer, calibrated in hours via a lightweight objective. It achieves less than 1% accuracy drop for W4A4 quantization on the LLaMA-3-70B model, surpassing SpinQuant by 7.5%.
arXiv Detail & Related papers (2024-10-12T08:10:28Z) - DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs [40.48697728884967]
Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations.
Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes.
We introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers.
arXiv Detail & Related papers (2024-06-03T18:27:44Z) - Projection by Convolution: Optimal Sample Complexity for Reinforcement Learning in Continuous-Space MDPs [56.237917407785545]
We consider the problem of learning an $\varepsilon$-optimal policy in a general class of continuous-space Markov decision processes (MDPs) having smooth Bellman operators.
Key to our solution is a novel projection technique based on ideas from harmonic analysis.
Our result bridges the gap between two popular but conflicting perspectives on continuous-space MDPs.
arXiv Detail & Related papers (2024-05-10T09:58:47Z) - Adaptive Stochastic Variance Reduction for Non-convex Finite-Sum Minimization [52.25843977506935]
We propose an adaptive variance-reduction method, called AdaSpider, for $L$-smooth, non-convex functions with a finite-sum structure.
In doing so, we are able to compute an $\epsilon$-stationary point with $\tilde{O}\left(n + \sqrt{n}/\epsilon^4\right)$ stochastic oracle calls.
arXiv Detail & Related papers (2022-11-03T14:41:46Z) - Statistical Inference of Constrained Stochastic Optimization via Sketched Sequential Quadratic Programming [53.63469275932989]
We consider online statistical inference of constrained nonlinear optimization problems. We apply the Stochastic Sequential Quadratic Programming (StoSQP) method to solve these problems.
arXiv Detail & Related papers (2022-05-27T00:34:03Z) - Computationally Efficient Horizon-Free Reinforcement Learning for Linear Mixture MDPs [111.75736569611159]
We propose the first computationally efficient horizon-free algorithm for linear mixture MDPs.
Our algorithm adapts a weighted least square estimator for the unknown transitional dynamic.
This also improves upon the best-known algorithms in this setting when the $\sigma_k^2$'s are known.
arXiv Detail & Related papers (2022-05-23T17:59:18Z) - Truncated phase-based quantum arithmetic: error propagation and resource reduction [0.0]
We present a modification of the Draper quantum Fourier adder which eliminates small-angle rotations to highly coarse levels.
We show that the inherited loss of fidelity is directly given by the rate of carry and borrow bits in the subroutine.
Surprisingly, we find that each of the $7\times 10^7$ quantum Fourier transforms may be truncated down to $\pi/64$, with additive rotations left only slightly finer.
arXiv Detail & Related papers (2021-10-01T05:19:03Z) - Learning with Smooth Hinge Losses [15.288802707471792]
We introduce two smooth Hinge losses $\psi_G(\alpha;\sigma)$ and $\psi_M(\alpha;\sigma)$ which are infinitely differentiable and converge to the Hinge loss uniformly in $\alpha$.
Experiments in text classification tasks show that the proposed SSVMs are effective in real-world applications.
arXiv Detail & Related papers (2021-02-27T14:50:02Z) - Differentially Quantized Gradient Methods [53.3186247068836]
We show that Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, \rho_n 2^{-R}\}$.
No algorithm within a certain class can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$.
arXiv Detail & Related papers (2020-02-06T20:40:53Z)