Related papers: DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization

URL: http://arxiv.org/abs/2511.04063v1
Date: Thu, 06 Nov 2025 05:05:24 GMT
Title: DartQuant: Efficient Rotational Distribution Calibration for LLM Quantization
Authors: Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, Jian Cheng,
Abstract summary: Quantization plays a crucial role in accelerating the inference of large-scale models.<n>DartQuant is an efficient distribution-aware rotational calibration method.<n>It is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU.
Score: 30.092264336180644
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Quantization plays a crucial role in accelerating the inference of large-scale models, and rotational matrices have been shown to effectively improve quantization performance by smoothing outliers. However, end-to-end fine-tuning of rotational optimization algorithms incurs high computational costs and is prone to overfitting. To address this challenge, we propose an efficient distribution-aware rotational calibration method, DartQuant, which reduces the complexity of rotational optimization by constraining the distribution of the activations after rotation. This approach also effectively reduces reliance on task-specific losses, thereby mitigating the risk of overfitting. Additionally, we introduce the QR-Orth optimization scheme, which replaces expensive alternating optimization with a more efficient solution. In a variety of model quantization experiments, DartQuant demonstrates superior performance. Compared to existing methods, it achieves 47$\times$ acceleration and 10$\times$ memory savings for rotational optimization on a 70B model. Furthermore, it is the first to successfully complete rotational calibration for a 70B model on a single 3090 GPU, making quantization of large language models feasible in resource-constrained environments. Code is available at https://github.com/CAS-CLab/DartQuant.git.

Related papers

BASE-Q: Bias and Asymmetric Scaling Enhanced Rotational Quantization for Large Language Models [25.783531928577233]
BASE-Q is a simple yet powerful approach that combines bias correction and asymmetric scaling to reduce rounding and clipping errors.<n>Experiments demonstrate the effectiveness of BASE-Q, narrowing the accuracy gap to full-precision models by 50.5%, 42.9%, and 29.2% compared to QuaRot, SpinQuant, and OSTQuant, respectively.
arXiv Detail & Related papers (2025-05-26T14:22:21Z)
SPAP: Structured Pruning via Alternating Optimization and Penalty Methods [2.1388885579612804]
Large language models (LLMs) are often constrained by their substantial computational and memory demands.<n>We propose SPAP (Structured Pruning via Alternating Optimization and Penalty Methods), a novel and efficient structured pruning framework for LLMs grounded in optimization theory.<n>Our work offers a practical, optimization-driven solution for pruning LLMs while preserving model performance.
arXiv Detail & Related papers (2025-05-06T09:47:53Z)
Optimal Stepsize for Diffusion Sampling [14.849487881523041]
Diffusion models achieve remarkable generation quality but suffer from computational intensive sampling due to suboptimal step discretization.<n>This paper proposes Optimal Stepsize Distillation, a dynamic programming framework that extracts theoretically optimal schedules by distilling knowledge from reference trajectories.<n> Experiments show 10x accelerated text-to-image generation while preserving 99.4% performance on GenEval.
arXiv Detail & Related papers (2025-03-27T17:59:46Z)
RoSTE: An Efficient Quantization-Aware Supervised Fine-Tuning Approach for Large Language Models [53.571195477043496]
We propose an algorithm named Rotated Straight-Through-Estimator (RoSTE)<n>RoSTE combines quantization-aware supervised fine-tuning (QA-SFT) with an adaptive rotation strategy to reduce activation outliers.<n>Our findings reveal that the prediction error is directly proportional to the quantization error of the converged weights, which can be effectively managed through an optimized rotation configuration.
arXiv Detail & Related papers (2025-02-13T06:44:33Z)
Irrational Complex Rotations Empower Low-bit Optimizers [102.56966165088963]
We propose a novel state compression algorithm, namely $pi$-Quant, for memory-efficient training.<n>We show that it can reduce the bit-width of parameters to 3.32-bit, achieving a 75% reduction in parameter scale and a 40% decrease in GPU memory usage.
arXiv Detail & Related papers (2025-01-22T14:17:57Z)
Sample-efficient Bayesian Optimisation Using Known Invariances [56.34916328814857]
We show that vanilla and constrained BO algorithms are inefficient when optimising invariant objectives. We derive a bound on the maximum information gain of these invariant kernels. We use our method to design a current drive system for a nuclear fusion reactor, finding a high-performance solution.
arXiv Detail & Related papers (2024-10-22T12:51:46Z)
Learning to Optimize Permutation Flow Shop Scheduling via Graph-based Imitation Learning [70.65666982566655]
Permutation flow shop scheduling (PFSS) is widely used in manufacturing systems. We propose to train the model via expert-driven imitation learning, which accelerates convergence more stably and accurately. Our model's network parameters are reduced to only 37% of theirs, and the solution gap of our model towards the expert solutions decreases from 6.8% to 1.3% on average.
arXiv Detail & Related papers (2022-10-31T09:46:26Z)
SHINE: SHaring the INverse Estimate from the forward pass for bi-level optimization and implicit models [15.541264326378366]
In recent years, implicit deep learning has emerged as a method to increase the depth of deep neural networks. The training is performed as a bi-level problem, and its computational complexity is partially driven by the iterative inversion of a huge Jacobian matrix. We propose a novel strategy to tackle this computational bottleneck from which many bi-level problems suffer.
arXiv Detail & Related papers (2021-06-01T15:07:34Z)
Extrapolation for Large-batch Training in Deep Learning [72.61259487233214]
We show that a host of variations can be covered in a unified framework that we propose. We prove the convergence of this novel scheme and rigorously evaluate its empirical performance on ResNet, LSTM, and Transformer.
arXiv Detail & Related papers (2020-06-10T08:22:41Z)
Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with trained Gaussian processes trained on few data points. The approach also leads to significantly smaller and computationally cheaper sub solver for lower bounding. In total, we reduce time convergence by orders of orders of the proposed method.
arXiv Detail & Related papers (2020-05-21T20:59:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.