To Compress or Not? Pushing the Frontier of Lossless GenAI Model Weights Compression with Exponent Concentration
- URL: http://arxiv.org/abs/2510.02676v1
- Date: Fri, 03 Oct 2025 02:22:13 GMT
- Authors: Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, Anshumali Shrivastava
- Abstract summary: Low-precision floating-point formats provide stability, memory savings, and hardware efficiency without dequantization overhead. We propose Exponent-Concentrated FP8 (ECF8), a compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The scaling of Generative AI (GenAI) models into the hundreds of billions of parameters makes low-precision computation indispensable for efficient deployment. We argue that the fundamental solution lies in developing low-precision floating-point formats, which inherently provide numerical stability, memory savings, and hardware efficiency without dequantization overhead. In this paper, we present a theoretical and empirical study of an exponent concentration phenomenon in GenAI weights: exponents consistently exhibit low entropy across architectures and modalities. We show that this arises naturally from $\alpha$-stable distributions induced by stochastic gradient descent, and we prove tight bounds on the entropy of exponents. Our analysis establishes a theoretical compression limit near FP4.67, which motivates the design of a practical FP8 format. Building on these insights, we propose Exponent-Concentrated FP8 (ECF8), a lossless compression framework with entropy-aware encoding and GPU-optimized decoding. Experiments on LLMs and DiTs up to 671B parameters demonstrate up to 26.9% memory savings and 177.1% throughput acceleration, with perfectly lossless computations, i.e., no deviation in model outputs. Our results establish exponent concentration as a statistical law of trained models and open a principled path for lossless low-precision floating-point design in the FP8 era.
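The exponent-concentration claim is straightforward to probe empirically. The sketch below is not the authors' code; it uses a Gaussian tensor as a hypothetical stand-in for a real checkpoint, extracts the IEEE-754 exponent field from float32 weights, and measures its Shannon entropy:

```python
import numpy as np

# Hypothetical stand-in for trained weights: a real experiment would load
# an actual checkpoint; a zero-mean Gaussian keeps the script self-contained.
rng = np.random.default_rng(0)
weights = rng.normal(0.0, 0.02, size=1_000_000).astype(np.float32)

# Reinterpret the float32 words as integers and extract the 8-bit
# biased exponent field (bits 23..30 of IEEE-754 single precision).
bits = weights.view(np.uint32)
exponents = (bits >> 23) & 0xFF

# Shannon entropy of the empirical exponent distribution, in bits.
counts = np.bincount(exponents, minlength=256)
p = counts[counts > 0] / counts.sum()
entropy = -np.sum(p * np.log2(p))

print(f"exponent entropy: {entropy:.2f} bits (field width: 8 bits)")
```

Even on this synthetic stand-in the entropy falls well below the 8 bits the exponent field occupies; the paper's result is that trained GenAI weights concentrate further still, which is what yields the ~FP4.67 lossless limit.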
Related papers
- Tensor-Compressed and Fully-Quantized Training of Neural PDE Solvers [10.320585073024455]
We present a framework that enables scalable and energy-efficient PINN training on edge devices. This enables real-time PDE solving on such devices and paves the way for energy-efficient scientific computing at scale.
arXiv Detail & Related papers (2025-12-10T00:00:34Z) - Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation [50.21021246855702]
We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs). Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps. Our results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
arXiv Detail & Related papers (2025-11-21T08:12:47Z) - A Convergence Analysis of Adaptive Optimizers under Floating-point Quantization [32.97211471008323]
We introduce the first theoretical framework for the convergence of adaptive optimizers, including Adam and Muon, under floating-point quantization of gradients, weights, and optimizer states. We show that both algorithms retain convergence rates close to their full-precision counterparts provided the mantissa length scales only logarithmically with the number of iterations. Our analysis further reveals that Adam is highly sensitive to weight and second-moment quantization due to its reliance on $\beta \to 1$, while Muon requires weaker error control and is thus potentially more robust.
arXiv Detail & Related papers (2025-10-24T10:16:23Z) - Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats [0.0]
In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4. Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models.
arXiv Detail & Related papers (2025-08-20T12:46:50Z) - First-Order Error Matters: Accurate Compensation for Quantized Large Language Models [32.69069234109942]
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs). Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error. We propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation.
arXiv Detail & Related papers (2025-07-15T06:18:46Z) - Flow Matching Meets PDEs: A Unified Framework for Physics-Constrained Generation [21.321570407292263]
We propose Physics-Based Flow Matching, a generative framework that embeds physical constraints, both PDE residuals and algebraic relations, into the flow matching objective. We show that our approach yields physical residuals up to $8\times$ more accurate than FM, while clearly outperforming existing algorithms in terms of distributional accuracy.
arXiv Detail & Related papers (2025-06-10T09:13:37Z) - Unified Scaling Laws for Compressed Representations [69.72517034565467]
We investigate whether a unified scaling framework can accurately predict model performance when training occurs over various compressed representations. Our main finding, demonstrated both theoretically and empirically, is that a simple "capacity" metric exists. We extend our formulation to directly compare the accuracy potential of different compressed formats, and to derive better algorithms for training over sparse-quantized formats.
arXiv Detail & Related papers (2025-06-02T16:52:51Z) - Data-free Weight Compress and Denoise for Large Language Models [96.68582094536032]
We propose a novel approach termed Data-free Joint Rank-k Approximation for compressing the parameter matrices. We achieve a model pruning of 80% parameters while retaining 93.43% of the original performance without any calibration data.
arXiv Detail & Related papers (2024-02-26T05:51:47Z) - From PEFT to DEFT: Parameter Efficient Finetuning for Reducing Activation Density in Transformers [52.199303258423306]
We propose a novel density loss that encourages higher activation sparsity in pre-trained models.
Our proposed method, DEFT, can consistently reduce activation density by up to 44.94% on RoBERTa-Large and by 53.19% (encoder density) and 90.60% (decoder density) on Flan-T5-XXL.
arXiv Detail & Related papers (2024-02-02T21:25:46Z) - ZeroQuant-FP: A Leap Forward in LLMs Post-Training W4A8 Quantization Using Floating-Point Formats [25.543571445739936]
This study explores the viability of floating-point (FP) quantization for large language models (LLMs).
For LLMs, FP8 activation consistently outshines its integer (INT8) equivalent, with the performance edge becoming more noticeable in models possessing parameters beyond one billion.
For weight quantization, our findings indicate that FP4 exhibits comparable, if not superior, performance to INT4, simplifying deployment on FP-supported hardware like H100.
arXiv Detail & Related papers (2023-07-19T06:58:03Z) - Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline.
arXiv Detail & Related papers (2023-01-31T02:46:57Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
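Several of the entries above (the ZipNN-style lossless compressors and ECF8 itself) exploit the same structural fact: the byte plane carrying the exponent is highly compressible, while mantissa bits are close to random. A minimal illustration, using hypothetical Gaussian weights truncated to bfloat16 and zlib as a stand-in for the entropy coders used in these papers:

```python
import zlib
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-in for a BF16 weight tensor, stored as uint16 words
# (bfloat16 is the upper 16 bits of a float32).
w = rng.normal(0.0, 0.02, size=500_000).astype(np.float32)
bf16 = (w.view(np.uint32) >> 16).astype(np.uint16)

hi = (bf16 >> 8).astype(np.uint8)    # sign bit + top 7 exponent bits
lo = (bf16 & 0xFF).astype(np.uint8)  # last exponent bit + 7 mantissa bits

# Compressed size as a fraction of raw size, per byte plane.
ratio = lambda b: len(zlib.compress(b.tobytes(), 9)) / len(b.tobytes())
print(f"high-byte plane: {ratio(hi):.2f}, low-byte plane: {ratio(lo):.2f}")
```

The gap between the two ratios is exactly what entropy-aware schemes exploit: code the concentrated exponent plane compactly and leave the near-incompressible mantissa bits untouched, which is why the result can be bit-exact rather than approximate.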
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.