Float8@2bits: Entropy Coding Enables Data-Free Model Compression
- URL: http://arxiv.org/abs/2601.22787v1
- Date: Fri, 30 Jan 2026 10:08:15 GMT
- Title: Float8@2bits: Entropy Coding Enables Data-Free Model Compression
- Authors: Patrick Putzky, Martin Genzel, Mattes Mollenhauer, Sebastian Schulze, Thomas Wollmann, Stefan Dietzel
- Abstract summary: We introduce EntQuant, the first framework to unite the advantages of different post-training compression regimes. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B parameter model in less than 30 minutes. We demonstrate that EntQuant does not only achieve state-of-the-art results on standard evaluation sets and models, but also retains functional performance on more complex benchmarks.
- Score: 4.775539058503235
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Post-training compression is currently divided into two contrasting regimes. On the one hand, fast, data-free, and model-agnostic methods (e.g., NF4 or HQQ) offer maximum accessibility but suffer from functional collapse at extreme bit-rates below 4 bits. On the other hand, techniques leveraging calibration data or extensive recovery training achieve superior fidelity but impose high computational constraints and face uncertain robustness under data distribution shifts. We introduce EntQuant, the first framework to unite the advantages of these distinct paradigms. By matching the performance of data-dependent methods with the speed and universality of data-free techniques, EntQuant enables practical utility in the extreme compression regime. Our method decouples numerical precision from storage cost via entropy coding, compressing a 70B parameter model in less than 30 minutes. We demonstrate that EntQuant does not only achieve state-of-the-art results on standard evaluation sets and models, but also retains functional performance on more complex benchmarks with instruction-tuned models, all at modest inference overhead.
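The abstract's central claim, that numerical precision can be decoupled from storage cost via entropy coding, can be illustrated with a small sketch. This is not EntQuant's actual pipeline: it assumes synthetic Gaussian weights, uses a signed 8-bit integer grid as a stand-in for Float8, and reports the empirical Shannon entropy, which is the lower bound an ideal entropy coder would approach. The helper names `quantize` and `entropy_bits` are hypothetical.

```python
import math
import random
from collections import Counter


def quantize(weights, step, levels=256):
    """Round each weight to a signed integer grid with `levels` codes.

    Assumption: an int8-style uniform grid stands in for Float8 here.
    """
    half = levels // 2
    return [max(-half, min(half - 1, round(w / step))) for w in weights]


def entropy_bits(symbols):
    """Empirical Shannon entropy in bits per symbol."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


random.seed(0)
# Assumption: synthetic weights drawn from a standard normal distribution.
weights = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Both variants are stored in the same 8-bit container format.
fine = quantize(weights, step=0.05)   # many distinct codes in use
coarse = quantize(weights, step=1.0)  # only a handful of codes in use

bits_fine = entropy_bits(fine)
bits_coarse = entropy_bits(coarse)
print(f"fine grid:   {bits_fine:.2f} bits/weight")
print(f"coarse grid: {bits_coarse:.2f} bits/weight")
```

Under these assumptions the coarse grid still uses 8-bit codes, but its entropy is only about 2 bits per weight, so an entropy coder could store it at roughly a quarter of the nominal size. The storage cost follows the code distribution, not the width of the number format.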
Related papers
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARC is an auto-regressive model that performs compression via next-token prediction. MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
- Efficient Feature Compression for Machines with Global Statistics Preservation [5.113857098394778]
In this paper, we employ Z-score normalization to efficiently recover the compressed feature data at the decoder side. Our method supersedes the existing scaling method used by the current standard under development. Experiments show that our proposed method achieves a 17.09% reduction on average across different tasks and up to 65.69% for object tracking.
arXiv Detail & Related papers (2025-12-10T01:51:34Z)
- Winning the Pruning Gamble: A Unified Approach to Joint Sample and Token Pruning for Efficient Supervised Fine-Tuning [71.30276778807068]
We propose a unified framework that strategically coordinates sample pruning and token pruning. Q-Tuning achieves a +38% average improvement over the full-data SFT baseline using only 12.5% of the original training data.
arXiv Detail & Related papers (2025-09-28T13:27:38Z)
- Generative Latent Diffusion for Efficient Spatiotemporal Data Reduction [11.494915987840876]
Experimental results across multiple datasets show that our method achieves up to 10 times higher compression ratios than rule-based state-of-the-art compressors such as SZ3, and up to 63 percent better performance than leading learning-based methods under the same reconstruction error.
arXiv Detail & Related papers (2025-07-02T20:27:38Z)
- Forget the Data and Fine-Tuning! Just Fold the Network to Compress [13.611551223875194]
We introduce model folding, a novel data-free model compression technique that merges structurally similar neurons across layers. We show that model folding achieves comparable performance to data-driven compression techniques and outperforms recently proposed data-free methods. This approach is particularly effective for compressing large-scale models, making it suitable for deployment in resource-constrained environments.
arXiv Detail & Related papers (2025-02-14T15:10:43Z)
- Bit-bit encoding, optimizer-free training and sub-net initialization: techniques for scalable quantum machine learning [0.0]
We present a quantum classifier that encodes both the input and the output as binary strings. We show that if one parameter is updated at a time, quantum models can be trained in a way that guarantees convergence to a local minimum.
arXiv Detail & Related papers (2025-01-04T00:35:14Z)
- Ares: Approximate Representations via Efficient Sparsification -- A Stateless Approach through Polynomial Homomorphism [1.3824176915623292]
We introduce a stateless compression framework that leverages limiting representations to achieve compact, interpretable and scalable data reduction. Our approach achieves high compression ratios without compromising reconstruction accuracy, all while maintaining simplicity and scalability.
arXiv Detail & Related papers (2024-12-14T00:05:43Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions. Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
- TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models [52.454274602380124]
Diffusion models heavily depend on the time-step $t$ to achieve satisfactory multi-round denoising.
We propose a Temporal Feature Maintenance Quantization (TFMQ) framework building upon a Temporal Information Block.
Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features.
arXiv Detail & Related papers (2023-11-27T12:59:52Z)
- Gradient-Free Structured Pruning with Unlabeled Data [57.999191898036706]
We propose a gradient-free structured pruning framework that uses only unlabeled data.
Up to 40% of the original FLOP count can be reduced with less than a 4% accuracy loss across all tasks considered.
arXiv Detail & Related papers (2023-03-07T19:12:31Z)
- ClusterQ: Semantic Feature Distribution Alignment for Data-Free Quantization [111.12063632743013]
We propose a new and effective data-free quantization method termed ClusterQ.
To obtain high inter-class separability of semantic features, we cluster and align the feature distribution statistics.
We also incorporate the intra-class variance to solve class-wise mode collapse.
arXiv Detail & Related papers (2022-04-30T06:58:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.