Related papers: Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats

URL: http://arxiv.org/abs/2508.19263v1
Date: Wed, 20 Aug 2025 12:46:50 GMT
Title: Lossless Compression of Neural Network Components: Weights, Checkpoints, and K/V Caches in Low-Precision Formats
Authors: Anat Heilper, Doron Singer
Abstract summary: In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4. Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: As deep learning models grow and deployment becomes more widespread, reducing the storage and transmission costs of neural network weights has become increasingly important. While prior work such as ZipNN has shown that lossless compression methods, particularly those based on Huffman encoding of floating-point exponents, can significantly reduce model sizes, these techniques have primarily been applied to higher-precision formats such as FP32 and BF16. In this work, we extend the ZipNN approach to lower-precision floating-point formats, specifically FP8 and FP4, which are gaining popularity for efficient inference. We design a compression method that separates and compresses the exponent and mantissa components independently using entropy coding. Our evaluation shows compression ratios up to 62% for BF16 and 83% for FP8. We also investigate the compressibility of key-value (K/V) cache tensors used in large language models (LLMs), finding that they, too, exhibit compressible patterns, enabling memory savings during deployment.
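Illustrative sketch (not the authors' implementation): the abstract describes splitting each value into exponent and mantissa streams and entropy-coding them separately. The Python below shows only the splitting step for BF16, plus a Shannon-entropy estimate per stream as a rough indicator of how well a symbol-by-symbol coder such as Huffman could do. The helper names (to_bf16_bits, split_bf16, entropy_bits), the truncation-based BF16 conversion, and the synthetic Gaussian "weights" are all assumptions made for illustration; real model tensors may behave differently.

```python
import math
import random
import struct
from collections import Counter


def to_bf16_bits(x: float) -> int:
    """Return a BF16 bit pattern by keeping the top 16 bits of the FP32 encoding.

    Assumption: plain truncation, not round-to-nearest-even.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def split_bf16(bits: int) -> tuple[int, int, int]:
    """Split a 16-bit BF16 pattern into (sign, 8-bit exponent, 7-bit mantissa)."""
    sign = bits >> 15
    exponent = (bits >> 7) & 0xFF
    mantissa = bits & 0x7F
    return sign, exponent, mantissa


def entropy_bits(symbols) -> float:
    """Empirical Shannon entropy of a symbol stream, in bits per symbol."""
    counts = Counter(symbols)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


# Hypothetical stand-in for a weight tensor: small zero-mean Gaussian values.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(10_000)]

fields = [split_bf16(to_bf16_bits(w)) for w in weights]
exponents = [e for _, e, _ in fields]
mantissas = [m for _, _, m in fields]

exp_entropy = entropy_bits(exponents)  # compare against 8 raw exponent bits
man_entropy = entropy_bits(mantissas)  # compare against 7 raw mantissa bits

print(f"exponent: {exp_entropy:.2f} bits/symbol (raw 8)")
print(f"mantissa: {man_entropy:.2f} bits/symbol (raw 7)")
```

Keeping the streams separate lets each one be coded against its own symbol statistics; an actual compressor would feed each stream to an entropy coder rather than just measuring its entropy.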

Related papers

Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARC is an auto-regressive model that performs compression via next-token prediction. MoS module refines the compressed tokens by utilizing multiple compression results. ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
ReCalKV: Low-Rank KV Cache Compression via Head Reordering and Offline Calibration [81.81027217759433]
Large language models (LLMs) are often constrained by the excessive memory required to store the Key-Value (KV) cache. Recent methods have explored reducing the hidden dimensions of the KV cache, but many introduce additional computation through projection layers. We propose ReCalKV, a post-training KV cache compression method that reduces the hidden dimensions of the KV cache.
arXiv Detail & Related papers (2025-05-30T08:49:27Z)
Optimization of embeddings storage for RAG systems using quantization and dimensionality reduction techniques [0.0]
We show that float8 quantization achieves a 4x storage reduction with minimal performance degradation. PCA emerges as the most effective dimensionality reduction technique. We propose a methodology based on visualizing the performance-storage trade-off space to identify the optimal configuration.
arXiv Detail & Related papers (2025-04-30T18:20:16Z)
ZipNN: Lossless Compression for AI Models [10.111136691015554]
We present ZipNN, a lossless compression method tailored to neural networks. On popular models (e.g. Llama 3) ZipNN shows space savings that are over 17% better than vanilla compression. We estimate that these methods could save over an ExaByte per month of network traffic downloaded from a large model hub like Hugging Face.
arXiv Detail & Related papers (2024-11-07T23:28:23Z)
UNComp: Can Matrix Entropy Uncover Sparsity? -- A Compressor Design from an Uncertainty-Aware Perspective [85.08718140718707]
UNComp is an uncertainty-aware framework that uncovers sparsity patterns that can be used for adaptive compression. By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4x.
arXiv Detail & Related papers (2024-10-04T02:32:36Z)
Unlocking Data-free Low-bit Quantization with Matrix Decomposition for KV Cache Compression [87.5604418100301]
Key-value (KV) caching is an important technique to accelerate the inference of large language models. Existing methods often compromise precision or require extra data for calibration. We introduce DecoQuant, a novel data-free low-bit quantization technique based on tensor decomposition methods.
arXiv Detail & Related papers (2024-05-21T08:35:10Z)
Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which joins channel pruning and tensor decomposition to compress CNN models. We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
Deep Compression for PyTorch Model Deployment on Microcontrollers [0.2578242050187029]
This paper adds model compression, specifically Deep Compression, to Unlu's earlier work on arXiv. In the case of the LeNet-5 model, the memory footprint was reduced by 12.45x, and the inference speed was boosted by 2.57x.
arXiv Detail & Related papers (2021-03-29T22:08:44Z)
An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems [77.88178159830905]
Sparsity-Inducing Distribution-based Compression (SIDCo) is a threshold-based sparsification scheme that enjoys similar threshold estimation quality to deep gradient compression (DGC). Our evaluation shows SIDCo speeds up training by up to 41.7%, 7.6%, and 1.9% compared to the no-compression baseline, Topk, and DGC compressors, respectively.
arXiv Detail & Related papers (2021-01-26T13:06:00Z)
Compression strategies and space-conscious representations for deep neural networks [0.3670422696827526]
Recent advances in deep learning have made available powerful convolutional neural networks (CNN) with state-of-the-art performance in several real-world applications. CNNs have millions of parameters, thus they are not deployable on resource-limited platforms. In this paper, we investigate the impact of lossy compression of CNNs by weight pruning and quantization.
arXiv Detail & Related papers (2020-07-15T19:41:19Z)
Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size. A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator. In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
arXiv Detail & Related papers (2020-04-15T20:10:53Z)

This list is automatically generated from the titles and abstracts of the papers on this site.