ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking
- URL: http://arxiv.org/abs/2406.11257v1
- Date: Mon, 17 Jun 2024 06:47:29 GMT
- Title: ExCP: Extreme LLM Checkpoint Compression via Weight-Momentum Joint Shrinking
- Authors: Wenshuo Li, Xinghao Chen, Han Shu, Yehui Tang, Yunhe Wang
- Abstract summary: Large language models (LLMs) have recently attracted significant attention in the field of artificial intelligence.
We propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the required storage of training checkpoints.
We extensively evaluate our proposed ExCP framework on several models ranging from 410M to 7B parameters and demonstrate significant storage reduction while maintaining strong performance.
- Score: 39.02269424136506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have recently attracted significant attention in the field of artificial intelligence. However, the training process of these models poses significant challenges in terms of computational and storage capacity, so compressing checkpoints has become an urgent problem. In this paper, we propose a novel Extreme Checkpoint Compression (ExCP) framework, which significantly reduces the required storage of training checkpoints while achieving nearly lossless performance. We first calculate the residuals of adjacent checkpoints to obtain the essential but sparse information, yielding a higher compression ratio. To further exploit the redundant parameters in checkpoints, we then propose a weight-momentum joint shrinking method that utilizes another important source of information from model optimization, namely the momentum. In particular, we exploit the information of both the model and the optimizer to discard as many parameters as possible while preserving critical information to ensure optimal performance. Furthermore, we utilize non-uniform quantization to further compress the storage of checkpoints. We extensively evaluate our proposed ExCP framework on several models ranging from 410M to 7B parameters and demonstrate significant storage reduction while maintaining strong performance. For instance, we achieve approximately $70\times$ compression for the Pythia-410M model, with the final performance being as accurate as the original model on various downstream tasks. Codes will be available at https://github.com/Gaffey/ExCP.
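The pipeline described in the abstract (checkpoint residuals, weight-momentum joint shrinking, then non-uniform quantization) can be illustrated with a minimal numpy sketch. This is not the paper's exact algorithm: the joint importance score, keep ratio, and quantile-based codebook below are illustrative assumptions.

```python
# Illustrative sketch of an ExCP-style pipeline; thresholds and the quantizer
# are simplified stand-ins, not the paper's exact method.
import numpy as np

def checkpoint_residual(curr, prev):
    """Residual between adjacent checkpoints; sparse when training moves slowly."""
    return {k: curr[k] - prev[k] for k in curr}

def joint_shrink(residual, momentum, keep_ratio=0.1):
    """Weight-momentum joint shrinking (simplified): keep only entries whose
    combined residual/momentum magnitude is large, zero out the rest."""
    out_r, out_m = {}, {}
    for k in residual:
        score = np.abs(residual[k]) + np.abs(momentum[k])   # assumed joint importance
        thresh = np.quantile(score, 1.0 - keep_ratio)
        mask = score >= thresh
        out_r[k], out_m[k] = residual[k] * mask, momentum[k] * mask
    return out_r, out_m

def nonuniform_quantize(x, n_bins=16):
    """Non-uniform quantization via a quantile-based codebook (stand-in)."""
    codebook = np.quantile(x, np.linspace(0.0, 1.0, n_bins))
    idx = np.abs(x.reshape(-1, 1) - codebook).argmin(axis=1).astype(np.uint8)
    return idx.reshape(x.shape), codebook
```

Only the surviving (non-zero) residual and momentum entries plus the small codebooks would then be written to storage, which is where the large compression ratios come from.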
Related papers
- An Efficient Compression of Deep Neural Network Checkpoints Based on Prediction and Context Modeling [1.7495213911983414]
We propose a prediction-based compression approach, where values from the previously saved checkpoint are used for context modeling in arithmetic coding. Experimental results show that our approach achieves substantial bit size reduction, while enabling near-lossless training recovery from restored checkpoints.
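As a rough sketch of this idea, the previously saved checkpoint acts as the predictor and only the residual is entropy-coded. Here zlib stands in for the context-modeled arithmetic coder described in the paper, and all names are illustrative.

```python
# Sketch: predict weights from the previous checkpoint, entropy-code the residual.
# zlib is a stand-in for a context-modeled arithmetic coder.
import zlib
import numpy as np

def compress_checkpoint(curr, prev):
    residual = (curr - prev).astype(np.float32)   # small if training changed little
    return zlib.compress(residual.tobytes(), level=9)

def restore_checkpoint(blob, prev):
    residual = np.frombuffer(zlib.decompress(blob), dtype=np.float32)
    return prev + residual.reshape(prev.shape)

prev = np.random.randn(1024).astype(np.float32)
curr = prev + 1e-3 * np.random.randn(1024).astype(np.float32)
assert np.allclose(restore_checkpoint(compress_checkpoint(curr, prev), prev), curr)
```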
arXiv Detail & Related papers (2025-06-13T17:54:42Z) - Lossless Compression for LLM Tensor Incremental Snapshots [0.0]
We build an effective compression solution, referred to as LMC. We show that a 16-core parallel implementation of LMC can attain compression and decompression throughput of 2.78 GiB/s and 3.76 GiB/s respectively. This increase in performance ultimately reduces the resources needed and provides more time to copy the data to the storage system before the next epoch, thus allowing for higher-frequency checkpoints.
arXiv Detail & Related papers (2025-05-14T21:24:14Z) - EntroLLM: Entropy Encoded Weight Compression for Efficient Large Language Model Inference on Edge Devices [3.5240021321113204]
Large Language Models (LLMs) demonstrate exceptional performance across various tasks, but their large storage and computational requirements constrain their deployment on edge devices. We propose EntroLLM, a novel compression framework that integrates mixed quantization with entropy coding to reduce storage overhead while maintaining model accuracy.
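A toy sketch of the "mixed quantization plus entropy coding" recipe: each layer is quantized at its own bit-width and the resulting integer codes are entropy-coded. The per-layer bit-width table and the use of zlib are assumptions for illustration only, not the paper's implementation.

```python
# Toy sketch: per-layer mixed-precision quantization followed by entropy coding.
import zlib
import numpy as np

def quantize_layer(w, bits):
    """Uniform symmetric quantization to a given bit-width (stand-in)."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def compress_model(layers, bit_widths):
    """layers: {name: ndarray}; bit_widths: {name: int}, e.g. 4 or 8 per layer."""
    blobs = {}
    for name, w in layers.items():
        q, scale = quantize_layer(w, bit_widths[name])
        blobs[name] = (zlib.compress(q.tobytes()), scale)   # entropy-code the codes
    return blobs
```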
arXiv Detail & Related papers (2025-05-05T05:42:14Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - When Compression Meets Model Compression: Memory-Efficient Double Compression for Large Language Models [12.687035979970194]
This paper introduces a framework to compress large language models (LLMs) after quantization.
A compression-aware quantization is first proposed to enhance model weight compressibility by re-scaling the model parameters before quantization, followed by a pruning method to improve compressibility further.
Experiments show inference with the compressed model can achieve a 40% reduction in memory size with negligible loss in accuracy and inference speed.
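A schematic numpy sketch of the ordering described above (re-scale, quantize, then prune); the per-channel scaling rule and the prune ratio are assumptions, not the paper's exact choices.

```python
# Schematic sketch: per-channel re-scaling before quantization, then pruning of
# small quantized values so the result compresses better.
import numpy as np

def rescale_quantize_prune(w, bits=8, prune_ratio=0.3):
    """w: 2-D weight matrix (out_channels x in_features)."""
    scale = np.abs(w).max(axis=1, keepdims=True) + 1e-12    # assumed per-channel scale
    qmax = 2 ** (bits - 1) - 1
    q = np.round((w / scale) * qmax).astype(np.int8)
    thresh = np.quantile(np.abs(q), prune_ratio)
    q[np.abs(q) <= thresh] = 0                               # more zeros -> better compression
    return q, scale
```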
arXiv Detail & Related papers (2025-02-21T13:11:22Z) - Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP).
ACIP is an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run.
We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z) - LoRC: Low-Rank Compression for LLMs KV Cache with a Progressive Compression Strategy [59.1298692559785]
The Key-Value (KV) cache is a crucial component in serving transformer-based autoregressive large language models (LLMs), but its memory footprint grows quickly with sequence length and batch size.
Existing approaches to mitigate this issue include: (1) efficient attention variants integrated in upcycling stages, and (2) KV cache compression at test time.
We propose a low-rank approximation of KV weight matrices, allowing plug-in integration with existing transformer-based LLMs without model retraining.
Our method is designed to function without model tuning in upcycling stages or task-specific profiling in test stages.
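The core operation, replacing a key/value projection matrix with a truncated low-rank factorization, can be sketched in a few lines; the rank and matrix shapes below are illustrative, and how the factors are plugged into an existing model is left out.

```python
# Sketch: truncated-SVD factorization of a KV projection matrix.
import numpy as np

def low_rank_factor(w, rank):
    """Approximate w (d_out x d_in) as A @ B with A: d_out x rank, B: rank x d_in."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]   # singular values absorbed into the left factor

w_k = np.random.randn(512, 512).astype(np.float32)
A, B = low_rank_factor(w_k, rank=64)
print("relative error:", np.linalg.norm(w_k - A @ B) / np.linalg.norm(w_k))
```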
arXiv Detail & Related papers (2024-10-04T03:10:53Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
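The summary does not spell out CORM's exact eviction rule, so the sketch below only illustrates a generic recency/attention-based KV eviction policy of the same flavor; the scoring rule is an assumption.

```python
# Rough illustration of a KV-cache eviction step (generic, not CORM's exact rule):
# keep the positions that recent queries attended to most.
import numpy as np

def evict_kv(keys, values, recent_attn, budget):
    """keys/values: (seq, d); recent_attn: (n_recent_queries, seq) attention weights."""
    importance = recent_attn.sum(axis=0)                 # attention mass from recent queries
    keep = np.sort(np.argsort(importance)[-budget:])     # top-`budget` positions, original order
    return keys[keep], values[keep]
```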
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - The Cost of Compression: Investigating the Impact of Compression on Parametric Knowledge in Language Models [11.156816338995503]
Compressing large language models (LLMs) provides faster inference, smaller memory footprints, and enables local deployment.
Two standard compression techniques are pruning and quantization, with the former eliminating redundant connections in model layers and the latter representing model parameters with fewer bits.
Existing research on LLM compression primarily focuses on performance in terms of general metrics like perplexity or downstream task accuracy.
More fine-grained metrics, such as those measuring parametric knowledge, remain significantly underexplored.
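For reference, the two standard techniques mentioned above look roughly like this in code: magnitude pruning zeroes out small weights, and uniform quantization maps weights to low-bit integers. The thresholds are illustrative.

```python
# Minimal examples of the two standard compression techniques discussed above.
import numpy as np

def magnitude_prune(w, sparsity=0.5):
    """Remove (zero out) the smallest-magnitude fraction of connections."""
    thresh = np.quantile(np.abs(w), sparsity)
    return np.where(np.abs(w) < thresh, 0.0, w)

def uniform_quantize(w, bits=8):
    """Represent parameters with fewer bits via uniform symmetric quantization."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(np.abs(w).max() / qmax, 1e-12)
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale   # approximate reconstruction: q * scale
```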
arXiv Detail & Related papers (2023-12-01T22:27:12Z) - Inshrinkerator: Compressing Deep Learning Training Checkpoints via Dynamic Quantization [5.648270790530862]
State-of-the-art approaches involve lossy model compression mechanisms, which induce a tradeoff between the resulting model quality (accuracy) and compression ratio.
We make a key enabling observation that the sensitivity of model weights to compression varies during training, and different weights benefit from different quantization levels.
We propose a non-uniform quantization scheme that leverages this variation, an efficient search mechanism that dynamically finds the best quantization configurations, and a quantization-aware delta compression mechanism that rearranges weights to minimize checkpoint differences.
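A toy sketch of the quantization-aware delta idea: quantize both checkpoints with the same codebook and store only the indices that changed. The codebook itself (which the paper searches for dynamically) is a fixed placeholder here, and the storage format is an assumption.

```python
# Toy sketch: quantization-aware delta between checkpoints -- store only the
# quantization indices that changed since the previous checkpoint.
import numpy as np

def quantize_to_codebook(x, codebook):
    return np.abs(x.reshape(-1, 1) - codebook).argmin(axis=1).astype(np.uint8)

def quantized_delta(curr, prev, codebook):
    q_curr, q_prev = quantize_to_codebook(curr, codebook), quantize_to_codebook(prev, codebook)
    changed = np.nonzero(q_curr != q_prev)[0]
    return changed.astype(np.uint32), q_curr[changed]   # positions + new indices only

codebook = np.linspace(-1.0, 1.0, 16)          # placeholder for a searched configuration
prev = np.random.uniform(-1, 1, 10_000)
curr = prev.copy()
curr[:100] += 0.3                              # only a few weights moved this interval
positions, new_idx = quantized_delta(curr, prev, codebook)
print("changed entries:", positions.size, "of", curr.size)
```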
arXiv Detail & Related papers (2023-06-20T18:00:31Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
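A compact PyTorch sketch of soft prompt learning on a frozen, already-compressed model: trainable prompt embeddings are prepended to the input embeddings while the model weights stay fixed. The HuggingFace-style `inputs_embeds`/`labels` interface and the hyperparameters are assumptions made for illustration, not the paper's exact setup.

```python
# Sketch: learn soft prompt embeddings for a frozen (compressed) causal LM.
# Assumes an HF-style model that accepts inputs_embeds/labels and returns .loss,
# and that the model, embeddings, and prompt share a device.
import torch

n_prompt, d_model = 20, 4096                       # illustrative sizes
prompt = torch.nn.Parameter(0.02 * torch.randn(n_prompt, d_model))
optimizer = torch.optim.Adam([prompt], lr=1e-3)

def prompt_tuning_step(model, input_embeds, labels):
    for p in model.parameters():                   # compressed model stays frozen
        p.requires_grad_(False)
    soft = prompt.unsqueeze(0).expand(input_embeds.size(0), -1, -1)
    embeds = torch.cat([soft, input_embeds], dim=1)            # prepend the soft prompt
    pad = torch.full(soft.shape[:2], -100, dtype=labels.dtype,
                     device=labels.device)                     # ignore prompt positions in loss
    loss = model(inputs_embeds=embeds, labels=torch.cat([pad, labels], dim=1)).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```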
arXiv Detail & Related papers (2023-05-17T20:45:13Z) - When Liebig's Barrel Meets Facial Landmark Detection: A Practical Model [87.25037167380522]
We propose a model that is accurate, robust, efficient, generalizable, and end-to-end trainable.
In order to achieve a better accuracy, we propose two lightweight modules.
DQInit dynamically initializes the decoder queries from the inputs, enabling the model to achieve accuracy as good as models with multiple decoder layers.
QAMem is designed to enhance the discriminative ability of queries on low-resolution feature maps by assigning separate memory values to each query rather than a shared one.
arXiv Detail & Related papers (2021-05-27T13:51:42Z) - Check-N-Run: A Checkpointing System for Training Deep Learning Recommendation Models [5.604501524927757]
We present Check-N-Run, a scalable checkpointing system for training large machine learning models at Facebook.
Check-N-Run uses two primary techniques to address the size and bandwidth challenges.
These techniques allow Check-N-Run to reduce the required write bandwidth by 6-17x and the required capacity by 2.5-8x on real-world models.
arXiv Detail & Related papers (2020-10-17T00:45:55Z) - On Efficient Constructions of Checkpoints [21.965296582303115]
We propose a lossy compression scheme for checkpoint constructions (called LC-Checkpoint).
LC-Checkpoint simultaneously maximizes the compression rate and optimizes the recovery speed.
Our experiments show that LC-Checkpoint achieves a compression rate up to $28\times$ and recovery speedup up to $5.77\times$ over a state-of-the-art algorithm (SCAR).
arXiv Detail & Related papers (2020-09-28T01:20:15Z) - Training with Quantization Noise for Extreme Model Compression [57.51832088938618]
We tackle the problem of producing compact models, maximizing their accuracy for a given model size.
A standard solution is to train networks with Quantization Aware Training, where the weights are quantized during training and the gradients approximated with the Straight-Through Estimator.
In this paper, we extend this approach to work beyond int8 fixed-point quantization with extreme compression methods.
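The standard Quantization Aware Training setup described above can be sketched in a few lines of PyTorch: weights are fake-quantized in the forward pass, and the Straight-Through Estimator lets gradients flow as if quantization were the identity. The bit-width and the layer wrapper are illustrative.

```python
# Sketch: Quantization Aware Training with a straight-through estimator (STE).
import torch

def ste_quantize(w, bits=8):
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp_min(1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax, qmax) * scale
    return w + (w_q - w).detach()          # forward: quantized; backward: identity (STE)

class QuantLinear(torch.nn.Linear):
    """Linear layer whose weights are fake-quantized during the forward pass."""
    def forward(self, x):
        return torch.nn.functional.linear(x, ste_quantize(self.weight), self.bias)
```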
arXiv Detail & Related papers (2020-04-15T20:10:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences arising from its use.