TuneComp: Joint Fine-tuning and Compression for Large Foundation Models
- URL: http://arxiv.org/abs/2505.21835v1
- Date: Tue, 27 May 2025 23:49:35 GMT
- Title: TuneComp: Joint Fine-tuning and Compression for Large Foundation Models
- Authors: Xiangyu Chen, Jing Liu, Ye Wang, Matthew Brand, Pu Wang, Toshiaki Koike-Akino
- Abstract summary: Sequential fine-tuning and compression sacrifices performance while creating a larger-than-necessary model as an intermediate step. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.
- Score: 50.33925662486034
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To reduce model size during post-training, compression methods, including knowledge distillation, low-rank approximation, and pruning, are often applied after fine-tuning the model. However, sequential fine-tuning and compression sacrifices performance while creating a larger-than-necessary model as an intermediate step. In this work, we aim to reduce this gap by directly constructing a smaller model guided by the downstream task. We propose to jointly fine-tune and compress the model by gradually distilling it to a pruned low-rank structure. Experiments demonstrate that joint fine-tuning and compression significantly outperforms other sequential compression methods.
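As a rough, single-layer illustration of the joint objective (not the paper's exact procedure; the layer sizes, rank, and loss weights below are assumptions), one can fine-tune a low-rank replacement on the downstream task while simultaneously distilling it toward the frozen dense layer:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy setting: replace one fine-tuned dense layer (the "teacher") with a
# low-rank student trained jointly on the task and on matching the teacher.
d_in, d_out, rank = 512, 512, 32
teacher = nn.Linear(d_in, d_out).requires_grad_(False)   # frozen dense layer
student = nn.Sequential(                                  # W ~ B @ A, far fewer parameters
    nn.Linear(d_in, rank, bias=False),
    nn.Linear(rank, d_out),
)
opt = torch.optim.AdamW(student.parameters(), lr=1e-4)

def train_step(x, labels, distill_weight=1.0):
    """Joint objective: downstream task loss + distillation toward the teacher."""
    with torch.no_grad():
        t_out = teacher(x)
    s_out = student(x)
    loss = F.cross_entropy(s_out, labels) + distill_weight * F.mse_loss(s_out, t_out)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

loss = train_step(torch.randn(8, d_in), torch.randint(0, d_out, (8,)))
```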
Related papers
- Projected Compression: Trainable Projection for Efficient Transformer Compression [2.9812951075697325]
Large language models have steadily increased in size to achieve improved performance. Projected Compression is a novel model compression technique that reduces model weights by utilizing projection modules. Experimental results show that Projected Compression outperforms the comparable hard pruning and retraining approach on higher quality models.
arXiv Detail & Related papers (2025-06-27T14:24:01Z)
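The summary above does not spell out what a projection module is; one speculative way to picture it is a small trainable matrix that maps a frozen weight onto fewer output dimensions. The class below is a sketch under that assumption, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ProjectedLinear(nn.Module):
    """Speculative sketch: compress a frozen weight W (out x in) with a trainable
    projection P (r x out), so the effective weight is P @ W with r << out."""
    def __init__(self, base_linear: nn.Linear, r: int):
        super().__init__()
        self.register_buffer("W", base_linear.weight.detach())          # frozen original weight
        self.P = nn.Parameter(torch.randn(r, base_linear.out_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(r))

    def forward(self, x):
        # The compressed layer has r output features instead of out_features.
        return x @ (self.P @ self.W).T + self.bias

# Usage: wrap an existing layer and train only the projection parameters.
layer = ProjectedLinear(nn.Linear(768, 768), r=192)
y = layer(torch.randn(4, 768))   # -> shape (4, 192)
```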
- Dynamic Base Model Shift for Delta Compression [53.505380509713575]
Delta compression attempts to lower the costs by reducing the redundancy of delta parameters. Existing methods by default employ the pretrained model as the base model and compress the delta parameters for every task. We propose Dynamic Base Model Shift (DBMS), which dynamically adapts the base model to the target task before performing delta compression.
arXiv Detail & Related papers (2025-05-16T15:11:19Z)
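For delta compression in general (the summary does not give DBMS's actual base-shift rule), a minimal sketch stores a per-task scalar shift of the base weight plus a low-rank approximation of the remaining delta; the scalar `shift` below is a placeholder assumption, not the paper's method.

```python
import torch

def lowrank_delta(base_w: torch.Tensor, tuned_w: torch.Tensor, rank: int, shift: float = 1.0):
    """Compress a fine-tuned weight as (shifted base) + low-rank delta.
    `shift` stands in for a per-task base-model adaptation."""
    delta = tuned_w - shift * base_w
    U, S, Vh = torch.linalg.svd(delta, full_matrices=False)
    return shift, U[:, :rank] * S[:rank], Vh[:rank]     # a scalar plus two thin factors

def reconstruct(base_w, shift, US, Vh):
    return shift * base_w + US @ Vh

base = torch.randn(512, 512)
tuned = base + 0.01 * torch.randn(512, 512)
shift, US, Vh = lowrank_delta(base, tuned, rank=16)
approx = reconstruct(base, shift, US, Vh)
```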
- Choose Your Model Size: Any Compression by a Single Gradient Descent [9.074689052563878]
We present Any Compression via Iterative Pruning (ACIP), an algorithmic approach to determine a compression-performance trade-off from a single gradient descent run. We show that ACIP seamlessly complements common quantization-based compression techniques.
arXiv Detail & Related papers (2025-02-03T18:40:58Z)
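One way to picture a train-once, choose-the-size-later scheme (a sketch, not ACIP's actual parameterization): factor each weight by SVD, learn a gate per singular component during a single training run, and later truncate to any target rank by keeping the top-scoring components.

```python
import torch
import torch.nn as nn

class ScoredLowRankLinear(nn.Module):
    """Sketch: SVD-factored weight with a learnable gate per singular component.
    Train once (with a sparsity penalty on the gates); afterwards any target
    size can be realized by keeping only the top-scoring components."""
    def __init__(self, linear: nn.Linear):
        super().__init__()
        U, S, Vh = torch.linalg.svd(linear.weight.detach(), full_matrices=False)
        self.U = nn.Parameter(U)
        self.S = nn.Parameter(S)
        self.Vh = nn.Parameter(Vh)
        self.score = nn.Parameter(torch.ones_like(S))    # gates, trained with a sparsity penalty

    def forward(self, x, keep=None):
        gate = torch.sigmoid(self.score)
        U, S, Vh = self.U, self.S, self.Vh
        if keep is not None:                             # post-hoc truncation to `keep` components
            idx = torch.topk(gate, keep).indices
            U, S, Vh, gate = U[:, idx], S[idx], Vh[idx], gate[idx]
        return x @ (U @ torch.diag(S * gate) @ Vh).T

layer = ScoredLowRankLinear(nn.Linear(256, 256))
y_full = layer(torch.randn(4, 256))             # all components
y_small = layer(torch.randn(4, 256), keep=32)   # same trained layer, 8x fewer components
```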
- Lossless and Near-Lossless Compression for Foundation Models [11.307357041746865]
We investigate the source of model compressibility, introduce compression variants tailored for models, and categorize models into compressibility groups.
We estimate that these methods could save over an ExaByte per month of network traffic downloaded from a large model hub like HuggingFace.
arXiv Detail & Related papers (2024-04-05T16:52:55Z)
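As a point of reference for this kind of lossless saving (the paper's model-tailored variants are not reproduced here), the snippet below simply measures how much a standard byte-level compressor shrinks a serialized checkpoint.

```python
import io
import zlib
import torch

# Toy checkpoint: this only measures plain byte-level (zlib) compressibility
# of the serialized weights, as a baseline for lossless model compression.
state_dict = {"w": torch.randn(1024, 1024), "b": torch.zeros(1024)}

buf = io.BytesIO()
torch.save(state_dict, buf)
raw = buf.getvalue()
packed = zlib.compress(raw, level=9)

print(f"raw: {len(raw) / 1e6:.1f} MB, compressed: {len(packed) / 1e6:.1f} MB, "
      f"ratio: {len(raw) / len(packed):.2f}x")
```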
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
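The TopK compressor referred to above can be pictured as keeping only the largest-magnitude fraction of a tensor's entries and zeroing the rest; this helper is a generic sketch rather than the paper's exact operator.

```python
import torch

def topk_compress(t: torch.Tensor, ratio: float = 0.1) -> torch.Tensor:
    """Keep only the largest-magnitude `ratio` fraction of entries, zero the rest.
    Generic sketch of TopK sparsification for activations or gradients."""
    flat = t.flatten()
    k = max(1, int(ratio * flat.numel()))
    idx = torch.topk(flat.abs(), k).indices
    out = torch.zeros_like(flat)
    out[idx] = flat[idx]
    return out.view_as(t)

grad = torch.randn(256, 256)
sparse_grad = topk_compress(grad, ratio=0.05)   # 95% of entries dropped
```

Per the finding quoted above, the same sparsification would also have to be applied at inference time for a TopK-trained model to perform well.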
- Just CHOP: Embarrassingly Simple LLM Compression [27.64461490974072]
Large language models (LLMs) enable unparalleled few- and zero-shot reasoning capabilities but at a high computational footprint.
We show that simple layer pruning coupled with an extended language model pretraining produces state-of-the-art results against structured and even semi-structured compression of models at a 7B scale.
We also show that distillation, which has been highly effective for task-agnostic compression of smaller BERT-style models, is inefficient compared with our simple pruning technique.
arXiv Detail & Related papers (2023-05-24T08:18:35Z)
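Layer pruning of the kind described can be as blunt as deleting whole transformer blocks and then continuing pretraining on the smaller model. The helper below assumes a HuggingFace-style decoder that exposes its blocks as `model.model.layers`; that layout, and the choice to drop the top blocks, are assumptions for illustration.

```python
import torch.nn as nn

def drop_top_layers(model: nn.Module, n_drop: int) -> nn.Module:
    """Remove the last n_drop transformer blocks in place.
    Assumes the decoder exposes its blocks as `model.model.layers` (a common
    HuggingFace layout); adjust the attribute path for other architectures."""
    layers = model.model.layers
    keep = len(layers) - n_drop
    model.model.layers = nn.ModuleList(list(layers)[:keep])
    model.config.num_hidden_layers = keep
    return model

# After chopping, the recipe above continues (extended) pretraining on the
# smaller model rather than distilling from the original.
```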
- Train Flat, Then Compress: Sharpness-Aware Minimization Learns More Compressible Models [7.6356407698088]
Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models.
We show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization.
arXiv Detail & Related papers (2022-05-25T11:54:37Z)
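Sharpness-aware minimization itself is standard: each update first perturbs the weights in the locally worst-case direction, then applies the gradient computed at the perturbed point. Below is a minimal two-pass step, not the paper's full training setup.

```python
import torch

def sam_step(model, loss_fn, batch, base_opt, rho=0.05):
    """One sharpness-aware minimization step: ascend to the worst-case nearby
    point, then descend using the gradient measured there."""
    x, y = batch
    # First pass: gradient at the current weights.
    loss = loss_fn(model(x), y)
    loss.backward()
    grads = [p.grad.detach().clone() if p.grad is not None else None
             for p in model.parameters()]
    norm = torch.norm(torch.stack([g.norm() for g in grads if g is not None]))
    eps = [rho * g / (norm + 1e-12) if g is not None else None for g in grads]
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.add_(e)                      # climb toward the sharpest nearby point
    base_opt.zero_grad()
    # Second pass: gradient at the perturbed weights drives the real update.
    loss_fn(model(x), y).backward()
    with torch.no_grad():
        for p, e in zip(model.parameters(), eps):
            if e is not None:
                p.sub_(e)                      # restore the original weights
    base_opt.step()
    base_opt.zero_grad()
    return loss.item()
```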
- What do Compressed Large Language Models Forget? Robustness Challenges in Model Compression [68.82486784654817]
We study two popular model compression techniques: knowledge distillation and pruning.
We show that compressed models are significantly less robust than their PLM counterparts on adversarial test sets.
We develop a regularization strategy for model compression based on sample uncertainty.
arXiv Detail & Related papers (2021-10-16T00:20:04Z)
- Block Pruning For Faster Transformers [89.70392810063247]
We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
arXiv Detail & Related papers (2021-09-10T12:46:32Z)
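To make "pruning full components such as attention heads" concrete, a generic structured-pruning sketch (not the paper's regularized objective) attaches a learnable gate to every head and later removes heads whose gate collapses to zero:

```python
import torch
import torch.nn as nn

class GatedHeads(nn.Module):
    """Generic sketch of structured head pruning: a learnable gate per attention
    head scales that head's output; heads whose gate collapses to ~0 can be
    removed entirely, shrinking the layer instead of just zeroing weights."""
    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.gate = nn.Parameter(torch.ones(n_heads))   # trained with a sparsity penalty

    def forward(self, head_outputs: torch.Tensor):
        # head_outputs: (batch, n_heads, seq, head_dim)
        return head_outputs * self.gate.view(1, -1, 1, 1)

    def surviving_heads(self, threshold: float = 1e-3):
        return (self.gate.abs() > threshold).nonzero(as_tuple=True)[0].tolist()
```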
- Towards Compact CNNs via Collaborative Compression [166.86915086497433]
We propose a Collaborative Compression scheme, which combines channel pruning and tensor decomposition to compress CNN models.
We achieve 52.9% FLOPs reduction by removing 48.4% parameters on ResNet-50 with only a Top-1 accuracy drop of 0.56% on ImageNet 2012.
arXiv Detail & Related papers (2021-05-24T12:07:38Z)
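To make the combination concrete (the paper's joint optimization criterion is not reproduced here), the sketch below first drops the weakest output channels of a convolution by L1 norm and then factorizes the surviving kernel with a truncated SVD.

```python
import torch

def prune_and_decompose(conv_weight: torch.Tensor, keep_channels: int, rank: int):
    """Sketch of combining channel pruning with low-rank decomposition.
    conv_weight: (out_ch, in_ch, kH, kW). Returns two thin factors."""
    # 1) Channel pruning: keep output channels with the largest L1 norms.
    scores = conv_weight.abs().sum(dim=(1, 2, 3))
    keep = torch.topk(scores, keep_channels).indices
    pruned = conv_weight[keep]                          # (keep_channels, in_ch, kH, kW)
    # 2) Tensor decomposition: truncated SVD of the flattened kernel.
    mat = pruned.flatten(1)                             # (keep_channels, in_ch*kH*kW)
    U, S, Vh = torch.linalg.svd(mat, full_matrices=False)
    return U[:, :rank] * S[:rank], Vh[:rank]

W = torch.randn(64, 32, 3, 3)
A, B = prune_and_decompose(W, keep_channels=32, rank=8)   # A: (32, 8), B: (8, 288)
```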