Structured Pruning Learns Compact and Accurate Models
- URL: http://arxiv.org/abs/2204.00408v1
- Date: Fri, 1 Apr 2022 13:09:56 GMT
- Title: Structured Pruning Learns Compact and Accurate Models
- Authors: Mengzhou Xia, Zexuan Zhong, Danqi Chen
- Abstract summary: We propose a task-specific structured pruning method CoFi (Coarse- and Fine-grained Pruning)
CoFi delivers highly parallelizableworks and matches the distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
- Score: 28.54826400747667
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The growing size of neural language models has led to increased attention in
model compression. The two predominant approaches are pruning, which gradually
removes weights from a pre-trained model, and distillation, which trains a
smaller compact model to match a larger one. Pruning methods can significantly
reduce the model size but hardly achieve large speedups as distillation.
However, distillation methods require large amounts of unlabeled data and are
expensive to train. In this work, we propose a task-specific structured pruning
method CoFi (Coarse- and Fine-grained Pruning), which delivers highly
parallelizable subnetworks and matches the distillation methods in both
accuracy and latency, without resorting to any unlabeled data. Our key insight
is to jointly prune coarse-grained (e.g., layers) and fine-grained (e.g., heads
and hidden units) modules, which controls the pruning decision of each
parameter with masks of different granularity. We also devise a layerwise
distillation strategy to transfer knowledge from unpruned to pruned models
during optimization. Our experiments on GLUE and SQuAD datasets show that CoFi
yields models with over 10x speedups with a small accuracy drop, showing its
effectiveness and efficiency compared to previous pruning and distillation
approaches.
Related papers
- Generative Dataset Distillation Based on Self-knowledge Distillation [49.20086587208214]
We present a novel generative dataset distillation method that can improve the accuracy of aligning prediction logits.
Our approach integrates self-knowledge distillation to achieve more precise distribution matching between the synthetic and original data.
Our method outperforms existing state-of-the-art methods, resulting in superior distillation performance.
arXiv Detail & Related papers (2025-01-08T00:43:31Z) - Efficient Dataset Distillation via Diffusion-Driven Patch Selection for Improved Generalization [34.79567392368196]
We propose a novel framework to existing diffusion-based distillation methods, leveraging diffusion models for selection rather than generation.
Our method starts by predicting noise generated by the diffusion model based on input images and text prompts, then calculates the corresponding loss for each pair.
This streamlined framework enables a single-step distillation process, and extensive experiments demonstrate that our approach outperforms state-of-the-art methods across various metrics.
arXiv Detail & Related papers (2024-12-13T08:34:46Z) - TT-MPD: Test Time Model Pruning and Distillation [3.675015670568961]
Pruning can be an effective method of compressing large pre-trained models for inference speed acceleration.
Previous pruning approaches rely on access to the original training dataset for both pruning and subsequent fine-tuning.
We propose an efficient pruning method that considers the approximated finetuned accuracy and potential inference latency saving.
arXiv Detail & Related papers (2024-12-10T02:05:13Z) - One-Step Diffusion Distillation via Deep Equilibrium Models [64.11782639697883]
We introduce a simple yet effective means of distilling diffusion models directly from initial noise to the resulting image.
Our method enables fully offline training with just noise/image pairs from the diffusion model.
We demonstrate that the DEQ architecture is crucial to this capability, as GET matches a $5times$ larger ViT in terms of FID scores.
arXiv Detail & Related papers (2023-12-12T07:28:40Z) - BOOT: Data-free Distillation of Denoising Diffusion Models with
Bootstrapping [64.54271680071373]
Diffusion models have demonstrated excellent potential for generating diverse images.
Knowledge distillation has been recently proposed as a remedy that can reduce the number of inference steps to one or a few.
We present a novel technique called BOOT, that overcomes limitations with an efficient data-free distillation algorithm.
arXiv Detail & Related papers (2023-06-08T20:30:55Z) - Structural Pruning for Diffusion Models [65.02607075556742]
We present Diff-Pruning, an efficient compression method tailored for learning lightweight diffusion models from pre-existing ones.
Our empirical assessment, undertaken across several datasets highlights two primary benefits of our proposed method.
arXiv Detail & Related papers (2023-05-18T12:38:21Z) - Train Flat, Then Compress: Sharpness-Aware Minimization Learns More
Compressible Models [7.6356407698088]
Pruning unnecessary parameters has emerged as a simple and effective method for compressing large models.
We show that optimizing for flat minima consistently leads to greater compressibility of parameters compared to standard Adam optimization.
arXiv Detail & Related papers (2022-05-25T11:54:37Z) - Block Pruning For Faster Transformers [89.70392810063247]
We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
arXiv Detail & Related papers (2021-09-10T12:46:32Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning.
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.