Block Pruning For Faster Transformers
- URL: http://arxiv.org/abs/2109.04838v1
- Date: Fri, 10 Sep 2021 12:46:32 GMT
- Title: Block Pruning For Faster Transformers
- Authors: François Lagunas, Ella Charlaix, Victor Sanh, Alexander M. Rush
- Abstract summary: We introduce a block pruning approach targeting both small and fast models.
We find that this approach learns to prune out full components of the underlying model, such as attention heads.
- Score: 89.70392810063247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Pre-training has improved model accuracy for both classification and
generation tasks at the cost of introducing much larger and slower models.
Pruning methods have proven to be an effective way of reducing model size,
whereas distillation methods have proven effective at speeding up inference. We introduce
a block pruning approach targeting both small and fast models. Our approach
extends structured methods by considering blocks of any size and integrates
this structure into the movement pruning paradigm for fine-tuning. We find that
this approach learns to prune out full components of the underlying model, such
as attention heads. Experiments consider classification and generation tasks,
yielding among other results a pruned model that is a 2.4x faster, 74% smaller
BERT on SQuAD v1, with a 1% drop on F1, competitive both with distilled models
in speed and pruned models in size.
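The mechanism is concrete enough to sketch. Below is a minimal PyTorch illustration of block-level movement pruning, not the authors' implementation: each square block of a weight matrix gets a learned score, the top-scoring blocks are kept through a straight-through mask, and fine-tuning drives unimportant blocks toward removal. The class name `BlockPrunedLinear` and parameters such as `block_size` and `keep_ratio` are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BlockPrunedLinear(nn.Module):
    """Linear layer whose weight matrix is pruned in (B x B) blocks.

    A learned score per block is turned into a hard 0/1 mask with a
    straight-through estimator, so score gradients implement a
    movement-pruning-style criterion at the block level. Sketch only.
    """
    def __init__(self, in_features, out_features, block_size=32, keep_ratio=0.5):
        super().__init__()
        assert in_features % block_size == 0 and out_features % block_size == 0
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.bias = nn.Parameter(torch.zeros(out_features))
        self.block_size = block_size
        self.keep_ratio = keep_ratio
        n_blocks = (out_features // block_size) * (in_features // block_size)
        self.scores = nn.Parameter(torch.zeros(n_blocks))

    def block_mask(self):
        # Keep the top-k blocks by learned score; straight-through gradient.
        k = max(1, int(self.keep_ratio * self.scores.numel()))
        threshold = torch.topk(self.scores, k).values.min()
        hard = (self.scores >= threshold).float()
        return hard + self.scores - self.scores.detach()

    def forward(self, x):
        B = self.block_size
        out_b, in_b = self.weight.shape[0] // B, self.weight.shape[1] // B
        mask = self.block_mask().view(out_b, in_b)
        # Expand the per-block mask to the full weight shape.
        full_mask = mask.repeat_interleave(B, dim=0).repeat_interleave(B, dim=1)
        return F.linear(x, self.weight * full_mask, self.bias)
```

When the block size lines up with the parameter slab of a single attention head, dropping a block drops the head outright, which is consistent with the abstract's observation that whole components such as attention heads get pruned.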
Related papers
- MoE-Pruner: Pruning Mixture-of-Experts Large Language Model using the Hints from Its Router [55.88046193872355]
Mixture-of-Experts (MoE) architectures face challenges such as high memory consumption and redundancy in experts.
We propose MoE-Pruner, a method that prunes weights with the smallest magnitudes multiplied by the corresponding input activations and router weights.
Our pruning method is one-shot, requiring no retraining or weight updates.
arXiv Detail & Related papers (2024-10-15T19:22:27Z)
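The stated criterion is simple enough to sketch. The following is a hedged, minimal version assuming one expert's weight matrix `W`, a calibration batch `X` of activations routed to that expert, and a scalar `router_weight` for the expert's gate; these names are hypothetical and this is not the paper's code.

```python
import torch

def moe_pruner_mask(W, X, router_weight, sparsity=0.5):
    """One-shot mask for one expert's weight matrix (no retraining).

    Score each weight by |magnitude| x input-activation norm x the
    router weight attached to this expert, then zero the lowest scores.
    W: (out, in) expert weights; X: (tokens, in) calibration activations
    routed to this expert; router_weight: scalar gate strength. Sketch.
    """
    act_norm = X.norm(p=2, dim=0)                # per-input-feature L2 norm
    scores = W.abs() * act_norm * router_weight  # broadcast over rows
    k = max(1, int(sparsity * scores.numel()))
    threshold = scores.flatten().kthvalue(k).values
    return (scores > threshold).float()

# Hypothetical usage: W_pruned = W * moe_pruner_mask(W, X, gate, 0.5)
```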
- SlimFlow: Training Smaller One-Step Diffusion Models with Rectified Flow [24.213303324584906]
We develop small, efficient one-step diffusion models based on the powerful rectified flow framework.
We train a one-step diffusion model with an FID of 5.02 and 15.7M parameters, outperforming the previous state-of-the-art one-step diffusion model.
arXiv Detail & Related papers (2024-07-17T16:38:45Z)
- T-Stitch: Accelerating Sampling in Pre-Trained Diffusion Models with Trajectory Stitching [143.72720563387082]
Trajectory Stitching (T-Stitch) is a simple yet efficient technique to improve sampling efficiency with little or no generation degradation.
Our key insight is that different diffusion models learn similar encodings under the same training data distribution.
Our method can also be used as a drop-in technique to accelerate the popular pretrained stable diffusion (SD) models.
arXiv Detail & Related papers (2024-02-21T23:08:54Z)
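The stitching schedule itself fits in a few lines. A minimal sketch, assuming two interchangeable denoisers `small` and `large` with the same calling convention and a solver update `step_fn`; it illustrates the idea rather than the released T-Stitch code.

```python
import torch

@torch.no_grad()
def stitched_sampling(small, large, x_T, timesteps, switch_frac=0.4, step_fn=None):
    """Denoise with a cheap model early, a large model late.

    Early steps shape global structure, where the two models' outputs
    are similar, so the small model is 'stitched' in there; the large
    model refines details in the remaining steps. `step_fn` applies one
    solver update given (x, t, predicted_noise). Sketch only.
    """
    x = x_T
    n_small = int(switch_frac * len(timesteps))
    for i, t in enumerate(timesteps):
        model = small if i < n_small else large
        eps = model(x, t)          # predicted noise at this step
        x = step_fn(x, t, eps)     # one DDIM/DDPM-style update
    return x
```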
- A-SDM: Accelerating Stable Diffusion through Redundancy Removal and Performance Optimization [54.113083217869516]
In this work, we first explore the computationally redundant parts of the network.
We then prune the redundant blocks of the model while maintaining network performance.
Third, we propose a global-regional interactive (GRI) attention mechanism to speed up the computationally intensive attention component.
arXiv Detail & Related papers (2023-12-24T15:37:47Z)
- Federated Topic Model and Model Pruning Based on Variational Autoencoder [14.737942599204064]
Federated topic modeling allows multiple parties to jointly train models while protecting data privacy.
This paper proposes a method to establish a federated topic model while ensuring the privacy of each node, and uses neural network model pruning to accelerate the model.
Experimental results show that federated topic model pruning greatly accelerates model training while preserving the model's performance.
arXiv Detail & Related papers (2023-11-01T06:00:14Z)
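As a rough sketch of the loop described (federated aggregation for privacy, plus pruning for speed), here is a FedAvg-style server round with simple magnitude pruning. The paper's VAE-based topic model and its exact pruning schedule are not reproduced; every name below is an assumption.

```python
import torch

def fedavg_with_pruning(global_state, client_states, prune_frac=0.2):
    """One server round: average client updates, then prune small weights.

    Averaging keeps raw documents on the clients (privacy), and pruning
    shrinks the model broadcast for the next round, which is where the
    training speed-up comes from. Illustrative sketch, not the paper's code.
    """
    averaged = {}
    for name in global_state:
        stacked = torch.stack([c[name].float() for c in client_states])
        avg = stacked.mean(dim=0)
        if avg.dim() >= 2:  # prune only weight matrices, not biases
            k = max(1, int(prune_frac * avg.numel()))
            threshold = avg.abs().flatten().kthvalue(k).values
            avg = avg * (avg.abs() > threshold).float()
        averaged[name] = avg
    return averaged
```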
- Finding the SWEET Spot: Analysis and Improvement of Adaptive Inference in Low Resource Settings [6.463202903076821]
We compare the two main approaches for adaptive inference, Early-Exit and Multi-Model, when training data is limited.
Early-Exit provides a better speed-accuracy trade-off due to the overhead of the Multi-Model approach.
We propose SWEET, an Early-Exit fine-tuning method that assigns each classifier its own set of unique model weights.
arXiv Detail & Related papers (2023-06-04T09:16:39Z)
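Early-Exit inference itself is easy to sketch: attach a classifier after several intermediate layers and stop as soon as one is confident. The code below is a generic confidence-threshold early exit for batch size 1, not the SWEET fine-tuning procedure, whose contribution lies in how the per-classifier weights are trained.

```python
import torch

@torch.no_grad()
def early_exit_predict(layers, classifiers, x, confidence=0.9):
    """Run layers sequentially; return early once a head is confident.

    `layers` and `classifiers` are parallel lists: classifiers[i] reads
    the hidden state after layers[i]. Easy inputs exit early, hard
    inputs pay for the full stack: the speed/accuracy trade-off. Sketch.
    """
    h = x
    for layer, head in zip(layers, classifiers):
        h = layer(h)
        probs = torch.softmax(head(h), dim=-1)
        conf, pred = probs.max(dim=-1)
        if conf.item() >= confidence:   # assumes batch size 1
            return pred
    return pred  # fall through: last classifier's prediction
```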
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose CoFi (Coarse- and Fine-grained Pruning), a task-specific structured pruning method.
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
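The coarse-plus-fine idea can be illustrated by composing masks at two granularities on a single attention projection. A schematic sketch, assuming learned head-level and dimension-level masks; CoFi's joint training and distillation objective are not shown.

```python
import torch

def cofi_style_mask(W, head_mask, dim_mask, n_heads):
    """Compose a coarse (per-head) and fine (per-dimension) mask.

    W: (hidden, hidden) attention output projection.
    head_mask: (n_heads,) 0/1 coarse mask; dim_mask: (hidden,) fine mask.
    Zeroed heads remove whole structured slices, so the pruned model
    stays dense and parallelizable on standard hardware. Sketch only.
    """
    hidden = W.shape[1]
    head_dim = hidden // n_heads
    coarse = head_mask.repeat_interleave(head_dim)   # (hidden,)
    return W * (coarse * dim_mask).unsqueeze(0)      # mask input columns
```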
- Movement Pruning: Adaptive Sparsity by Fine-Tuning [115.91907953454034]
Magnitude pruning is a widely used strategy for reducing model size in pure supervised learning.
We propose the use of movement pruning, a simple, deterministic first-order weight pruning method.
Experiments show that when pruning large pretrained language models, movement pruning shows significant improvements in high-sparsity regimes.
arXiv Detail & Related papers (2020-05-15T17:54:15Z)
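Movement pruning's first-order criterion is compact: a weight's importance accumulates the negative product of the weight and its gradient, so weights moving away from zero are kept. A minimal sketch of the score accumulation, not the paper's full straight-through implementation:

```python
import torch

def update_movement_scores(scores, weight):
    """Accumulate movement-pruning importance after loss.backward().

    A weight 'moving away from zero' has gradient and value with
    opposite signs, so -grad * weight is positive and its score grows;
    weights shrinking toward zero accumulate negative scores and are
    pruned. Sketch only.
    """
    with torch.no_grad():
        scores -= weight.grad * weight
    return scores

# At pruning time, keep the weights with the largest accumulated scores,
# e.g. for 90% sparsity:
# k = int(0.9 * scores.numel())
# mask = (scores >= scores.flatten().kthvalue(k).values).float()
```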
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size in this setting, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.