Deep Neural Compression Via Concurrent Pruning and Self-Distillation
- URL: http://arxiv.org/abs/2109.15014v1
- Date: Thu, 30 Sep 2021 11:08:30 GMT
- Title: Deep Neural Compression Via Concurrent Pruning and Self-Distillation
- Authors: James O' Neill, Sourav Dutta, Haytham Assem
- Abstract summary: Pruning aims to reduce the number of parameters while maintaining performance close to the original network.
This work proposes a novel self-distillation based pruning strategy.
We show that the proposed cross-correlation objective for self-distilled pruning implicitly encourages sparse solutions.
- Score: 7.448510589632587
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Pruning aims to reduce the number of parameters while maintaining performance
close to the original network. This work proposes a novel
self-distillation based pruning strategy, whereby the representational
similarity between the pruned and unpruned versions of the same network is
maximized. Unlike previous approaches that treat distillation and pruning
separately, we use distillation to inform the pruning criteria, without
requiring a separate student network as in knowledge distillation. We show that
the proposed cross-correlation objective for self-distilled pruning
implicitly encourages sparse solutions, naturally complementing magnitude-based
pruning criteria. Experiments on the GLUE and XGLUE benchmarks show that
self-distilled pruning increases mono- and cross-lingual language model
performance. Self-distilled pruned models also outperform smaller Transformers
with an equal number of parameters and are competitive against (6 times) larger
distilled networks. We also observe that self-distillation (1) maximizes class
separability, (2) increases the signal-to-noise ratio, and (3) converges faster
after pruning steps, providing further insights into why self-distilled pruning
improves generalization.
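The abstract gives no implementation details, so the following is only a rough sketch of how a cross-correlation objective between the pruned and unpruned representations of the same network could be paired with magnitude-based pruning. It assumes a Barlow Twins-style formulation; the function names, the toy encoder, the normalisation, and the off-diagonal weight are illustrative choices, not the authors' method.

```python
# Hypothetical sketch: cross-correlation self-distillation between the hidden
# representations of an unpruned copy (teacher) and its pruned self (student),
# combined with simple global magnitude pruning. Illustrative only.
import torch
import torch.nn as nn


def cross_correlation_loss(z_teacher: torch.Tensor, z_student: torch.Tensor,
                           off_diag_weight: float = 5e-3) -> torch.Tensor:
    """Push the cross-correlation matrix of the two (standardised)
    representations toward the identity (Barlow Twins-style)."""
    n = z_teacher.size(0)
    z1 = (z_teacher - z_teacher.mean(0)) / (z_teacher.std(0) + 1e-6)
    z2 = (z_student - z_student.mean(0)) / (z_student.std(0) + 1e-6)
    c = (z1.T @ z2) / n                                  # d x d cross-correlation
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()       # keep matched features aligned
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()  # decorrelate the rest
    return on_diag + off_diag_weight * off_diag


def magnitude_prune_(model: nn.Module, sparsity: float) -> None:
    """Zero out the globally smallest-magnitude weights in-place."""
    all_weights = torch.cat([p.detach().abs().flatten()
                             for p in model.parameters() if p.dim() > 1])
    threshold = torch.quantile(all_weights, sparsity)
    with torch.no_grad():
        for p in model.parameters():
            if p.dim() > 1:
                p.mul_((p.abs() > threshold).float())


if __name__ == "__main__":
    encoder = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 64))
    x = torch.randn(16, 32)
    z_teacher = encoder(x).detach()          # the unpruned network is its own teacher
    magnitude_prune_(encoder, sparsity=0.5)  # prune, then distil into the pruned weights
    loss = cross_correlation_loss(z_teacher, encoder(x))
    loss.backward()
```

In a full training loop this loss would be added to the task objective and the pruning mask reapplied after each update; those details are omitted here.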
Related papers
- Isomorphic Pruning for Vision Models [56.286064975443026]
Structured pruning reduces the computational overhead of deep neural networks by removing redundant sub-structures.
We present Isomorphic Pruning, a simple approach that demonstrates effectiveness across a range of network architectures.
arXiv Detail & Related papers (2024-07-05T16:14:53Z)
- Theoretical Characterization of How Neural Network Pruning Affects its Generalization [131.1347309639727]
This work makes the first attempt to study how different pruning fractions affect the model's gradient descent dynamics and generalization.
It is shown that as long as the pruning fraction is below a certain threshold, gradient descent can drive the training loss toward zero.
More surprisingly, the generalization bound gets better as the pruning fraction gets larger.
arXiv Detail & Related papers (2023-01-01T03:10:45Z)
- Gradient-based Intra-attention Pruning on Pre-trained Language Models [21.444503777215637]
We propose a structured pruning method, GRAIN (Gradient-based Intra-attention pruning).
GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models.
Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high sparsity regime.
arXiv Detail & Related papers (2022-12-15T06:52:31Z)
- Interpretations Steered Network Pruning via Amortized Inferred Saliency Maps [85.49020931411825]
Compressing Convolutional Neural Networks (CNNs) is crucial for deploying these models on edge devices with limited resources.
We propose to address the channel pruning problem from a novel perspective by leveraging the interpretations of a model to steer the pruning process.
We tackle this challenge by introducing a selector model that predicts real-time smooth saliency masks for pruned models.
arXiv Detail & Related papers (2022-09-07T01:12:11Z)
- Structured Pruning Learns Compact and Accurate Models [28.54826400747667]
We propose a task-specific structured pruning method, CoFi (Coarse- and Fine-grained Pruning).
CoFi delivers highly parallelizable subnetworks and matches distillation methods in both accuracy and latency.
Our experiments on GLUE and SQuAD datasets show that CoFi yields models with over 10x speedups with a small accuracy drop.
arXiv Detail & Related papers (2022-04-01T13:09:56Z)
- Sparse Progressive Distillation: Resolving Overfitting under Pretrain-and-Finetune Paradigm [7.662952656290564]
Various pruning approaches have been proposed to reduce the footprint requirements of Transformer-based language models.
We show for the first time that reducing the risk of overfitting can help the effectiveness of pruning under the pretrain-and-finetune paradigm.
arXiv Detail & Related papers (2021-10-15T16:42:56Z)
- Sparse Training via Boosting Pruning Plasticity with Neuroregeneration [79.78184026678659]
We study the effect of pruning throughout training from the perspective of pruning plasticity.
We design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant, GraNet-ST (a sketch of a typical GMP sparsity schedule appears after this list).
Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet.
arXiv Detail & Related papers (2021-06-19T02:09:25Z)
- MLPruning: A Multilevel Structured Pruning Framework for Transformer-based Models [78.45898846056303]
Pruning is an effective method to reduce the memory footprint and computational cost associated with large natural language processing models.
We develop a novel MultiLevel structured Pruning framework, which uses three different levels of structured pruning: head pruning, row pruning, and block-wise sparse pruning.
arXiv Detail & Related papers (2021-05-30T22:00:44Z)
- Even your Teacher Needs Guidance: Ground-Truth Targets Dampen Regularization Imposed by Self-Distillation [0.0]
Self-distillation, where the network architectures are identical, has been observed to improve generalization accuracy.
We consider an iterative variant of self-distillation in a kernel regression setting, in which successive steps incorporate both model outputs and the ground-truth targets.
We show that any such function obtained with self-distillation can be calculated directly as a function of the initial fit, and that infinitely many distillation steps yield the same optimization problem as the original with amplified regularization.
arXiv Detail & Related papers (2021-02-25T18:56:09Z)
- Neural Pruning via Growing Regularization [82.9322109208353]
We extend regularization to tackle two central problems of pruning: pruning schedule and weight importance scoring.
Specifically, we propose an L2 regularization variant with rising penalty factors and show it can bring significant accuracy gains (see the sketch after this list).
The proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning.
arXiv Detail & Related papers (2020-12-16T20:16:28Z)
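The last entry above (Neural Pruning via Growing Regularization) describes an L2 variant with rising penalty factors, as referenced there. A minimal sketch of one plausible reading follows: the L2 coefficient grows linearly over training and is applied only to weights flagged for removal, so they are driven toward zero before being pruned. The schedule, the selection rule, and all names are illustrative assumptions, not the paper's exact algorithm.

```python
# Illustrative sketch of a "growing" L2 penalty for pruning: the coefficient on
# the weights scheduled for removal rises over training. Not the paper's code.
import torch


def growing_l2_penalty(weight: torch.Tensor, prune_mask: torch.Tensor,
                       step: int, total_steps: int,
                       max_lambda: float = 1e-2) -> torch.Tensor:
    """L2 penalty whose coefficient ramps linearly from 0 to max_lambda,
    applied only to the weights flagged for eventual removal."""
    lam = max_lambda * min(step / total_steps, 1.0)
    return lam * (weight * prune_mask).pow(2).sum()


# Toy usage: flag the smallest half of the weights, then train with the penalty
# added to a stand-in task loss that pulls every weight toward 1.
w = torch.randn(64, 64, requires_grad=True)
mask = (w.detach().abs() <= w.detach().abs().median()).float()
for step in range(1, 101):
    task_loss = (w - 1.0).pow(2).sum()
    loss = task_loss + growing_l2_penalty(w, mask, step, total_steps=100)
    loss.backward()
    with torch.no_grad():
        w -= 1e-2 * w.grad    # plain SGD step
        w.grad.zero_()
```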
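The GraNet entry above builds on gradual magnitude pruning (GMP), as referenced there. The sketch below shows the cubic sparsity ramp commonly used to schedule GMP, where sparsity rises quickly at first and flattens toward the target; the function name and hyperparameter values are illustrative, and GraNet's neuroregeneration step is not modelled.

```python
# Sketch of a cubic GMP sparsity schedule: s_t = s_f + (s_i - s_f) * (1 - p)^3,
# where p is the fraction of the pruning window already completed.
def gmp_sparsity(step: int, start_step: int, end_step: int,
                 initial_sparsity: float = 0.0,
                 final_sparsity: float = 0.9) -> float:
    """Target sparsity at a given training step under a cubic ramp."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3


# Example targets over a 10,000-step pruning window.
print([round(gmp_sparsity(s, 0, 10_000), 3) for s in (0, 2_500, 5_000, 10_000)])
```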