Analyzing Redundancy in Pretrained Transformer Models
- URL: http://arxiv.org/abs/2004.04010v2
- Date: Tue, 6 Oct 2020 11:45:07 GMT
- Title: Analyzing Redundancy in Pretrained Transformer Models
- Authors: Fahim Dalvi, Hassan Sajjad, Nadir Durrani and Yonatan Belinkov
- Abstract summary: We define a notion of Redundancy, which we categorize into two classes: General Redundancy and Task-specific Redundancy.
We present an efficient feature-based transfer learning procedure, which maintains 97% performance while using at most 10% of the original neurons.
- Score: 41.07850306314594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based deep NLP models are trained using hundreds of millions of
parameters, limiting their applicability in computationally constrained
environments. In this paper, we study the cause of these limitations by
defining a notion of Redundancy, which we categorize into two classes: General
Redundancy and Task-specific Redundancy. We dissect two popular pretrained
models, BERT and XLNet, studying how much redundancy they exhibit at a
representation-level and at a more fine-grained neuron-level. Our analysis
reveals interesting insights, such as: i) 85% of the neurons across the network
are redundant and ii) at least 92% of them can be removed when optimizing
towards a downstream task. Based on our analysis, we present an efficient
feature-based transfer learning procedure, which maintains 97% performance
while using at most 10% of the original neurons.
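As a rough, illustrative sketch of the feature-based transfer learning idea described above (keep only a small, non-redundant subset of pretrained activations and train a lightweight classifier on it), the Python snippet below uses a simple greedy correlation filter and synthetic data. The `select_non_redundant` helper, the correlation threshold, and the toy `features`/`labels` arrays are assumptions for illustration only; they are not the authors' actual selection procedure or code.

```python
# Illustrative sketch only: greedy correlation-based neuron selection followed by
# a lightweight classifier on the surviving features. The synthetic data below
# mimics redundant activations; nothing here reproduces the paper's exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def select_non_redundant(features, max_corr=0.7):
    """Keep a neuron only if it is not highly correlated with any neuron already kept."""
    corr = np.corrcoef(features, rowvar=False)  # (D, D) neuron-to-neuron correlations
    kept = []
    for i in range(features.shape[1]):
        if all(abs(corr[i, j]) < max_corr for j in kept):
            kept.append(i)
    return np.array(kept)

# Toy stand-in for activations extracted from a pretrained model:
# 25 underlying signals, each duplicated 10 times with small noise (heavy redundancy).
rng = np.random.default_rng(0)
signal = rng.normal(size=(2000, 25))
features = np.repeat(signal, 10, axis=1) + 0.05 * rng.normal(size=(2000, 250))
labels = (signal[:, 0] > 0).astype(int)  # a task that depends on one underlying signal

kept = select_non_redundant(features)
X_train, X_test, y_train, y_test = train_test_split(
    features[:, kept], labels, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"kept {len(kept)}/{features.shape[1]} neurons, "
      f"test accuracy = {clf.score(X_test, y_test):.3f}")
```

On this synthetic data the filter keeps roughly 10% of the columns and the classifier remains accurate, which mirrors the spirit of the abstract's 10%/97% figures but not the paper's actual experiments.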
Related papers
- Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks [11.815986153374967]
This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded.
The input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones.
For example, on MNIST, we reached 97% accuracy and good calibration with just 935 weights, a state-of-the-art level of compression for neural networks.
arXiv Detail & Related papers (2025-03-13T15:59:03Z)
- Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Boosting Binary Neural Networks via Dynamic Thresholds Learning [21.835748440099586]
We introduce DySign to reduce information loss and boost the representational capacity of BNNs.
For DCNNs, DyBCNNs based on two backbones achieve 71.2% and 67.4% top-1 accuracy on the ImageNet dataset.
For ViTs, DyCCT demonstrates the superiority of the convolutional embedding layer in fully binarized ViTs and reaches 56.1% on the ImageNet dataset.
arXiv Detail & Related papers (2022-11-04T07:18:21Z)
- An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network [0.0]
In recent years, deep neural networks have seen wide success in various application domains.
Deep neural networks usually involve a large number of parameters, which correspond to the weights of the network.
Pruning methods, in particular, attempt to reduce the size of the parameter set by identifying and removing irrelevant weights.
arXiv Detail & Related papers (2021-12-15T16:02:15Z)
- Neural Network Pruning Through Constrained Reinforcement Learning [3.2880869992413246]
We propose a general methodology for pruning neural networks.
Our proposed methodology can prune neural networks to respect pre-defined computational budgets.
We demonstrate the effectiveness of our approach through comparisons with state-of-the-art methods on standard image classification datasets.
arXiv Detail & Related papers (2021-10-16T11:57:38Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but they also incur a huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable, resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study [2.3605348648054463]
This work explores the use of high performance computing with distributed training to address the challenges of training BNNs at scale.
We present a performance and scalability comparison of training the VGG-16 and Resnet-18 models on a Cray-XC40 cluster.
arXiv Detail & Related papers (2020-05-23T23:15:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.