Analyzing Redundancy in Pretrained Transformer Models
- URL: http://arxiv.org/abs/2004.04010v2
- Date: Tue, 6 Oct 2020 11:45:07 GMT
- Title: Analyzing Redundancy in Pretrained Transformer Models
- Authors: Fahim Dalvi, Hassan Sajjad, Nadir Durrani and Yonatan Belinkov
- Abstract summary: We define a notion of Redundancy, which we categorize into two classes: General Redundancy and Task-specific Redundancy.
We present an efficient feature-based transfer learning procedure, which maintains 97% performance while using at most 10% of the original neurons.
- Score: 41.07850306314594
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based deep NLP models are trained using hundreds of millions of
parameters, limiting their applicability in computationally constrained
environments. In this paper, we study the cause of these limitations by
defining a notion of Redundancy, which we categorize into two classes: General
Redundancy and Task-specific Redundancy. We dissect two popular pretrained
models, BERT and XLNet, studying how much redundancy they exhibit at a
representation-level and at a more fine-grained neuron-level. Our analysis
reveals interesting insights, such as: i) 85% of the neurons across the network
are redundant and ii) at least 92% of them can be removed when optimizing
towards a downstream task. Based on our analysis, we present an efficient
feature-based transfer learning procedure, which maintains 97% performance
while using at most 10% of the original neurons.
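As a rough, illustrative sketch of the feature-based transfer learning idea described above (keep only a small, non-redundant subset of pretrained activations and train a lightweight classifier on it), the Python snippet below uses a simple greedy correlation filter and synthetic data. The `select_non_redundant` helper, the correlation threshold, and the toy `features`/`labels` arrays are assumptions for illustration only; they are not the authors' actual selection procedure or code.

```python
# Illustrative sketch only: greedy correlation-based neuron selection followed by
# a lightweight classifier on the surviving features. The synthetic data below
# mimics redundant activations; nothing here reproduces the paper's exact method.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def select_non_redundant(features, max_corr=0.7):
    """Keep a neuron only if it is not highly correlated with any neuron already kept."""
    corr = np.corrcoef(features, rowvar=False)  # (D, D) neuron-to-neuron correlations
    kept = []
    for i in range(features.shape[1]):
        if all(abs(corr[i, j]) < max_corr for j in kept):
            kept.append(i)
    return np.array(kept)

# Toy stand-in for activations extracted from a pretrained model:
# 25 underlying signals, each duplicated 10 times with small noise (heavy redundancy).
rng = np.random.default_rng(0)
signal = rng.normal(size=(2000, 25))
features = np.repeat(signal, 10, axis=1) + 0.05 * rng.normal(size=(2000, 250))
labels = (signal[:, 0] > 0).astype(int)  # a task that depends on one underlying signal

kept = select_non_redundant(features)
X_train, X_test, y_train, y_test = train_test_split(
    features[:, kept], labels, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"kept {len(kept)}/{features.shape[1]} neurons, "
      f"test accuracy = {clf.score(X_test, y_test):.3f}")
```

On this synthetic data the filter keeps roughly 10% of the columns and the classifier remains accurate, which mirrors the spirit of the abstract's 10%/97% figures but not the paper's actual experiments.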
Related papers
- Explainable Bayesian deep learning through input-skip Latent Binary Bayesian Neural Networks [11.815986153374967]
This article advances LBBNNs by enabling covariates to skip to any succeeding layer or be excluded.
The input-skip LBBNN approach reduces network density significantly compared to standard LBBNNs, achieving over 99% reduction for small networks and over 99.9% for larger ones.
For example, on MNIST, we reached 97% accuracy and good calibration with just 935 weights, a state-of-the-art level of compression for neural networks.
arXiv Detail & Related papers (2025-03-13T15:59:03Z)
- Diffused Redundancy in Pre-trained Representations [98.55546694886819]
We take a closer look at how features are encoded in pre-trained representations.
We find that learned representations in a given layer exhibit a degree of diffuse redundancy.
Our findings shed light on the nature of representations learned by pre-trained deep neural networks.
arXiv Detail & Related papers (2023-05-31T21:00:50Z)
- Globally Optimal Training of Neural Networks with Threshold Activation Functions [63.03759813952481]
We study weight decay regularized training problems of deep neural networks with threshold activations.
We derive a simplified convex optimization formulation when the dataset can be shattered at a certain layer of the network.
arXiv Detail & Related papers (2023-03-06T18:59:13Z)
- Boosting Binary Neural Networks via Dynamic Thresholds Learning [21.835748440099586]
We introduce DySign to reduce information loss and boost the representational capacity of BNNs.
For DCNNs, DyBCNNs based on two backbones achieve 71.2% and 67.4% top-1 accuracy on the ImageNet dataset.
For ViTs, DyCCT demonstrates the superiority of the convolutional embedding layer in fully binarized ViTs and reaches 56.1% on the ImageNet dataset.
arXiv Detail & Related papers (2022-11-04T07:18:21Z)
- An Experimental Study of the Impact of Pre-training on the Pruning of a Convolutional Neural Network [0.0]
In recent years, deep neural networks have seen wide success in various application domains.
Deep neural networks usually involve a large number of parameters, which correspond to the weights of the network.
Pruning methods, in particular, attempt to reduce the size of the parameter set by identifying and removing irrelevant weights.
arXiv Detail & Related papers (2021-12-15T16:02:15Z)
- Neural Network Pruning Through Constrained Reinforcement Learning [3.2880869992413246]
We propose a general methodology for pruning neural networks.
Our proposed methodology can prune neural networks to respect pre-defined computational budgets.
We demonstrate the effectiveness of our approach through comparisons with state-of-the-art methods on standard image classification datasets.
arXiv Detail & Related papers (2021-10-16T11:57:38Z)
- MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to their large parameter capacity, but they also incur a huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
arXiv Detail & Related papers (2021-10-05T02:14:38Z)
- Dynamic Neural Diversification: Path to Computationally Sustainable Neural Networks [68.8204255655161]
Small neural networks with a constrained number of trainable parameters can be suitable, resource-efficient candidates for many simple tasks.
We explore the diversity of the neurons within the hidden layer during the learning process.
We analyze how the diversity of the neurons affects predictions of the model.
arXiv Detail & Related papers (2021-09-20T15:12:16Z)
- Learning Neural Network Subspaces [74.44457651546728]
Recent observations have advanced our understanding of the neural network optimization landscape.
With a similar computational cost as training one model, we learn lines, curves, and simplexes of high-accuracy neural networks.
arXiv Detail & Related papers (2021-02-20T23:26:58Z)
- Bayesian Neural Networks at Scale: A Performance Analysis and Pruning Study [2.3605348648054463]
This work explores the use of high performance computing with distributed training to address the challenges of training BNNs at scale.
We present a performance and scalability comparison of training the VGG-16 and Resnet-18 models on a Cray-XC40 cluster.
arXiv Detail & Related papers (2020-05-23T23:15:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.