Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior
- URL: http://arxiv.org/abs/2010.01791v1
- Date: Mon, 5 Oct 2020 05:40:56 GMT
- Title: Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior
- Authors: Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth
- Abstract summary: Spectral-normalized identity prior (SNIP) is a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping.
We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance.
- Score: 54.629850694790036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional (unstructured) pruning methods for a Transformer model focus on
regularizing the individual weights by penalizing them toward zero. In this
work, we explore spectral-normalized identity priors (SNIP), a structured
pruning approach that penalizes an entire residual module in a Transformer
model toward an identity mapping. Our method identifies and discards
unimportant non-linear mappings in the residual connections by applying a
thresholding operator on the function norm. It is applicable to any structured
module, including a single attention head, an entire attention block, or a
feed-forward subnetwork. Furthermore, we introduce spectral normalization to
stabilize the distribution of the post-activation values of the Transformer
layers, further improving the pruning effectiveness of the proposed
methodology. We conduct experiments with BERT on 5 GLUE benchmark tasks to
demonstrate that SNIP achieves effective pruning results while maintaining
comparable performance. Specifically, we improve the performance over the
state-of-the-art by 0.5 to 1.0% on average at 50% compression ratio.
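Below is a minimal PyTorch sketch of the identity-prior idea described in the abstract: a residual module y = x + f(x) is collapsed to the identity mapping y = x once an estimate of the function norm of its non-linear branch f falls below a threshold. This is an illustration under stated assumptions, not the authors' implementation; the wrapper class PrunableResidual, the empirical norm estimate, and the threshold value are hypothetical choices made for the example.

import torch
import torch.nn as nn

class PrunableResidual(nn.Module):
    # Residual block that can be collapsed to an identity mapping (illustrative sketch).

    def __init__(self, inner: nn.Module, threshold: float = 1e-2):
        super().__init__()
        self.inner = inner          # e.g. an attention head, attention block, or feed-forward subnetwork
        self.threshold = threshold  # pruning threshold on the estimated function norm (assumed value)
        self.pruned = False         # once True, the block acts as a pure identity mapping

    def estimate_function_norm(self, x: torch.Tensor) -> torch.Tensor:
        # Crude empirical proxy for ||f||: mean L2 norm of the residual branch output on a batch.
        with torch.no_grad():
            return self.inner(x).norm(dim=-1).mean()

    def maybe_prune(self, x: torch.Tensor) -> None:
        # Thresholding operator: discard the non-linear mapping if its estimated norm is negligible.
        if self.estimate_function_norm(x) < self.threshold:
            self.pruned = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pruned:
            return x                # identity mapping: residual branch removed
        return x + self.inner(x)    # standard residual connection

# Usage sketch: wrap a feed-forward subnetwork and prune it if its residual branch is close to the zero function.
ff = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
block = PrunableResidual(ff, threshold=1e-2)
x = torch.randn(8, 16)
block.maybe_prune(x)
print(block.pruned, block(x).shape)

In the paper the prior and thresholding act during training, with a regularizer pushing each residual module toward identity; the sketch above only illustrates the pruning decision itself.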
Related papers
- UnitNorm: Rethinking Normalization for Transformers in Time Series [9.178527914585446]
Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks.
We propose UnitNorm, a novel approach that scales input vectors by their norms and modulates attention patterns.
UnitNorm's effectiveness is demonstrated across diverse time series analysis tasks, including forecasting, classification, and anomaly detection.
arXiv Detail & Related papers (2024-05-24T19:58:25Z) - Hidden Synergy: $L_1$ Weight Normalization and 1-Path-Norm Regularization [0.0]
We show how PSiLON Net's design drastically simplifies the 1-path-norm.
We propose a pruning method to achieve exact sparsity in the final stages of training.
arXiv Detail & Related papers (2024-04-29T21:25:25Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Entropy Transformer Networks: A Learning Approach via Tangent Bundle Data Manifold [8.893886200299228]
This paper focuses on an accurate and fast approach for image transformation employed in the design of CNN architectures.
A novel Entropy STN (ESTN) is proposed that interpolates on the data manifold distributions.
Experiments on challenging benchmarks show that the proposed ESTN can improve predictive accuracy over a range of computer vision tasks.
arXiv Detail & Related papers (2023-07-24T04:21:51Z) - Deterministic Decoupling of Global Features and its Application to Data Analysis [0.0]
We propose a new formalism that is based on defining transformations on submanifolds.
Through these transformations we define a normalization that, we demonstrate, allows for decoupling differentiable features.
We apply this method in the original data domain and at the output of a filter bank to regression and classification problems based on global descriptors.
arXiv Detail & Related papers (2022-07-05T15:54:39Z) - Counterbalancing Teacher: Regularizing Batch Normalized Models for Robustness [15.395021925719817]
Batch normalization (BN) is a technique for training deep neural networks that accelerates their convergence to reach higher accuracy.
We show that BN incentivizes the model to rely on low-variance features that are highly specific to the training (in-domain) data.
We propose Counterbalancing Teacher (CT) to enforce the student network's learning of robust representations.
arXiv Detail & Related papers (2022-07-04T16:16:24Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable method for semi-implicit variational inference (SIVI).
Our method optimizes a rigorous lower bound on SIVI's evidence.
arXiv Detail & Related papers (2021-01-15T11:39:09Z) - Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z) - Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI)
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
arXiv Detail & Related papers (2020-04-02T10:14:27Z)
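Both the SNIP abstract above and the ONI summary reference spectral normalization, which bounds a layer's largest singular value so that the layer (and hence the residual branch) stays close to 1-Lipschitz and its post-activation values stay well behaved. Below is a generic power-iteration sketch of that operation in PyTorch, intended as an illustration rather than the implementation used in either paper; the number of iterations and the weight shape are arbitrary example choices.

import torch

def spectral_normalize(weight: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    # Divide a 2-D weight matrix by a power-iteration estimate of its largest singular value.
    u = torch.randn(weight.shape[0])
    for _ in range(n_iters):
        v = weight.t() @ u
        v = v / (v.norm() + 1e-12)
        u = weight @ v
        u = u / (u.norm() + 1e-12)
    sigma = u @ weight @ v          # estimated spectral norm of the weight
    return weight / sigma

w = torch.randn(32, 32)
w_sn = spectral_normalize(w)
# After normalization the largest singular value is approximately 1.
print(torch.linalg.matrix_norm(w, ord=2).item(), torch.linalg.matrix_norm(w_sn, ord=2).item())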