Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior
- URL: http://arxiv.org/abs/2010.01791v1
- Date: Mon, 5 Oct 2020 05:40:56 GMT
- Title: Pruning Redundant Mappings in Transformer Models via Spectral-Normalized Identity Prior
- Authors: Zi Lin, Jeremiah Zhe Liu, Zi Yang, Nan Hua, Dan Roth
- Abstract summary: Spectral-normalized identity prior (SNIP) is a structured pruning approach that penalizes an entire residual module in a Transformer model toward an identity mapping.
We conduct experiments with BERT on 5 GLUE benchmark tasks to demonstrate that SNIP achieves effective pruning results while maintaining comparable performance.
- Score: 54.629850694790036
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Traditional (unstructured) pruning methods for a Transformer model focus on
regularizing the individual weights by penalizing them toward zero. In this
work, we explore spectral-normalized identity priors (SNIP), a structured
pruning approach that penalizes an entire residual module in a Transformer
model toward an identity mapping. Our method identifies and discards
unimportant non-linear mappings in the residual connections by applying a
thresholding operator on the function norm. It is applicable to any structured
module, including a single attention head, an entire attention block, or a
feed-forward subnetwork. Furthermore, we introduce spectral normalization to
stabilize the distribution of the post-activation values of the Transformer
layers, further improving the pruning effectiveness of the proposed
methodology. We conduct experiments with BERT on 5 GLUE benchmark tasks to
demonstrate that SNIP achieves effective pruning results while maintaining
comparable performance. Specifically, we improve the performance over the
state-of-the-art by 0.5 to 1.0% on average at 50% compression ratio.
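Below is a minimal PyTorch sketch of the identity-prior idea described in the abstract: a residual module y = x + f(x) is collapsed to the identity mapping y = x once an estimate of the function norm of its non-linear branch f falls below a threshold. This is an illustration under stated assumptions, not the authors' implementation; the wrapper class PrunableResidual, the empirical norm estimate, and the threshold value are hypothetical choices made for the example.

import torch
import torch.nn as nn

class PrunableResidual(nn.Module):
    # Residual block that can be collapsed to an identity mapping (illustrative sketch).

    def __init__(self, inner: nn.Module, threshold: float = 1e-2):
        super().__init__()
        self.inner = inner          # e.g. an attention head, attention block, or feed-forward subnetwork
        self.threshold = threshold  # pruning threshold on the estimated function norm (assumed value)
        self.pruned = False         # once True, the block acts as a pure identity mapping

    def estimate_function_norm(self, x: torch.Tensor) -> torch.Tensor:
        # Crude empirical proxy for ||f||: mean L2 norm of the residual branch output on a batch.
        with torch.no_grad():
            return self.inner(x).norm(dim=-1).mean()

    def maybe_prune(self, x: torch.Tensor) -> None:
        # Thresholding operator: discard the non-linear mapping if its estimated norm is negligible.
        if self.estimate_function_norm(x) < self.threshold:
            self.pruned = True

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.pruned:
            return x                # identity mapping: residual branch removed
        return x + self.inner(x)    # standard residual connection

# Usage sketch: wrap a feed-forward subnetwork and prune it if its residual branch is close to the zero function.
ff = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 16))
block = PrunableResidual(ff, threshold=1e-2)
x = torch.randn(8, 16)
block.maybe_prune(x)
print(block.pruned, block(x).shape)

In the paper the prior and thresholding act during training, with a regularizer pushing each residual module toward identity; the sketch above only illustrates the pruning decision itself.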
Related papers
- UnitNorm: Rethinking Normalization for Transformers in Time Series [9.178527914585446]
Normalization techniques are crucial for enhancing Transformer models' performance and stability in time series analysis tasks.
We propose UnitNorm, a novel approach that scales input vectors by their norms and modulates attention patterns.
UnitNorm's effectiveness is demonstrated across diverse time series analysis tasks, including forecasting, classification, and anomaly detection.
arXiv Detail & Related papers (2024-05-24T19:58:25Z) - Hidden Synergy: $L_1$ Weight Normalization and 1-Path-Norm Regularization [0.0]
We show how PSiLON Net's design drastically simplifies the 1-path-norm.
We propose a pruning method to achieve exact sparsity in the final stages of training.
arXiv Detail & Related papers (2024-04-29T21:25:25Z) - Stable Nonconvex-Nonconcave Training via Linear Interpolation [51.668052890249726]
This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training.
We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators.
arXiv Detail & Related papers (2023-10-20T12:45:12Z) - Entropy Transformer Networks: A Learning Approach via Tangent Bundle Data Manifold [8.893886200299228]
This paper focuses on an accurate and fast approach for image transformation employed in the design of CNN architectures.
A novel Entropy STN (ESTN) is proposed that interpolates on the data manifold distributions.
Experiments on challenging benchmarks show that the proposed ESTN can improve predictive accuracy over a range of computer vision tasks.
arXiv Detail & Related papers (2023-07-24T04:21:51Z) - Deterministic Decoupling of Global Features and its Application to Data Analysis [0.0]
We propose a new formalism that is based on defining transformations on submanifolds.
Through these transformations we define a normalization that, we demonstrate, allows for decoupling differentiable features.
We apply this method in the original data domain and at the output of a filter bank to regression and classification problems based on global descriptors.
arXiv Detail & Related papers (2022-07-05T15:54:39Z) - Counterbalancing Teacher: Regularizing Batch Normalized Models for Robustness [15.395021925719817]
Batch normalization (BN) is a technique for training deep neural networks that accelerates their convergence to reach higher accuracy.
We show that BN incentivizes the model to rely on low-variance features that are highly specific to the training (in-domain) data.
We propose Counterbalancing Teacher (CT) to enforce the student network's learning of robust representations.
arXiv Detail & Related papers (2022-07-04T16:16:24Z) - Global Vision Transformer Pruning with Hessian-Aware Saliency [93.33895899995224]
This work challenges the common design philosophy of the Vision Transformer (ViT) model with uniform dimension across all the stacked blocks in a model stage.
We derive a novel Hessian-based structural pruning criteria comparable across all layers and structures, with latency-aware regularization for direct latency reduction.
Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently.
arXiv Detail & Related papers (2021-10-10T18:04:59Z) - Efficient Semi-Implicit Variational Inference [65.07058307271329]
We propose an efficient and scalable method for semi-implicit variational inference (SIVI).
Our method optimizes a rigorous lower bound on SIVI's evidence.
arXiv Detail & Related papers (2021-01-15T11:39:09Z) - Improve Generalization and Robustness of Neural Networks via Weight Scale Shifting Invariant Regularizations [52.493315075385325]
We show that a family of regularizers, including weight decay, is ineffective at penalizing the intrinsic norms of weights for networks with homogeneous activation functions.
We propose an improved regularizer that is invariant to weight scale shifting and thus effectively constrains the intrinsic norm of a neural network.
arXiv Detail & Related papers (2020-08-07T02:55:28Z) - Controllable Orthogonalization in Training DNNs [96.1365404059924]
Orthogonality is widely used for training deep neural networks (DNNs) due to its ability to maintain all singular values of the Jacobian close to 1.
This paper proposes a computationally efficient and numerically stable orthogonalization method using Newton's iteration (ONI)
We show that our method improves the performance of image classification networks by effectively controlling the orthogonality to provide an optimal tradeoff between optimization benefits and representational capacity reduction.
We also show that ONI stabilizes the training of generative adversarial networks (GANs) by maintaining the Lipschitz continuity of a network, similar to spectral normalization.
arXiv Detail & Related papers (2020-04-02T10:14:27Z)
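Both the SNIP abstract above and the ONI summary reference spectral normalization, which bounds a layer's largest singular value so that the layer (and hence the residual branch) stays close to 1-Lipschitz and its post-activation values stay well behaved. Below is a generic power-iteration sketch of that operation in PyTorch, intended as an illustration rather than the implementation used in either paper; the number of iterations and the weight shape are arbitrary example choices.

import torch

def spectral_normalize(weight: torch.Tensor, n_iters: int = 20) -> torch.Tensor:
    # Divide a 2-D weight matrix by a power-iteration estimate of its largest singular value.
    u = torch.randn(weight.shape[0])
    for _ in range(n_iters):
        v = weight.t() @ u
        v = v / (v.norm() + 1e-12)
        u = weight @ v
        u = u / (u.norm() + 1e-12)
    sigma = u @ weight @ v          # estimated spectral norm of the weight
    return weight / sigma

w = torch.randn(32, 32)
w_sn = spectral_normalize(w)
# After normalization the largest singular value is approximately 1.
print(torch.linalg.matrix_norm(w, ord=2).item(), torch.linalg.matrix_norm(w_sn, ord=2).item())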