The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in
Transformers
- URL: http://arxiv.org/abs/2210.06313v2
- Date: Fri, 9 Jun 2023 21:53:43 GMT
- Title: The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in
Transformers
- Authors: Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh
Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, Sanjiv
Kumar
- Abstract summary: This paper studies the curious phenomenon that machine learning models with Transformer architectures have sparse activation maps.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
- Score: 59.87030906486969
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper studies the curious phenomenon that machine learning models
with Transformer architectures have sparse activation maps. By activation
map we refer to the intermediate output of the multi-layer perceptrons (MLPs)
after the ReLU activation function, and by sparse we mean that on average very
few entries (e.g., 3.0% for T5-Base and 6.3% for ViT-B16) are nonzero for each
input to the MLP. Moreover, larger Transformers with more layers and wider MLP
hidden dimensions are sparser as measured by the percentage of nonzero entries.
Through extensive experiments we demonstrate that the emergence of sparsity is
a prevalent phenomenon that occurs for both natural language processing and
vision tasks, on both training and evaluation data, for Transformers of various
configurations, at layers of all depth levels, as well as for other
architectures including MLP-mixers and 2-layer MLPs. We show that sparsity also
emerges using training datasets with random labels, or with random inputs, or
with an infinite amount of data, demonstrating that sparsity is not a result of a
specific family of datasets. We discuss how sparsity immediately implies a way
to significantly reduce the FLOP count and improve efficiency for Transformers.
Moreover, we demonstrate, perhaps surprisingly, that enforcing even sparser
activations via Top-k thresholding with a small value of k brings a collection
of desirable but previously missing properties to Transformers, namely less
sensitivity to noisy training data, more robustness to input corruptions, and
better calibration of their prediction confidence.
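
To make the abstract's two technical points concrete, the sketch below shows how activation sparsity (the fraction of nonzero post-ReLU MLP entries) could be measured and how Top-k thresholding of the hidden layer could be enforced. This is a minimal, hypothetical PyTorch illustration, not the authors' code; the names SparseMLP and activation_sparsity and the T5-Base-like sizes are assumptions made for the example.

```python
# Minimal sketch (assumed example, not the paper's implementation) of measuring
# activation sparsity in a Transformer-style MLP block and applying Top-k
# thresholding to its hidden activations.
from typing import Optional

import torch
import torch.nn as nn
import torch.nn.functional as F


def activation_sparsity(hidden: torch.Tensor) -> float:
    """Fraction of nonzero entries in a post-ReLU activation map."""
    return (hidden != 0).float().mean().item()


class SparseMLP(nn.Module):
    """Two-layer MLP as used inside a Transformer block, with optional Top-k."""

    def __init__(self, d_model: int, d_hidden: int, top_k: Optional[int] = None):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_model)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.fc1(x))  # the "activation map" studied in the paper
        if self.top_k is not None:
            # Keep only the k largest activations per token; zero out the rest.
            values, indices = torch.topk(h, self.top_k, dim=-1)
            h = torch.zeros_like(h).scatter_(-1, indices, values)
        return self.fc2(h)


if __name__ == "__main__":
    torch.manual_seed(0)
    mlp = SparseMLP(d_model=768, d_hidden=3072, top_k=128)  # T5-Base-like sizes
    x = torch.randn(4, 16, 768)  # (batch, tokens, d_model)
    h = F.relu(mlp.fc1(x))
    print(f"nonzero fraction after ReLU: {activation_sparsity(h):.1%}")
    print(mlp(x).shape)  # torch.Size([4, 16, 768])
```

Note that a randomly initialized MLP on random inputs will show roughly 50% nonzero entries; the low single-digit percentages quoted in the abstract emerge through training. The sketch also still performs dense matrix multiplications; the FLOP reduction discussed in the abstract would come from skipping the columns of fc2 that correspond to zeroed activations.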
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - MLP Can Be A Good Transformer Learner [73.01739251050076]
The self-attention mechanism is key to the Transformer but is often criticized for its computational demands.
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers.
arXiv Detail & Related papers (2024-04-08T16:40:15Z) - Sparse Binary Transformers for Multivariate Time Series Modeling [1.3965477771846404]
We show that lightweight Compressed Neural Networks can achieve accuracy comparable to dense floating-point Transformers.
Our model achieves favorable results across three time series learning tasks: classification, anomaly detection, and single-step forecasting.
We measure the computational savings of our approach over a range of metrics including parameter count, bit size, and floating-point operation (FLOP) count.
arXiv Detail & Related papers (2023-08-09T00:23:04Z) - U-shaped Transformer: Retain High Frequency Context in Time Series
Analysis [0.5710971447109949]
In this paper, we consider the low-pass characteristics of Transformers and try to incorporate their advantages.
We introduce patch merge and split operations to extract features at different scales, and use larger datasets to make full use of the Transformer backbone.
Our experiments demonstrate that the model performs at an advanced level across multiple datasets with relatively low cost.
arXiv Detail & Related papers (2023-07-18T07:15:26Z) - Towards Data-Efficient Detection Transformers [77.43470797296906]
We show most detection transformers suffer from significant performance drops on small-size datasets.
We empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR.
We introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency.
arXiv Detail & Related papers (2022-03-17T17:56:34Z) - Efficient Language Modeling with Sparse all-MLP [53.81435968051093]
All-MLPs can match Transformers in language modeling, but still lag behind in downstream tasks.
We propose sparse all-MLPs with mixtures-of-experts (MoEs) in both the feature and input (token) dimensions.
We evaluate its zero-shot in-context learning performance on six downstream tasks, and find that it surpasses Transformer-based MoEs and dense Transformers.
arXiv Detail & Related papers (2022-03-14T04:32:19Z) - Bayesian Transformer Language Models for Speech Recognition [59.235405107295655]
State-of-the-art neural language models (LMs) represented by Transformers are highly complex.
This paper proposes a full Bayesian learning framework for Transformer LM estimation.
arXiv Detail & Related papers (2021-02-09T10:55:27Z)