Of Non-Linearity and Commutativity in BERT
- URL: http://arxiv.org/abs/2101.04547v3
- Date: Thu, 14 Jan 2021 10:23:01 GMT
- Title: Of Non-Linearity and Commutativity in BERT
- Authors: Sumu Zhao, Damian Pascual, Gino Brunner, Roger Wattenhofer
- Abstract summary: We study the interactions between layers in BERT and show that, while the layers exhibit some hierarchical structure, they extract features in a fuzzy manner.
Our results suggest that BERT has an inductive bias towards layer commutativity, which we find is mainly due to the skip connections.
- Score: 8.295319152986316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work we provide new insights into the transformer architecture, and
in particular, its best-known variant, BERT. First, we propose a method to
measure the degree of non-linearity of different elements of transformers.
Next, we focus our investigation on the feed-forward networks (FFN) inside
transformers, which contain 2/3 of the model parameters and have so far not
received much attention. We find that FFNs are an inefficient yet important
architectural element and that they cannot simply be replaced by attention
blocks without a degradation in performance. Moreover, we study the
interactions between layers in BERT and show that, while the layers exhibit
some hierarchical structure, they extract features in a fuzzy manner. Our
results suggest that BERT has an inductive bias towards layer commutativity,
which we find is mainly due to the skip connections. This provides a
justification for the strong performance of recurrent and weight-shared
transformer models.
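As a rough illustration of the layer-commutativity finding, the sketch below swaps two adjacent BERT encoder layers and compares the resulting hidden states with those of the unmodified model. This is a minimal probe assuming the HuggingFace transformers library and the bert-base-uncased checkpoint, not the paper's exact experimental protocol.

```python
# Minimal sketch (not the paper's exact protocol): probe layer commutativity in
# BERT by swapping two adjacent encoder layers and comparing the final hidden
# states with those of the unmodified model.
import copy

import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

inputs = tokenizer("The skip connections make BERT layers nearly commutative.",
                   return_tensors="pt")

with torch.no_grad():
    baseline = model(**inputs).last_hidden_state

# Swap encoder layers i and i+1 on a copy of the model.
i = 4
swapped = copy.deepcopy(model)
layers = swapped.encoder.layer
layers[i], layers[i + 1] = layers[i + 1], layers[i]

with torch.no_grad():
    permuted = swapped(**inputs).last_hidden_state

# A token-wise cosine similarity close to 1 indicates that the two layers
# approximately commute on this input.
cos = torch.nn.functional.cosine_similarity(baseline, permuted, dim=-1)
print(f"mean cosine similarity after swapping layers {i} and {i + 1}: "
      f"{cos.mean().item():.4f}")
```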
Related papers
- FuseGPT: Learnable Layers Fusion of Generative Pre-trained Transformers [30.88764351013966]
Generative Pre-trained Transformers (GPTs) have demonstrated remarkable performance across diverse domains.
Recent works observe redundancy across transformer blocks and develop compression methods based on structured pruning of the unimportant blocks.
We propose FuseGPT, a novel methodology that recycles the pruned transformer blocks to recover model performance.
arXiv Detail & Related papers (2024-11-21T09:49:28Z)
- Skip-Layer Attention: Bridging Abstract and Detailed Dependencies in Transformers [56.264673865476986]
This paper introduces Skip-Layer Attention (SLA) to enhance Transformer models.
SLA improves the model's ability to capture dependencies between high-level abstract features and low-level details.
Our implementation extends the Transformer's functionality by enabling queries in a given layer to interact with keys and values from both the current layer and one preceding layer.
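As a rough sketch of that mechanism (a simplified, single-head version with our own naming, not the authors' implementation), the module below lets queries from the current layer attend over keys and values built from both the current and the preceding layer's hidden states.

```python
# Simplified single-head sketch of the skip-layer idea described above:
# queries from the current layer attend over keys/values formed from BOTH the
# current hidden states and those of one preceding layer.
import math

import torch
import torch.nn as nn


class SkipLayerAttention(nn.Module):
    def __init__(self, d_model: int):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        self.d_model = d_model

    def forward(self, h_current: torch.Tensor, h_previous: torch.Tensor) -> torch.Tensor:
        # Keys/values come from the concatenation (along the sequence axis)
        # of the current and the preceding layer's hidden states.
        context = torch.cat([h_current, h_previous], dim=1)
        q = self.q(h_current)                      # (B, T, D)
        k = self.k(context)                        # (B, 2T, D)
        v = self.v(context)                        # (B, 2T, D)
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        return torch.softmax(scores, dim=-1) @ v   # (B, T, D)


# Toy usage on random hidden states of shape (batch=2, seq=8, d_model=64).
h_prev, h_cur = torch.randn(2, 8, 64), torch.randn(2, 8, 64)
print(SkipLayerAttention(64)(h_cur, h_prev).shape)  # torch.Size([2, 8, 64])
```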
arXiv Detail & Related papers (2024-06-17T07:24:38Z)
- MABViT -- Modified Attention Block Enhances Vision Transformers [0.0]
We propose a novel transformer variant that integrates non-linearity within the attention block.
We implement the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset.
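A minimal sketch of what a GLU-style activation on the value projection could look like (our own simplified formulation, assuming PyTorch; the exact MABViT design may differ):

```python
# Rough sketch of a GLU-style activation applied to the value projection
# inside attention, as described above; not the paper's exact formulation.
import torch
import torch.nn as nn


class GLUValueProjection(nn.Module):
    """Value projection with a GLU-style gate: V = (x W_v) * act(x W_g)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.value = nn.Linear(d_model, d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SiLU-gated variant (SwiGLU-like); a sigmoid gate gives the classic GLU.
        return self.value(x) * torch.nn.functional.silu(self.gate(x))


v = GLUValueProjection(64)(torch.randn(2, 8, 64))
print(v.shape)  # torch.Size([2, 8, 64])
```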
arXiv Detail & Related papers (2023-12-03T09:00:31Z)
- ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z)
- Deep Transformers without Shortcuts: Modifying Self-attention for Faithful Signal Propagation [105.22961467028234]
Skip connections and normalisation layers are ubiquitous in the training of Deep Neural Networks (DNNs).
Recent approaches such as Deep Kernel Shaping have made progress towards reducing our reliance on them.
But these approaches are incompatible with the self-attention layers present in transformers.
arXiv Detail & Related papers (2023-02-20T21:26:25Z)
- The Lazy Neuron Phenomenon: On Emergence of Activation Sparsity in Transformers [59.87030906486969]
This paper studies the curious phenomenon that the activation maps of machine learning models with Transformer architectures are sparse.
We show that sparsity is a prevalent phenomenon that occurs for both natural language processing and vision tasks.
We discuss how sparsity immediately implies a way to significantly reduce the FLOP count and improve efficiency for Transformers.
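A minimal way to inspect this kind of sparsity in practice (not the paper's measurement protocol) is to hook the FFN intermediate activations and count near-zero entries, as in the sketch below. Note that BERT's FFN uses GELU, which rarely produces exact zeros, so a small threshold is used here.

```python
# Minimal sketch: estimate activation sparsity in BERT's FFN intermediate
# activations with forward hooks. Counts near-zero entries under a small
# threshold; the paper studies exact zeros in ReLU-style FFNs.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()

sparsity = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # Fraction of intermediate activations with magnitude below 1e-2.
        sparsity[layer_idx] = (output.abs() < 1e-2).float().mean().item()
    return hook

for idx, layer in enumerate(model.encoder.layer):
    layer.intermediate.register_forward_hook(make_hook(idx))

inputs = tokenizer("Sparsity emerges in transformer feed-forward layers.",
                   return_tensors="pt")
with torch.no_grad():
    model(**inputs)

for idx in sorted(sparsity):
    print(f"layer {idx:2d}: ~{100 * sparsity[idx]:.1f}% near-zero FFN activations")
```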
arXiv Detail & Related papers (2022-10-12T15:25:19Z)
- XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z)
- SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction [33.29925021875922]
We propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI) and Attention Refinement Function (ARF).
ISP explores the semantic diversity across different receptive fields. CDI builds global attention and interaction among different levels in a decoupled space, which also avoids heavy computation.
Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state-of-the-art by a significant margin in dense image prediction tasks.
arXiv Detail & Related papers (2021-09-18T16:29:14Z)
- Effects of Parameter Norm Growth During Transformer Training: Inductive Bias from Gradient Descent [44.44543743806831]
We study the tendency of transformer parameters to grow in magnitude during training and the effect this has on the network's activations.
As the parameters grow in magnitude, we prove that the network approximates a discretized network with saturated activation functions.
Our results suggest saturation is a new characterization of an inductive bias implicit in GD of particular interest for NLP.
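The saturation argument can be illustrated numerically: scaling the pre-softmax scores by a growing factor c (as happens when parameter norms grow) drives the softmax towards a one-hot, argmax-like output. A small, purely illustrative sketch:

```python
# Numeric illustration of saturation: as the scale factor c grows, softmax of
# c * scores approaches a one-hot (argmax) distribution, i.e. a "discretized"
# network with saturated activations.
import torch

scores = torch.tensor([1.0, 0.5, -0.3, 0.2])
for c in (1, 10, 100):
    probs = torch.softmax(c * scores, dim=0)
    print(f"c={c:3d}  softmax(c*scores) = {[round(p, 3) for p in probs.tolist()]}")
# Output concentrates on the argmax entry (index 0) as c increases.
```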
arXiv Detail & Related papers (2020-10-19T17:40:38Z)
- Rewiring the Transformer with Depth-Wise LSTMs [55.50278212605607]
We present a Transformer with depth-wise LSTMs connecting cascading Transformer layers and sub-layers.
Experiments with a 6-layer Transformer show significant BLEU improvements on the WMT 14 English-German and English-French tasks and on the OPUS-100 many-to-many multilingual NMT task.
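A simplified sketch of the depth-wise LSTM idea (our own toy version, not the paper's exact architecture): an LSTM cell runs across layer depth, consuming each Transformer layer's output and carrying its hidden and cell state from layer to layer in place of plain residual connections.

```python
# Toy sketch of a depth-wise LSTM connecting cascading Transformer layers;
# not the paper's exact architecture.
import torch
import torch.nn as nn


class DepthWiseLSTMEncoder(nn.Module):
    def __init__(self, d_model: int, n_layers: int, n_heads: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        self.depth_lstm = nn.LSTMCell(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h = x.reshape(b * t, d)      # depth-wise LSTM hidden state, per token
        c = torch.zeros_like(h)      # depth-wise LSTM cell state
        for layer in self.layers:
            out = layer(h.view(b, t, d))                  # ordinary Transformer layer
            h, c = self.depth_lstm(out.reshape(b * t, d), (h, c))
        return h.view(b, t, d)


# Toy usage on random token embeddings.
enc = DepthWiseLSTMEncoder(d_model=64, n_layers=6)
print(enc(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])
```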
arXiv Detail & Related papers (2020-07-13T09:19:34Z)
- Applying the Transformer to Character-level Transduction [68.91664610425114]
The transformer has been shown to outperform recurrent neural network-based sequence-to-sequence models in various word-level NLP tasks.
We show that with a large enough batch size, the transformer does indeed outperform recurrent models for character-level tasks.
arXiv Detail & Related papers (2020-05-20T17:25:43Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.