Why "classic" Transformers are shallow and how to make them go deep
- URL: http://arxiv.org/abs/2312.06182v2
- Date: Fri, 2 Feb 2024 02:53:22 GMT
- Title: Why "classic" Transformers are shallow and how to make them go deep
- Authors: Yueyao Yu, Yin Zhang
- Abstract summary: The key innovation in the Transformer is a Self-Attention (SA) mechanism designed to capture contextual information.
Extending the original Transformer design to models of greater depth has proven exceedingly challenging.
We propose a new strategy of surgically removing excessive token similarity, in contrast to the existing approach of explicitly or implicitly diminishing the SA mechanism.
- Score: 4.520356456308492
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since its introduction in 2017, Transformer has emerged as the leading neural
network architecture, catalyzing revolutionary advancements in many AI
disciplines. The key innovation in Transformer is a Self-Attention (SA)
mechanism designed to capture contextual information. However, extending the
original Transformer design to models of greater depth has proven exceedingly
challenging, if not impossible. Even though various modifications have been
proposed in order to stack more layers of SA mechanism into deeper models, a
full understanding of this depth problem remains lacking. In this paper, we
conduct a comprehensive investigation, both theoretically and empirically, to
substantiate the claim that the depth problem is caused by \emph{token
similarity escalation}; that is, tokens grow increasingly alike after repeated
applications of the SA mechanism. Our analysis reveals that, driven by the
invariant leading eigenspace and large spectral gaps of attention matrices,
token similarity provably escalates at a linear rate. Based on the gained
insight, we propose a new strategy of surgically removing excessive similarity
in contrast to the existing approach of diminishing the SA mechanism explicitly
or implicitly (such as in pre-norm transformers). Preliminary experimental
results confirm the effectiveness of the proposed strategy in small-scale
post-norm Transformer models.
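To make the token-similarity-escalation claim concrete, here is a minimal NumPy sketch (not the authors' code) of the phenomenon under simplifying assumptions: residual connections, layer normalization, and MLP blocks are omitted, the weight matrices are random, and the final "removal" step (subtracting the mean token) is only a hypothetical stand-in for the surgical-removal strategy the paper proposes. The sketch repeatedly applies a softmax self-attention map to random token vectors and tracks their average pairwise cosine similarity together with the spectral gap of each row-stochastic attention matrix.

```python
# Minimal sketch of "token similarity escalation" under repeated self-attention.
# Simplified: no residuals, no LayerNorm, no MLP; random weights; the "removal"
# step at the end is a hypothetical illustration, not the paper's procedure.
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_layers = 32, 64, 12

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def avg_cosine_similarity(X):
    """Mean pairwise cosine similarity between token vectors (rows of X)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    iu = np.triu_indices(len(X), k=1)
    return S[iu].mean()

def sa_layer(X, Wq, Wk, Wv):
    """One simplified self-attention application (residual and norm omitted)."""
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(d_model))  # row-stochastic attention matrix
    return A @ (X @ Wv), A

X = rng.standard_normal((n_tokens, d_model))
for layer in range(n_layers):
    Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
                  for _ in range(3))
    X, A = sa_layer(X, Wq, Wk, Wv)
    # A is row-stochastic, so its leading eigenvalue is 1 with the all-ones
    # eigenvector (the invariant leading eigenspace); the gap between 1 and the
    # second-largest eigenvalue magnitude controls how strongly the tokens are
    # pulled toward their common component.
    ev = np.sort(np.abs(np.linalg.eigvals(A)))[::-1]
    print(f"layer {layer:2d}  avg cos sim = {avg_cosine_similarity(X):.3f}  "
          f"spectral gap = {ev[0] - ev[1]:.3f}")

# Hypothetical stand-in for "removing excessive similarity": project out the
# shared component by subtracting the mean token (not the paper's method).
X_cleaned = X - X.mean(axis=0, keepdims=True)
print(f"after subtracting the mean token: avg cos sim = {avg_cosine_similarity(X_cleaned):.3f}")
```

In this stripped-down setting the average cosine similarity typically climbs toward 1 within a handful of layers while the attention matrices keep a large spectral gap, which matches the qualitative mechanism the abstract describes; the paper itself analyzes full post-norm blocks and establishes a linear escalation rate.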
Related papers
- Interpreting Affine Recurrence Learning in GPT-style Transformers [54.01174470722201]
In-context learning (ICL) allows GPT-style transformers to generalize during inference without modifying their weights.
This paper focuses specifically on their ability to learn and predict affine recurrences as an ICL task.
We analyze the model's internal operations using both empirical and theoretical approaches.
arXiv Detail & Related papers (2024-10-22T21:30:01Z) - What Does It Mean to Be a Transformer? Insights from a Theoretical Hessian Analysis [8.008567379796666]
The Transformer architecture has inarguably revolutionized deep learning.
At its core, the attention block differs in form and functionality from most other architectural components in deep learning.
The root causes behind these outward manifestations, and the precise mechanisms that govern them, remain poorly understood.
arXiv Detail & Related papers (2024-10-14T18:15:02Z) - Unveil Benign Overfitting for Transformer in Vision: Training Dynamics, Convergence, and Generalization [88.5582111768376]
We study the optimization of a Transformer composed of a self-attention layer with softmax followed by a fully connected layer under gradient descent on a certain data distribution model.
Our results establish a sharp condition that can distinguish between the small test error phase and the large test error regime, based on the signal-to-noise ratio in the data model.
arXiv Detail & Related papers (2024-09-28T13:24:11Z) - Repeat After Me: Transformers are Better than State Space Models at Copying [53.47717661441142]
We show that while generalized state space models are promising in terms of inference-time efficiency, they are limited compared to transformer models on tasks that require copying from the input context.
arXiv Detail & Related papers (2024-02-01T21:44:11Z) - On the Convergence of Encoder-only Shallow Transformers [62.639819460956176]
We build the global convergence theory of encoder-only shallow Transformers under a realistic setting.
Our results can pave the way for a better understanding of modern Transformers, particularly on training dynamics.
arXiv Detail & Related papers (2023-11-02T20:03:05Z) - Attention Is Not All You Need Anymore [3.9693969407364427]
We propose a family of drop-in replacements for the self-attention mechanism in the Transformer.
Experimental results show that replacing the self-attention mechanism with the SHE clearly improves the performance of the Transformer.
The proposed Extractors also have the potential to run faster than the self-attention mechanism, and in some cases already do.
arXiv Detail & Related papers (2023-08-15T09:24:38Z) - Transformers from an Optimization Perspective [24.78739299952529]
We study the problem of finding an energy function underlying the Transformer model.
By finding such a function, we can reinterpret Transformers as the unfolding of an interpretable optimization process.
This work contributes to our intuition and understanding of Transformers, while potentially laying the ground-work for new model designs.
arXiv Detail & Related papers (2022-05-27T10:45:15Z) - XAI for Transformers: Better Explanations through Conservative Propagation [60.67748036747221]
We show that the gradient in a Transformer reflects the function only locally, and thus fails to reliably identify the contribution of input features to the prediction.
Our proposal can be seen as a proper extension of the well-established LRP method to Transformers.
arXiv Detail & Related papers (2022-02-15T10:47:11Z) - SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction [33.29925021875922]
We propose a novel Semantic-aware Decoupled Transformer Pyramid (SDTP) for dense image prediction, consisting of Intra-level Semantic Promotion (ISP), Cross-level Decoupled Interaction (CDI) and Attention Refinement Function (ARF)
ISP explores semantic diversity across different receptive spaces. CDI builds global attention and interaction among different levels in a decoupled space, which also alleviates the heavy computation problem.
Experimental results demonstrate the validity and generality of the proposed method, which outperforms the state-of-the-art by a significant margin in dense image prediction tasks.
arXiv Detail & Related papers (2021-09-18T16:29:14Z) - Transformers with Competitive Ensembles of Independent Mechanisms [97.93090139318294]
We propose a new Transformer layer which divides the hidden representation and parameters into multiple mechanisms, which only exchange information through attention.
We study TIM on a large-scale BERT model, on the Image Transformer, and on speech enhancement and find evidence for semantically meaningful specialization as well as improved performance.
arXiv Detail & Related papers (2021-02-27T21:48:46Z) - Dissecting Lottery Ticket Transformers: Structural and Behavioral Study of Sparse Neural Machine Translation [0.0]
Recent work on the lottery ticket hypothesis has produced highly sparse Transformers for NMT while maintaining BLEU.
By probing Transformers with more and more low-magnitude weights pruned away, we find that complex semantic information is first to be degraded.
Analysis of internal activations reveals that higher layers diverge most over the course of pruning, gradually becoming less complex than their dense counterparts.
arXiv Detail & Related papers (2020-09-17T02:08:45Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.