Related papers: On the Role of Attention Masks and LayerNorm in Transformers

URL: http://arxiv.org/abs/2405.18781v1
Date: Wed, 29 May 2024 05:41:28 GMT
Title: On the Role of Attention Masks and LayerNorm in Transformers
Authors: Xinyi Wu, Amir Ajorlou, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
Abstract summary: Self-attention is the key mechanism of transformers. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse.
Score: 55.81177251872377
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, through construction of nontrivial counterexamples, we then establish that with proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than what was originally thought.
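Illustration: the collapse described in the abstract can be observed in a toy setting. The NumPy sketch below is not the paper's code, and it makes strong simplifying assumptions: a single head, an identity value matrix, freshly sampled random query/key weights at every layer, and no LayerNorm or residual connections. The helper names (`token_spread`, `attention_step`, `run`) and the spread measure itself are hypothetical choices made for this example only.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax, shifted by the row maximum for numerical stability.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def token_spread(X):
    # Largest per-feature gap between any two tokens (rows of X).
    # A value of 0 means all tokens are identical, i.e. X has rank at most one.
    return float((X.max(axis=0) - X.min(axis=0)).max())

def attention_step(X, W_q, W_k, mask):
    # One layer of pure masked self-attention with an identity value matrix
    # (an assumption of this sketch, not the general setting of the paper).
    scores = (X @ W_q) @ (X @ W_k).T / np.sqrt(W_q.shape[1])
    scores = np.where(mask, scores, -np.inf)  # masked-out pairs get zero weight
    return softmax(scores) @ X

def run(mask, n_layers=30, dim=4, seed=0):
    # Track the token spread across layers for a given boolean attention mask.
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(mask.shape[0], dim))
    spreads = [token_spread(X)]
    for _ in range(n_layers):
        W_q = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        W_k = rng.normal(size=(dim, dim)) / np.sqrt(dim)
        X = attention_step(X, W_q, W_k, mask)
        spreads.append(token_spread(X))
    return spreads

n = 8
idx = np.arange(n)
full_mask = np.ones((n, n), dtype=bool)                       # every token sees every token
causal_mask = idx[:, None] >= idx[None, :]                    # token i sees tokens 0..i
local_mask = causal_mask & (idx[:, None] - idx[None, :] < 2)  # token i sees itself and i-1

for name, mask in [("full", full_mask), ("causal", causal_mask), ("local", local_mask)]:
    s = run(mask)
    print(f"{name:>6}: spread {s[0]:.3f} -> {s[-1]:.3e}")
```

Under these assumptions each layer replaces every token with a convex combination of the tokens it is allowed to attend to, so the spread cannot grow from one layer to the next. How quickly it shrinks under each mask depends on the random draw, and the sketch says nothing about the LayerNorm case analyzed in the paper.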

Related papers

On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. We quantify how tokens interact with contextual information based on their sequential positions. Our framework offers a principled foundation for understanding positional biases in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z)
Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection [58.87142367781417]
A naively trained detector tends to favor overfitting to the limited and monotonous fake patterns, causing the feature space to become highly constrained and low-ranked. One potential remedy is incorporating the pre-trained knowledge within the vision foundation models to expand the feature space. By freezing the principal components and adapting only the remained components, we preserve the pre-trained knowledge while learning fake patterns.
arXiv Detail & Related papers (2024-11-23T19:10:32Z)
Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models. These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights. We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
Lambda-Skip Connections: the architectural component that prevents Rank Collapse [3.0411373811598112]
This paper extends the theory of rank collapse from transformers to State Space Models (SSMs). We study how a parametrized version of the classic skip connection component, which we call lambda-skip connections, provides guarantees for rank collapse prevention. To our knowledge, this is the first study that provides a general guarantee to prevent rank collapse, and that investigates rank collapse in the context of SSMs.
arXiv Detail & Related papers (2024-10-14T15:16:33Z)
Mind the Gap: a Spectral Analysis of Rank Collapse and Signal Propagation in Transformers [3.686808512438363]
This paper examines signal propagation in attention-only transformers from a random matrix perspective. We show that a spectral gap between the two largest singular values of the attention matrix causes rank collapse in width. We propose a novel, yet simple, practical solution to resolve rank collapse in width by removing the spectral gap.
arXiv Detail & Related papers (2024-10-10T10:34:18Z)
From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients [86.40635601953446]
We study the emergence of low-rank structures across different layers of Modern Large Language Models. We present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE.
arXiv Detail & Related papers (2024-07-15T21:05:20Z)
Self-attention Networks Localize When QK-eigenspectrum Concentrates [9.379890125442335]
Self-attention mechanism prevails in modern machine learning. Two arguments have connected attention localization to the model performances. We show that a small eigenspectrum variance leads attention to be localized.
arXiv Detail & Related papers (2024-02-03T09:35:53Z)
Which Features are Learnt by Contrastive Learning? On the Role of Simplicity Bias in Class Collapse and Feature Suppression [59.97965005675144]
Contrastive learning (CL) has emerged as a powerful technique for representation learning, with or without label supervision. We provide the first unified theoretically rigorous framework to determine which features are learnt by CL. We present increasing embedding dimensionality and improving the quality of data augmentations as two theoretically motivated solutions.
arXiv Detail & Related papers (2023-05-25T23:37:22Z)
Stabilizing Transformer Training by Preventing Attention Entropy Collapse [56.45313891694746]
We investigate the training dynamics of Transformers by examining the evolution of the attention layers. We show that $\sigma$Reparam successfully prevents entropy collapse in the attention layers, promoting more stable training. We conduct experiments with $\sigma$Reparam on image classification, image self-supervised learning, machine translation, speech recognition, and language modeling tasks.
arXiv Detail & Related papers (2023-03-11T03:30:47Z)
Signal Propagation in Transformers: Theoretical Perspectives and the Role of Rank Collapse [11.486545294602697]
We shed new light on the causes and effects of rank collapse in Transformers. We show that rank collapse of the tokens' representations hinders training by causing the gradients of the queries and keys to vanish.
arXiv Detail & Related papers (2022-06-07T09:07:24Z)
Attention is Not All You Need: Pure Attention Loses Rank Doubly Exponentially with Depth [48.16156149749371]
This work proposes a new way to understand self-attention networks. We show that their output can be decomposed into a sum of smaller terms. We prove that self-attention possesses a strong inductive bias towards "token uniformity".
arXiv Detail & Related papers (2021-03-05T00:39:05Z)

This list is automatically generated from the titles and abstracts of the papers on this site.