On the Emergence of Position Bias in Transformers
- URL: http://arxiv.org/abs/2502.01951v1
- Date: Tue, 04 Feb 2025 02:53:07 GMT
- Title: On the Emergence of Position Bias in Transformers
- Authors: Xinyi Wu, Yifei Wang, Stefanie Jegelka, Ali Jadbabaie
- Abstract summary: This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention.
We quantify how tokens interact with contextual information based on their sequential positions.
Our framework offers a principled foundation for understanding positional biases in transformers.
- Score: 59.87743433861665
- Abstract: Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers -- coupled with the causal mask -- leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
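As a rough, self-contained illustration of the two effects described in the abstract (a minimal numpy sketch, not the paper's actual framework or code): treating each attention map as a row-stochastic matrix over the causal graph and multiplying maps across layers, in the spirit of attention rollout, makes the early-position bias visible; adding a distance-based decay shows the competing effect. The uniform weights, the rollout helper, and the decay rate gamma are all illustrative assumptions.
```python
import numpy as np

def causal_uniform_attention(n):
    """Uniform attention under the causal mask: token i attends
    equally to positions 0..i (row-stochastic, lower-triangular)."""
    A = np.tril(np.ones((n, n)))
    return A / A.sum(axis=1, keepdims=True)

def decayed_causal_attention(n, gamma=0.7):
    """Causal attention with a distance-based decay (illustrative,
    ALiBi-like): weight of position j for query i ~ gamma**(i - j)."""
    i, j = np.indices((n, n))
    W = np.where(j <= i, gamma ** (i - j), 0.0)
    return W / W.sum(axis=1, keepdims=True)

def rollout(A, num_layers):
    """Cumulative influence across layers (product of per-layer maps):
    entry (i, k) is how much input position k contributes to token i's
    representation after num_layers layers."""
    out = np.eye(len(A))
    for _ in range(num_layers):
        out = A @ out
    return out

n, depth = 8, 4
for name, A in [("uniform", causal_uniform_attention(n)),
                ("decayed", decayed_causal_attention(n))]:
    influence = rollout(A, depth).mean(axis=0)  # avg contribution per position
    print(name, np.round(influence, 3))
# Under the plain causal mask, cumulative mass piles up on early positions
# as depth grows; the decay mask pulls mass back toward recent positions,
# mirroring the trade-off described in the abstract.
```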
Related papers
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind these extreme-token phenomena (see the sketch after this entry).
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
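As a purely hypothetical illustration of the sink-token behaviour summarized in the entry above (not code from the paper): one simple diagnostic is to average the attention mass each key position receives across heads and queries and flag positions far above the uniform baseline. The find_sink_tokens helper and the 0.3 threshold are assumptions for illustration.
```python
import numpy as np

def find_sink_tokens(attn, threshold=0.3):
    """attn: (heads, seq, seq) row-stochastic attention maps.
    Flags key positions whose average received attention mass
    far exceeds the uniform baseline of 1/seq."""
    received = attn.mean(axis=0).mean(axis=0)  # avg mass per key position
    return np.where(received > threshold)[0]

# Toy head that dumps most of its attention on position 0 (a "sink").
seq = 6
A = np.full((1, seq, seq), 0.05)
A[0, :, 0] = 1.0 - 0.05 * (seq - 1)  # each row still sums to 1
print(find_sink_tokens(A))  # -> [0]
```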
- Benign Overfitting in Token Selection of Attention Mechanism [34.316270145027616]
We study the training dynamics and generalization ability of the attention mechanism under classification problems with label noise.
We show, via a characterization of the signal-to-noise ratio (SNR), that the token selection of the attention mechanism achieves benign overfitting.
Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting.
arXiv Detail & Related papers (2024-09-26T08:20:05Z)
- Unveiling and Controlling Anomalous Attention Distribution in Transformers [8.456319173083315]
The waiver phenomenon allows certain elements to absorb excess attention without affecting their contribution to the information flow.
We find that, depending on differences in positional encoding and attention patterns, models select waiver elements in one of two ways.
arXiv Detail & Related papers (2024-06-26T11:53:35Z)
- What Improves the Generalization of Graph Transformers? A Theoretical Dive into the Self-attention and Positional Encoding [67.59552859593985]
Graph Transformers, which incorporate self-attention and positional encoding, have emerged as a powerful architecture for various graph learning tasks.
This paper introduces the first theoretical investigation of a shallow Graph Transformer for semi-supervised classification.
arXiv Detail & Related papers (2024-06-04T05:30:16Z)
- Spatial-Temporal Knowledge-Embedded Transformer for Video Scene Graph Generation [64.85974098314344]
Video scene graph generation (VidSGG) aims to identify objects in visual scenes and infer their relationships for a given video.
Inherently, object pairs and their relationships enjoy spatial co-occurrence correlations within each image and temporal consistency/transition correlations across different images.
We propose a spatial-temporal knowledge-embedded transformer (STKET) that incorporates the prior spatial-temporal knowledge into the multi-head cross-attention mechanism.
arXiv Detail & Related papers (2023-09-23T02:40:28Z)
- Revisiting Over-smoothing in BERT from the Perspective of Graph [111.24636158179908]
Recently, the over-smoothing phenomenon of Transformer-based models has been observed in both the vision and language fields.
We find that layer normalization plays a key role in the over-smoothing issue of Transformer-based models.
We consider hierarchical fusion strategies that adaptively combine representations from different layers to make the output more diverse (see the sketch below).
arXiv Detail & Related papers (2022-02-17T12:20:52Z)
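A minimal sketch of one possible hierarchical fusion strategy in the spirit of the entry above, assuming softmax-weighted mixing of per-layer outputs; the function names and the weighting scheme are illustrative assumptions, not the paper's exact method.
```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_layers(layer_outputs, logits):
    """Adaptively combine per-layer representations with (learned, here
    simply given) softmax weights, so the output mixes shallow and deep
    features instead of collapsing to the over-smoothed top layer."""
    w = softmax(logits)                      # one weight per layer
    stacked = np.stack(layer_outputs)        # (num_layers, seq, dim)
    return np.tensordot(w, stacked, axes=1)  # (seq, dim)

layers = [np.random.randn(4, 8) for _ in range(3)]
fused = fuse_layers(layers, logits=np.array([0.2, 0.5, 1.0]))
print(fused.shape)  # (4, 8)
```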
- Ripple Attention for Visual Perception with Sub-quadratic Complexity [7.425337104538644]
Transformer architectures are now central to modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, the contribution of each token to a query is weighted by its relative spatial distance in the 2D space (see the sketch below).
arXiv Detail & Related papers (2021-10-06T02:00:38Z)
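A toy, quadratic-cost sketch of the distance-weighting idea summarized above (the actual ripple attention achieves sub-quadratic complexity; the additive logit-space penalty and the alpha rate here are illustrative assumptions).
```python
import numpy as np

def spatial_decay_attention(scores, coords, alpha=0.5):
    """Re-weight raw attention scores by relative 2D distance so that
    nearer tokens contribute more. scores: (n, n); coords: (n, 2)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    logits = scores - alpha * d  # penalize distant tokens in logit space
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# A 2x2 grid of visual tokens with identical content scores.
coords = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
attn = spatial_decay_attention(np.zeros((4, 4)), coords)
print(np.round(attn[0], 3))  # nearest neighbours dominate row 0
```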
- Multi-Head Attention: Collaborate Instead of Concatenate [85.71058762269374]
We propose a collaborative multi-head attention layer that enables heads to learn shared projections.
Experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation, and vision (see the sketch below).
arXiv Detail & Related papers (2020-06-29T20:28:52Z)
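A minimal sketch of the shared-projection idea summarized above, assuming each head re-weights a single shared query/key projection with a per-head mixing vector; the shapes and mixing scheme are assumptions based on the summary, not the paper's exact parameterization.
```python
import numpy as np

def collaborative_heads(X, Wq, Wk, mix_q, mix_k):
    """All heads share one query and one key projection, computed once;
    head h re-weights the shared dimensions with its mixing vectors.
    X: (n, d); Wq, Wk: (d, d_shared); mix_q, mix_k: (heads, d_shared)."""
    Q, K = X @ Wq, X @ Wk
    maps = []
    for mq, mk in zip(mix_q, mix_k):
        scores = (Q * mq) @ (K * mk).T / np.sqrt(len(mq))
        e = np.exp(scores - scores.max(axis=1, keepdims=True))
        maps.append(e / e.sum(axis=1, keepdims=True))
    return maps  # one attention map per head

rng = np.random.default_rng(0)
n, d, d_shared, heads = 5, 16, 8, 2
X = rng.standard_normal((n, d))
maps = collaborative_heads(X,
                           rng.standard_normal((d, d_shared)),
                           rng.standard_normal((d, d_shared)),
                           rng.standard_normal((heads, d_shared)),
                           rng.standard_normal((heads, d_shared)))
print(len(maps), maps[0].shape)  # 2 heads, each (5, 5)
```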
- Telling BERT's full story: from Local Attention to Global Aggregation [14.92157586545743]
We take a deep look into the behavior of self-attention heads in the transformer architecture.
We show that attention distributions, while not the full story, still provide insights into the local behavior of attention heads.
arXiv Detail & Related papers (2020-04-10T01:36:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.