Benign Overfitting in Token Selection of Attention Mechanism
- URL: http://arxiv.org/abs/2409.17625v2
- Date: Fri, 31 Jan 2025 13:09:22 GMT
- Title: Benign Overfitting in Token Selection of Attention Mechanism
- Authors: Keitaro Sakamoto, Issei Sato
- Abstract summary: We study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise.
We show that, with a characterization based on the signal-to-noise ratio (SNR), token selection in the attention mechanism achieves benign overfitting.
Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting.
- Score: 34.316270145027616
- Abstract: The attention mechanism is a fundamental component of the transformer model and plays a significant role in its success. However, the theoretical understanding of how attention learns to select tokens is still an emerging area of research. In this work, we study the training dynamics and generalization ability of the attention mechanism in classification problems with label noise. We show that, with a characterization based on the signal-to-noise ratio (SNR), token selection in the attention mechanism achieves benign overfitting, i.e., it maintains high generalization performance despite fitting label noise. Our work also demonstrates an interesting delayed acquisition of generalization after an initial phase of overfitting. Finally, we provide experiments that support our theoretical analysis, using both synthetic and real-world datasets.
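To make the setting concrete, here is a minimal PyTorch sketch of a token-selection experiment in this spirit: each input sequence carries one class-dependent signal token among noise tokens, a trainable attention probe selects tokens, a fraction of training labels is flipped, and accuracy is tracked on both the noisy training labels and clean test labels. The dimensions, noise scale (which sets the SNR), and flip rate are illustrative assumptions, not the authors' exact setup, and the run is not guaranteed to reproduce the full delayed-generalization curve.

    # Minimal sketch (assumed setup, not the authors' code): token selection
    # by a single attention probe under label noise.
    import torch

    torch.manual_seed(0)
    d, T, n, flip_rate, noise_std = 64, 8, 200, 0.15, 0.4

    mu_pos = torch.randn(d); mu_pos /= mu_pos.norm()   # class +1 signal token
    mu_neg = torch.randn(d); mu_neg /= mu_neg.norm()   # class -1 signal token
    v = mu_pos - mu_neg                                # fixed linear head

    def make_data(n):
        y_clean = (torch.randint(0, 2, (n,)) * 2 - 1).float()        # +/-1 labels
        X = noise_std * torch.randn(n, T, d)                         # noise tokens
        X[:, 0, :] += torch.where(y_clean[:, None] > 0, mu_pos, mu_neg)
        y = y_clean * torch.where(torch.rand(n) < flip_rate, -1.0, 1.0)  # flips
        return X, y, y_clean

    X, y, _ = make_data(n)
    Xte, _, yte = make_data(2000)            # test accuracy uses clean labels

    p = torch.zeros(d, requires_grad=True)   # trainable attention probe/query

    def score(X):
        attn = torch.softmax(X @ p, dim=1)            # (n, T) token weights
        pooled = torch.einsum("nt,ntd->nd", attn, X)  # attention-weighted sum
        return pooled @ v

    opt = torch.optim.SGD([p], lr=1.0)
    for step in range(3001):
        loss = torch.nn.functional.softplus(-y * score(X)).mean()  # logistic loss
        opt.zero_grad(); loss.backward(); opt.step()
        if step % 500 == 0:
            with torch.no_grad():
                tr = (score(X).sign() == y).float().mean()      # fit to noisy labels
                te = (score(Xte).sign() == yte).float().mean()  # clean test accuracy
            print(f"step {step:4d}  train(noisy) {tr:.2f}  test(clean) {te:.2f}")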
Related papers
- On the Emergence of Position Bias in Transformers [59.87743433861665]
This paper introduces a novel graph-theoretic framework to analyze position bias in multi-layer attention.
We quantify how tokens interact with contextual information based on their sequential positions.
Our framework offers a principled foundation for understanding positional biases in transformers.
arXiv Detail & Related papers (2025-02-04T02:53:07Z)
- Active-Dormant Attention Heads: Mechanistically Demystifying Extreme-Token Phenomena in LLMs [77.66717051042032]
Practitioners have consistently observed three puzzling phenomena in transformer-based large language models.
These phenomena are characterized by certain so-called "sink tokens" receiving disproportionately high attention weights.
We elucidate the mechanisms behind extreme-token phenomena.
arXiv Detail & Related papers (2024-10-17T17:54:06Z)
- CAFO: Feature-Centric Explanation on Time Series Classification [6.079474513317929]
Current explanation methods for multivariate time series (MTS) mostly focus on time-centric explanations, which are apt for pinpointing important time periods but less effective in identifying key features.
Our study introduces a novel feature-centric explanation and evaluation framework for MTS, named CAFO.
Our framework's efficacy is validated through extensive empirical analyses on two major public benchmarks and real-world datasets.
arXiv Detail & Related papers (2024-06-03T23:06:45Z)
- A phase transition between positional and semantic learning in a solvable model of dot-product attention [30.96921029675713]
A non-linear model of dot-product attention is studied as a self-attention layer with trainable, tied, and low-rank query and key matrices.
We show that the model learns either a positional attention mechanism (with tokens attending to each other based on their respective positions) or a semantic attention mechanism (with tokens attending to each other based on their meaning), and that it exhibits a transition from the former to the latter with increasing sample complexity.
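The positional/semantic distinction can be made concrete with a toy contrast: a head whose scores are built only from position encodings versus a tied, low-rank head whose scores come from token content. This is an illustrative sketch under assumed dimensions, not the paper's solvable model.

    # Toy contrast between positional and semantic attention (illustrative).
    import torch

    torch.manual_seed(1)
    T, d, r = 5, 16, 4
    X = torch.randn(T, d)                      # token embeddings (content)
    pos = torch.eye(T)                         # one-hot positional encodings

    # Positional head: scores depend only on positions, not on X.
    W_pos = torch.randn(T, T)
    A_positional = torch.softmax(pos @ W_pos @ pos.T, dim=-1)

    # Semantic head: tied low-rank query/key map Q = K = X @ W, so scores
    # depend on dot products of token contents.
    W = torch.randn(d, r) / d**0.5             # tied, low-rank query/key weights
    Q = X @ W
    A_semantic = torch.softmax(Q @ Q.T, dim=-1)

    # Shuffling tokens moves the semantic pattern with the tokens, while a
    # purely positional pattern stays fixed to the positions.
    perm = torch.randperm(T)
    A_shuf = torch.softmax((X[perm] @ W) @ (X[perm] @ W).T, dim=-1)
    print(torch.allclose(A_shuf, A_semantic[perm][:, perm]))  # True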
arXiv Detail & Related papers (2024-02-06T11:13:54Z)
- Naturalness of Attention: Revisiting Attention in Code Language Models [3.756550107432323]
Language models for code, such as CodeBERT, offer the capability to learn advanced source code representations, but their opacity poses barriers to understanding the properties they capture.
This study aims to shed some light on the previously ignored factors of the attention mechanism beyond the attention weights.
arXiv Detail & Related papers (2023-11-22T16:34:12Z)
- A Theoretical Understanding of Shallow Vision Transformers: Learning, Generalization, and Sample Complexity [71.11795737362459]
ViTs with self-attention modules have recently achieved great empirical success in many tasks.
However, a theoretical analysis of their learning and generalization remains elusive.
This paper provides the first theoretical analysis of a shallow ViT for a classification task.
arXiv Detail & Related papers (2023-02-12T22:12:35Z)
- Guiding Visual Question Answering with Attention Priors [76.21671164766073]
We propose to guide the attention mechanism using explicit linguistic-visual grounding.
This grounding is derived by connecting structured linguistic concepts in the query to their referents among the visual objects.
The resultant algorithm is capable of probing attention-based reasoning models, injecting relevant associative knowledge, and regulating the core reasoning process.
arXiv Detail & Related papers (2022-05-25T09:53:47Z)
- SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to their powerful capacity.
Visualizing the attention maps of a pre-trained model is one direct method for understanding the self-attention mechanism.
We propose a Differentiable Attention Mask (DAM) algorithm, which can also be applied to guide the design of SparseBERT.
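As a rough illustration of a differentiable attention mask, the sketch below gates attention scores with trainable sigmoid logits plus an L1 sparsity penalty; this is a common relaxation chosen for illustration and is not claimed to be SparseBERT's exact DAM algorithm.

    # Generic differentiable attention-mask sketch (assumed relaxation, not
    # necessarily the paper's DAM): trainable gates scale attention scores
    # and an L1 penalty drives the mask toward sparsity.
    import torch

    T = 8
    scores = torch.randn(T, T)                     # raw attention scores
    mask_logits = torch.zeros(T, T, requires_grad=True)

    def masked_attention(scores, mask_logits, l1_weight=1e-2):
        gate = torch.sigmoid(mask_logits)          # soft mask in (0, 1)
        attn = torch.softmax(scores + torch.log(gate + 1e-9), dim=-1)
        return attn, l1_weight * gate.sum()        # attention + sparsity penalty

    attn, penalty = masked_attention(scores, mask_logits)
    task_loss = (attn * torch.randn(T, T)).sum()   # stand-in for a real task loss
    (task_loss + penalty).backward()               # gradients reach mask_logits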
arXiv Detail & Related papers (2021-02-25T14:13:44Z)
- Attention or memory? Neurointerpretable agents in space and time [0.0]
We design a model incorporating a self-attention mechanism that implements task-state representations in semantic feature-space.
To evaluate the agent's selective properties, we add a large volume of task-irrelevant features to observations.
In line with neuroscience predictions, self-attention leads to increased robustness to noise compared to benchmark models.
arXiv Detail & Related papers (2020-07-09T15:04:26Z)