Related papers: Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer

URL: http://arxiv.org/abs/2506.01115v2
Date: Thu, 12 Jun 2025 20:40:32 GMT
Title: Attention Retrieves, MLP Memorizes: Disentangling Trainable Components in the Transformer
Authors: Yihe Dong, Lorenzo Noci, Mikhail Khodak, Mufan Li,
Abstract summary: The Transformer architecture is central to the success of modern Large Language Models.<n>While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it.
Score: 19.36946128510059
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Transformer architecture is central to the success of modern Large Language Models (LLMs), in part due to its surprising ability to perform a wide range of algorithmic tasks -- including mathematical reasoning, memorization, and retrieval -- using only gradient-based training on next-token prediction. While the core component of a Transformer is the self-attention mechanism, we question how much, and which aspects, of the performance gains can be attributed to it. To this end, we compare standard Transformers to variants in which either the multi-layer perceptron (MLP) layers or the attention projectors (queries and keys) are frozen at initialization. To further isolate the contribution of attention, we introduce MixiT -- the Mixing Transformer -- a simplified, principled model in which the attention coefficients are entirely random and fixed at initialization, eliminating any input-dependent computation or learning in attention. Surprisingly, we find that MixiT matches the performance of fully trained Transformers on various algorithmic tasks, especially those involving basic arithmetic or focusing heavily on memorization. For retrieval-based tasks, we observe that having input-dependent attention coefficients is consistently beneficial, while MixiT underperforms. We attribute this failure to its inability to form specialized circuits such as induction heads -- a specific circuit known to be crucial for learning and exploiting repeating patterns in input sequences. Even more interestingly, we find that attention with frozen key and query projectors is not only able to form induction heads, but can also perform competitively on language modeling. Our results underscore the importance of architectural heterogeneity, where distinct components contribute complementary inductive biases crucial for solving different classes of tasks.

Related papers

Provable In-Context Learning of Nonlinear Regression with Transformers [58.018629320233174]
In-context learning (ICL) is the ability to perform unseen tasks using task-specific prompts without updating parameters.<n>Recent research has actively explored the training dynamics behind ICL.<n>This paper investigates more complex nonlinear regression tasks, aiming to uncover how transformers acquire in-context learning capabilities.
arXiv Detail & Related papers (2025-07-28T00:09:28Z)
On the Existence of Universal Simulators of Attention [17.01811978811789]
We present solutions to identically replicate attention outputs and the underlying elementary matrix and activation operations via RASP.<n>Our proofs, for the first time, show the existence of an algorithmically achievable data-agnostic solution, previously known to be approximated only by learning.
arXiv Detail & Related papers (2025-06-23T15:15:25Z)
Transformers as Unsupervised Learning Algorithms: A study on Gaussian Mixtures [10.970776446566909]
This paper investigates the capabilities of transformers in solving unsupervised learning problems.<n>We propose a transformer-based learning framework called TGMM that simultaneously learns to solve multiple GMM tasks.<n>We prove that transformers can approximate both the EM algorithm and a core component of spectral methods.
arXiv Detail & Related papers (2025-05-17T09:02:18Z)
How Transformers Learn Regular Language Recognition: A Theoretical Study on Training Dynamics and Implicit Bias [48.9399496805422]
We focus on two representative tasks in the category of regular language recognition, known as even pairs' and parity check'<n>Our goal is to explore how a one-layer transformer, consisting of an attention layer followed by a linear layer, learns to solve these tasks.
arXiv Detail & Related papers (2025-05-02T00:07:35Z)
On-Chip Learning via Transformer In-Context Learning [0.9353041869660692]
Self-attention mechanism requires transferring prior token projections from the main memory at each time step. We present a neuromorphic decoder-only transformer model that utilizes an on-chip plasticity processor to compute self-attention.
arXiv Detail & Related papers (2024-10-11T10:54:09Z)
DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision. The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
Sharing Key Semantics in Transformer Makes Efficient Image Restoration [148.22790334216117]
Self-attention mechanism, a cornerstone of Vision Transformers (ViTs) tends to encompass all global cues.<n>Small segments of a degraded image, particularly those closely aligned semantically, provide particularly relevant information to aid in the restoration process.<n>We propose boosting IR's performance by sharing the key semantics via Transformer for IR (ie, SemanIR) in this paper.
arXiv Detail & Related papers (2024-05-30T12:45:34Z)
Adaptive Fourier Neural Operators: Efficient Token Mixers for Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain. AFNO is based on a principled foundation of operator learning. It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z)
SparseBERT: Rethinking the Importance Analysis in Self-attention [107.68072039537311]
Transformer-based models are popular for natural language processing (NLP) tasks due to its powerful capacity. Attention map visualization of a pre-trained model is one direct method for understanding self-attention mechanism. We propose a Differentiable Attention Mask (DAM) algorithm, which can be also applied in guidance of SparseBERT design.
arXiv Detail & Related papers (2021-02-25T14:13:44Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.