Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
- URL: http://arxiv.org/abs/2411.12118v1
- Date: Mon, 18 Nov 2024 23:12:13 GMT
- Title: Mechanism and Emergence of Stacked Attention Heads in Multi-Layer Transformers
- Authors: Tiberiu Musat
- Abstract summary: I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers.
I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning.
I find that successful learning occurs only in the presence of an implicit curriculum.
- Abstract: In this paper, I introduce the retrieval problem, a simple reasoning task that can be solved only by transformers with a minimum number of layers. The task has an adjustable difficulty that can further increase the required number of layers to an arbitrary value. I demonstrate that large language models can solve the task under different prompting formulations without any fine-tuning. To understand how transformers solve the retrieval problem, I train several transformers on a minimal formulation. I find that successful learning occurs only in the presence of an implicit curriculum. I uncover the learned mechanisms by studying the attention maps in the trained transformers. I also study the training process, uncovering that attention heads always emerge in a specific sequence.
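To make the task concrete, here is a minimal sketch of how a multi-hop retrieval instance might be generated. The key:value token format, the symbol vocabulary, and the function name are assumptions for illustration; the paper's minimal formulation may differ.

```python
import random

def make_retrieval_instance(num_symbols=7, hops=2, vocab=tuple("ABCDEFGHJK")):
    """Hypothetical instance of the retrieval problem: chained key:value
    pairs plus distractors; answering the query requires `hops` lookups,
    which is the adjustable difficulty knob described in the abstract."""
    symbols = random.sample(vocab, num_symbols)
    # Chain the first hops+1 symbols: s0 -> s1 -> ... -> s_hops.
    pairs = [(symbols[i], symbols[i + 1]) for i in range(hops)]
    rest = symbols[hops + 1:]
    while len(rest) >= 2:                      # unrelated distractor pairs
        pairs.append((rest.pop(), rest.pop()))
    random.shuffle(pairs)
    prompt = " ".join(f"{k}:{v}" for k, v in pairs) + f" query:{symbols[0]}"
    return prompt, symbols[hops]               # answer after following the chain

prompt, answer = make_retrieval_instance(hops=2)
print(prompt, "->", answer)
```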
Related papers
- One-Layer Transformer Provably Learns One-Nearest Neighbor In Context [48.4979348643494]
We study the capability of one-layer transformers to learn the one-nearest-neighbor rule.
A single softmax attention layer can successfully learn to behave like a one-nearest-neighbor classifier.
arXiv Detail & Related papers (2024-11-16T16:12:42Z)
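The mechanism behind this result is easy to emulate. Below is a minimal NumPy sketch, assuming attention logits proportional to negative squared distances (a dot-product head on suitably normalized inputs behaves the same way): as the inverse temperature grows, softmax attention hard-selects the nearest stored example.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def one_nn_attention(query, keys, values, beta=50.0):
    """Softmax attention with negative squared distances as logits.
    As beta grows, the weights concentrate on the nearest key, so the
    layer outputs the value of the nearest neighbor (1-NN behavior)."""
    logits = -beta * np.sum((keys - query) ** 2, axis=1)
    weights = softmax(logits)
    return weights @ values

# Toy in-context 1-NN: class labels live in the values.
keys = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
values = np.array([[-1.0], [1.0], [-1.0]])
print(one_nn_attention(np.array([0.9, 1.1]), keys, values))  # ~ +1
```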
- In-Context Learning with Representations: Contextual Generalization of Trained Transformers [66.78052387054593]
In-context learning (ICL) refers to the capability of pretrained large language models to learn a new task from a few examples given at inference time.
This paper investigates the training dynamics of transformers trained by gradient descent through the lens of non-linear regression tasks.
arXiv Detail & Related papers (2024-08-19T16:47:46Z)
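For readers unfamiliar with the setup, the standard in-context regression framing (an assumption about this paper's exact construction) interleaves example pairs (x_i, f(x_i)) and ends with a query point; the model must output f(x_query).

```python
import numpy as np

def make_icl_regression_prompt(k=8, d=4, rng=np.random.default_rng(0)):
    """Standard in-context regression setup: k example pairs (x_i, f(x_i))
    followed by a query x. Here f is a fixed nonlinear function drawn once
    per prompt; tanh is one illustrative choice of nonlinearity."""
    w = rng.normal(size=d)
    f = lambda x: np.tanh(x @ w)
    xs = rng.normal(size=(k + 1, d))
    ys = f(xs)
    # Interleave tokens: x_1, y_1, ..., x_k, y_k, x_query; target is y_query.
    prompt = [t for i in range(k) for t in (xs[i], ys[i])] + [xs[k]]
    return prompt, ys[k]

prompt, target = make_icl_regression_prompt()
print(len(prompt), target)
```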
- When Can Transformers Count to n? [48.32323039293186]
We show that this task can be solved if the dimension of the transformer state is linear in the context length.
We provide theoretical arguments for why it is likely impossible for a size-limited transformer to implement this task.
Our results demonstrate the importance of understanding how transformers can solve simple tasks.
arXiv Detail & Related papers (2024-07-21T13:31:02Z)
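A minimal sketch of a counting mechanism, under the common assumption that a head attends uniformly and reads out an indicator value: attention returns count/length, and recovering the exact count means resolving gaps of 1/n, which is where the size and precision limits bite.

```python
import numpy as np

def count_via_attention(tokens, target):
    """Uniform attention over the context with an indicator value vector
    returns count/length; rescaling by the (known) length recovers the
    count. Distinguishing adjacent counts requires resolving differences
    of 1/len(tokens) -- the precision bottleneck the paper analyzes."""
    values = (np.asarray(tokens) == target).astype(float)
    weights = np.full(len(tokens), 1.0 / len(tokens))  # uniform attention
    fraction = weights @ values
    return fraction * len(tokens)

print(count_via_attention([1, 3, 3, 7, 3, 2], target=3))  # 3.0
```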
- What Can Transformer Learn with Varying Depth? Case Studies on Sequence Learning Tasks [15.874604623294427]
We show that a transformer with only one attention layer can excel at memorization but falls short on other tasks.
We identify a class of simple operations that a single attention layer can execute, and show that complex tasks can be approached as combinations of these simple operations.
arXiv Detail & Related papers (2024-04-02T02:45:12Z)
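One illustrative reading of this decomposition (a choice of "simple operation" for this sketch, not necessarily the paper's): a single attention layer can implement a hard key-value lookup, and stacking two such lookups performs a two-hop retrieval.

```python
import numpy as np

def hard_lookup(queries, keys, values):
    """One 'simple operation': for each query, copy the value stored at
    the best-matching key (argmax attention). A single attention layer
    can realize this; composing two lookups gives a two-hop retrieval."""
    idx = np.argmax(queries @ keys.T, axis=1)
    return values[idx]

# Two-hop lookup as a composition of two one-layer operations.
keys = np.eye(4)                  # keys for symbols 0..3
values = np.eye(4)[[2, 3, 1, 0]]  # symbol i maps to values[i]
q = np.eye(4)[[0]]                # start at symbol 0
hop1 = hard_lookup(q, keys, values)      # 0 -> 2
hop2 = hard_lookup(hop1, keys, values)   # 2 -> 1
print(hop1.argmax(), hop2.argmax())
```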
- How Transformers Learn Causal Structure with Gradient Descent [44.31729147722701]
Self-attention allows transformers to encode causal structure.
We introduce an in-context learning task that requires learning latent causal structure.
We show that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
arXiv Detail & Related papers (2024-02-22T17:47:03Z)
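Papers in this line typically uncover generalized induction heads; the toy estimator below emulates that mechanism for an in-context Markov chain, though this paper's exact construction may differ.

```python
import numpy as np

def induction_head_predict(context, vocab_size):
    """Induction-head-style estimator: to predict the token after
    context[-1], look at every earlier position whose token matches
    context[-1] and average (copy) the tokens that followed it.
    This recovers the in-context Markov transition row."""
    cur = context[-1]
    counts = np.zeros(vocab_size)
    for t in range(len(context) - 1):
        if context[t] == cur:            # 'previous-token match' pattern
            counts[context[t + 1]] += 1  # copy the successor
    return counts / counts.sum() if counts.sum() else counts

print(induction_head_predict([0, 1, 0, 2, 0, 1, 0], vocab_size=3))
```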
- Eureka-Moments in Transformers: Multi-Step Tasks Reveal Softmax Induced Optimization Problems [27.681141346132286]
We study rapid improvements of the training loss in transformers when they are confronted with multi-step decision tasks.
We use synthetic tasks to study the problem in detail, but the leaps in performance can also be observed in language modeling and in-context learning.
We find connections between the settings and show that methods that improve learning on the synthetic multi-step tasks can also improve the training of language models and ICL.
arXiv Detail & Related papers (2023-10-19T17:55:06Z)
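A toy illustration of one softmax-induced obstacle (a simplification of the phenomenon named in the title): once attention saturates on the wrong position, the gradient reaching the logits scales with the near-zero attention weights elsewhere, so the loss stays high while the gradient nearly vanishes.

```python
import torch

# As `scale` grows, attention saturates on index 0 even though the useful
# value sits at index 1: the loss plateaus near 1 while the gradient norm
# collapses -- the flavor of optimization problem the paper studies.
values = torch.tensor([0.0, 1.0, 0.0])
for scale in [0.0, 2.0, 5.0, 10.0]:
    logits = torch.tensor([scale, 0.0, 0.0], requires_grad=True)
    out = torch.softmax(logits, dim=0) @ values   # attention readout
    loss = (out - 1.0) ** 2                       # wants attention on index 1
    loss.backward()
    print(f"scale={scale:5.1f}  loss={loss.item():.4f}  "
          f"grad norm={logits.grad.norm().item():.6f}")
```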
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer encoder in the form of a programming language, RASP.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
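Below is a schematic NumPy rendering of RASP's core primitives, not the official interpreter: `select` builds a boolean attention pattern, `aggregate` averages values over it, and `selector_width` counts selected positions, which directly yields the histogram program.

```python
import numpy as np

def select(keys, queries, predicate):
    """Schematic RASP `select`: pattern A[i, j] = predicate(key_j, query_i)."""
    return np.array([[predicate(k, q) for k in keys] for q in queries],
                    dtype=float)

def aggregate(selector, values, default=0.0):
    """Schematic RASP `aggregate`: average values over selected positions."""
    width = selector.sum(axis=1)
    return np.where(width > 0, selector @ values / np.maximum(width, 1),
                    default)

def selector_width(selector):
    """RASP `selector_width`: number of selected positions per query."""
    return selector.sum(axis=1)

tokens = list("hello")
positions = np.arange(len(tokens))

# Histogram: count occurrences of each token.
same_tok = select(tokens, tokens, lambda k, q: k == q)
print(selector_width(same_tok))                   # [1. 1. 2. 2. 1.]

# Shift: copy the previous position's value (a sorting/Dyck building block).
prev = select(positions, positions, lambda k, q: k == q - 1)
print(aggregate(prev, positions.astype(float)))   # [0. 0. 1. 2. 3.]
```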
- Multi-branch Attentive Transformer [152.07840447196384]
We propose a simple yet effective variant of Transformer called multi-branch attentive Transformer.
The attention layer is the average of multiple branches and each branch is an independent multi-head attention layer.
Experiments on machine translation, code generation and natural language understanding demonstrate that such a simple variant of Transformer brings significant improvements.
arXiv Detail & Related papers (2020-06-18T04:24:28Z)
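A minimal PyTorch sketch of the averaging idea as stated in the summary; the dimensions, branch count, and any regularization tricks from the paper are simplifications or omissions.

```python
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    """Sketch of the idea as described above: the attention sublayer is
    the average of several independent multi-head attention branches."""
    def __init__(self, d_model=64, n_heads=4, n_branches=3):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for _ in range(n_branches)
        )

    def forward(self, x):
        # Each branch runs self-attention independently; average the outputs.
        outs = [b(x, x, x, need_weights=False)[0] for b in self.branches]
        return torch.stack(outs).mean(dim=0)

x = torch.randn(2, 10, 64)              # (batch, seq, d_model)
print(MultiBranchAttention()(x).shape)  # torch.Size([2, 10, 64])
```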
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head in each encoder layer with simple, fixed (non-learnable) attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
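A sketch of what such fixed, non-learnable patterns can look like, assuming simple positional offsets (previous token, self, next token); the paper's exact set of patterns may differ.

```python
import numpy as np

def fixed_pattern(n, offset):
    """Fixed, non-learnable attention head: position i attends to
    i + offset (clipped to the sequence), e.g. offset=-1 is a
    'previous token' head and offset=0 attends to itself."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, min(max(i + offset, 0), n - 1)] = 1.0
    return A

n, d = 5, 8
values = np.random.randn(n, d)
# Three fixed heads applied to the same values; such heads cost no
# attention parameters, while one learnable head per layer remains.
heads = {off: fixed_pattern(n, off) @ values for off in (-1, 0, 1)}
print({off: h.shape for off, h in heads.items()})
```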
This list is automatically generated from the titles and abstracts of the papers on this site.