Carrying over algorithm in transformers
- URL: http://arxiv.org/abs/2401.07993v2
- Date: Wed, 17 Jan 2024 16:02:27 GMT
- Title: Carrying over algorithm in transformers
- Authors: Jorrit Kruthoff
- Abstract summary: The carrying over algorithm consists of two tasks: adding digits in the same position and carrying over a one whenever necessary.
We study how transformer models implement this algorithm and how the two aforementioned tasks are allocated to different parts of the network.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Addition is perhaps one of the simplest arithmetic tasks one can think of and
is usually performed using the carrying over algorithm. This algorithm consists
of two tasks: adding digits in the same position and carrying over a one
whenever necessary. We study how transformer models implement this algorithm
and how the two aforementioned tasks are allocated to different parts of the
network. We first focus on two-layer encoder-only models and show that the
carrying over algorithm is implemented in a modular fashion. The first layer is
mostly responsible for adding digits in the same position. The second layer
first decides, in the attention, which positions need a carried one or not, and
then performs the carrying of the one in the final MLP. We provide a simple way
of precisely identifying which neurons are responsible for that task. This
implementation of the carrying over algorithm occurs across a range of
hyperparameters for two- as well as three-layer models. For small decoder-only
models, we observe the same implementation and provide suggestive evidence for
its existence in three 7B large language models.
Related papers
- Algorithmic Capabilities of Random Transformers [49.73113518329544]
We investigate what functions can be learned by randomly initialized transformers in which only the embedding layers are optimized.
We find that these random transformers can perform a wide range of meaningful algorithmic tasks.
Our results indicate that some algorithmic capabilities are present in transformers even before these models are trained.
arXiv Detail & Related papers (2024-10-06T06:04:23Z)
- Deciphering Movement: Unified Trajectory Generation Model for Multi-Agent [53.637837706712794]
We propose a Unified Trajectory Generation model, UniTraj, that processes arbitrary trajectories as masked inputs.
Specifically, we introduce a Ghost Spatial Masking (GSM) module embedded within a Transformer encoder for spatial feature extraction.
We benchmark three practical sports game datasets, Basketball-U, Football-U, and Soccer-U, for evaluation.
arXiv Detail & Related papers (2024-05-27T22:15:23Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks [59.26515696183751]
We show that algorithm discovery in neural networks is sometimes more complex.
We show that even simple learning problems can admit a surprising diversity of solutions.
arXiv Detail & Related papers (2023-06-30T17:59:13Z)
- Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models [16.49601740473416]
We explore recipes to improve training efficiency by initializing one model from the other.
Using an encoder to warm-start seq2seq training, we show that we can match task performance of a from-scratch seq2seq model.
arXiv Detail & Related papers (2023-06-14T21:41:52Z)
- Towards Better Out-of-Distribution Generalization of Neural Algorithmic Reasoning Tasks [51.8723187709964]
We study the OOD generalization of neural algorithmic reasoning tasks.
The goal is to learn an algorithm from input-output pairs using deep neural networks.
arXiv Detail & Related papers (2022-11-01T18:33:20Z)
- Predicting Attention Sparsity in Transformers [0.9786690381850356]
We propose Sparsefinder, a model trained to identify the sparsity pattern of entmax attention before computing it.
Our work provides a new angle to study model efficiency by doing extensive analysis of the tradeoff between the sparsity and recall of the predicted attention graph.
arXiv Detail & Related papers (2021-09-24T20:51:21Z)
- Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)