Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure
- URL: http://arxiv.org/abs/2405.20671v2
- Date: Wed, 30 Oct 2024 16:50:43 GMT
- Title: Position Coupling: Improving Length Generalization of Arithmetic Transformers Using Task Structure
- Authors: Hanseul Cho, Jaeyoung Cha, Pranjal Awasthi, Srinadh Bhojanapalli, Anupam Gupta, Chulhee Yun
- Abstract summary: We propose position coupling, a simple yet effective method that embeds the structure of the tasks into the positional encoding of a Transformer.
We show that our models trained on 1 to 30-digit additions can generalize up to 200-digit additions.
We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
- Score: 42.89515104473087
- Abstract: Even for simple arithmetic tasks like integer addition, it is challenging for Transformers to generalize to longer sequences than those encountered during training. To tackle this problem, we propose position coupling, a simple yet effective method that directly embeds the structure of the tasks into the positional encoding of a (decoder-only) Transformer. Taking a departure from the vanilla absolute position mechanism assigning unique position IDs to each of the tokens, we assign the same position IDs to two or more "relevant" tokens; for integer addition tasks, we regard digits of the same significance as in the same position. On the empirical side, we show that with the proposed position coupling, our models trained on 1 to 30-digit additions can generalize up to 200-digit additions (6.67x of the trained length). On the theoretical side, we prove that a 1-layer Transformer with coupled positions can solve the addition task involving exponentially many digits, whereas any 1-layer Transformer without positional information cannot entirely solve it. We also demonstrate that position coupling can be applied to other algorithmic tasks such as Nx2 multiplication and a two-dimensional task.
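To make the coupling concrete, here is a minimal Python sketch (with hypothetical helper names) of how coupled position IDs could be assigned for an addition query: digits of the same significance in both operands and in the answer share one position ID, while separator tokens get a dedicated ID. This illustrates the idea only and simplifies the paper's exact scheme (e.g., padding conventions and training-time ID offsets).

```python
def coupled_position_ids(a: str, b: str, c: str):
    """Tokens and coupled position IDs for the sequence 'a+b=c'."""
    width = max(len(a), len(b), len(c))
    # Zero-pad so every number exposes a digit at every significance level.
    a, b, c = (s.zfill(width) for s in (a, b, c))

    tokens, pos_ids = [], []
    for number, sep in ((a, "+"), (b, "="), (c, None)):
        for i, digit in enumerate(number):
            tokens.append(digit)
            pos_ids.append(width - i)   # same significance -> same ID (1 = ones digit)
        if sep is not None:
            tokens.append(sep)
            pos_ids.append(0)           # separators share a dedicated ID
    return tokens, pos_ids


if __name__ == "__main__":
    tokens, ids = coupled_position_ids("653", "49", "702")
    print(list(zip(tokens, ids)))
    # [('6', 3), ('5', 2), ('3', 1), ('+', 0),
    #  ('0', 3), ('4', 2), ('9', 1), ('=', 0),
    #  ('7', 3), ('0', 2), ('2', 1)]
```

With this assignment, the ones digits of both operands and of the answer all carry position ID 1, regardless of how many digits the numbers have, which is the invariance the paper exploits for length generalization.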
Related papers
- Arithmetic Transformers Can Length-Generalize in Both Operand Length and Count [19.148785141454642]
Transformers often struggle with length generalization, meaning they fail to generalize to sequences longer than those encountered during training.
In this work, we achieve approximately 2-3x length generalization on both tasks, which is the first such achievement in arithmetic Transformers.
arXiv Detail & Related papers (2024-10-21T08:49:51Z) - Explicitly Encoding Structural Symmetry is Key to Length Generalization in Arithmetic Tasks [32.81985604969825]
We show that Transformers fail to generalize over length on basic arithmetic tasks such as addition and multiplication.
A major reason behind this failure is the vast difference in structure between numbers and text.
We propose to encode these semantics explicitly into the model via modified number formatting and custom positional encodings.
arXiv Detail & Related papers (2024-06-04T02:00:07Z) - Transformers Can Do Arithmetic with the Right Embeddings [75.66545271398704]
We show how to improve the performance of transformers on arithmetic tasks.
We find that by training on only 20-digit numbers with a single GPU for one day, we can reach state-of-the-art performance.
These gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
arXiv Detail & Related papers (2024-05-27T17:49:18Z) - Carrying over algorithm in transformers [0.0]
The carrying over algorithm consists of two tasks: adding digits in the same position and carrying over a one whenever necessary.
We study how transformer models implement this algorithm and how the two aforementioned tasks are allocated to different parts of the network.
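For concreteness, the two subtasks named above (summing digits of equal significance and propagating a carried one) amount to the schoolbook procedure sketched below in Python; this is an illustration of the algorithm being analyzed, not code from that paper.

```python
def add_with_carry(a: str, b: str) -> str:
    """Schoolbook addition: per-position digit sums plus carry propagation."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)

    out, carry = [], 0
    for da, db in zip(reversed(a), reversed(b)):  # least significant digits first
        s = int(da) + int(db) + carry
        out.append(str(s % 10))   # digit written in this position
        carry = s // 10           # the "one" carried to the next position
    if carry:
        out.append("1")
    return "".join(reversed(out))


assert add_with_carry("653", "49") == "702"
```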
arXiv Detail & Related papers (2024-01-15T22:36:11Z) - Positional Description Matters for Transformers Arithmetic [58.4739272381373]
Transformers often falter on arithmetic tasks despite their vast capabilities.
We propose several ways to fix the issue, either by modifying the positional encoding directly, or by modifying the representation of the arithmetic task to leverage standard positional encoding differently.
arXiv Detail & Related papers (2023-11-22T00:31:01Z) - LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction [31.895986544484206]
We introduce Layer-Adjustable Interactions in Transformers (LAIT).
Within LAIT, segmented inputs are first encoded independently, and then jointly.
We find that LAIT can reduce the attention FLOPs by 30-50% on many tasks while preserving high accuracy.
arXiv Detail & Related papers (2023-05-31T06:09:59Z) - Inducing Transformer's Compositional Generalization Ability via Auxiliary Sequence Prediction Tasks [86.10875837475783]
Systematic compositionality is an essential mechanism in human language, allowing the recombination of known parts to create novel expressions.
Existing neural models have been shown to lack this basic ability in learning symbolic structures.
We propose two auxiliary sequence prediction tasks that track the progress of function and argument semantics.
arXiv Detail & Related papers (2021-09-30T16:41:19Z) - Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z) - Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z)