Two Heads Are Better than One: Simulating Large Transformers with Small Ones
- URL: http://arxiv.org/abs/2506.12220v2
- Date: Thu, 19 Jun 2025 02:24:00 GMT
- Title: Two Heads Are Better than One: Simulating Large Transformers with Small Ones
- Authors: Hantao Yu, Josh Alman
- Abstract summary: We show that transformers with long input sequences can be efficiently simulated by transformers that can only take short input sequences. We then prove that in various natural scenarios, including average-case inputs, sliding window masking, and attention sinks, the optimal number $O(N/M)$ of small transformers suffice.
- Score: 3.514389461266844
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic complexity of self-attention prevents transformers from scaling effectively to long input sequences. On the other hand, modern GPUs and other specialized hardware accelerators are well-optimized for processing small input sequences in transformers during both training and inference. A natural question arises: can we take advantage of the efficiency of small transformers to deal with long input sequences? In this paper, we show that transformers with long input sequences (large transformers) can be efficiently simulated by transformers that can only take short input sequences (small transformers). Specifically, we prove that any transformer with input length $N$ can be efficiently simulated by only $O((N/M)^2)$ transformers with input length $M \ll N$, and that this cannot be improved in the worst case. However, we then prove that in various natural scenarios including average-case inputs, sliding window masking and attention sinks, the optimal number $O(N/M)$ of small transformers suffice.
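To make the headline claim concrete for a single attention layer: the length-$N$ computation can be split into $N/M$ query chunks and $N/M$ key/value chunks, each of the $(N/M)^2$ chunk pairs is a problem of size $M$, and the partial results are merged with a numerically stable online softmax. The sketch below illustrates that chunked decomposition in NumPy; it is not the paper's construction, and the function names and the choice of $M$ are illustrative assumptions.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Reference O(N^2) attention for comparison."""
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    return (P / P.sum(axis=-1, keepdims=True)) @ V

def chunked_attention(Q, K, V, M):
    """Attention over length-N inputs using only length-M pieces.

    Each of the (N/M)^2 (query-chunk, key-chunk) pairs is a small
    size-M problem; partial results are merged with an online
    (streaming) softmax so the output matches the full computation.
    """
    N, d = Q.shape
    out = np.zeros_like(V)
    for qs in range(0, N, M):
        q = Q[qs:qs + M]
        m = np.full(q.shape[0], -np.inf)          # running max of logits
        l = np.zeros(q.shape[0])                  # running normalizer
        acc = np.zeros((q.shape[0], V.shape[1]))  # running weighted sum
        for ks in range(0, N, M):
            k, v = K[ks:ks + M], V[ks:ks + M]
            s = q @ k.T / np.sqrt(d)              # M x M block of logits
            m_new = np.maximum(m, s.max(axis=-1))
            p = np.exp(s - m_new[:, None])
            scale = np.exp(m - m_new)             # rescale old statistics
            l = l * scale + p.sum(axis=-1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        out[qs:qs + M] = acc / l[:, None]
    return out

# The chunked and full computations agree up to floating-point error.
rng = np.random.default_rng(0)
N, M, d = 16, 4, 8
Q, K, V = rng.normal(size=(3, N, d))
assert np.allclose(chunked_attention(Q, K, V, M), softmax_attention(Q, K, V))
```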
Related papers
- Constant Bit-size Transformers Are Turing Complete [8.38684825915246]
We prove that any Turing machine running on inputs of arbitrary length can be simulated by a constant bit-size transformer. Our approach relies on simulating Post machines, a Turing-complete computational model.
arXiv Detail & Related papers (2025-05-22T02:45:38Z) - Concise One-Layer Transformers Can Do Function Evaluation (Sometimes) [1.157192696857674]
This paper contributes to the study of the expressive capacity of transformers. We focus on their ability to perform the fundamental computational task of evaluating an arbitrary function from $[n]$ to $[n]$ at a given argument.
arXiv Detail & Related papers (2025-03-28T01:40:23Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers [8.908747084128397]
We introduce the temporal counting logic $\textsf{K}_{\textsf{t}}[\#]$ alongside the RASP variant $\textsf{C-RASP}$. We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers with unbounded input size.
arXiv Detail & Related papers (2024-04-05T20:36:30Z) - Chain of Thought Empowers Transformers to Solve Inherently Serial Problems [57.58801785642868]
Chain of thought (CoT) is a highly effective method to improve the accuracy of large language models (LLMs) on arithmetic and symbolic reasoning tasks.
This work provides a theoretical understanding of the power of CoT for decoder-only transformers through the lens of expressiveness.
arXiv Detail & Related papers (2024-02-20T10:11:03Z) - Transformer with a Mixture of Gaussian Keys [31.91701434633319]
Multi-head attention is a driving force behind state-of-the-art transformers.
Transformer-MGK replaces redundant heads in transformers with a mixture of keys at each head.
Compared to its conventional transformer counterpart, Transformer-MGK accelerates training and inference, has fewer parameters, and requires fewer FLOPs to compute.
arXiv Detail & Related papers (2021-10-16T23:43:24Z) - On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers.
The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z) - Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE).
Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using the Fast Fourier Transform (FFT).
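The Toeplitz-to-FFT step mentioned above rests on the classical fact that a Toeplitz matrix-vector product can be computed in $O(N \log N)$ by embedding the matrix in a circulant one and multiplying pointwise in the frequency domain. Below is a minimal NumPy sketch of just that primitive, not the paper's full kernelized-attention algorithm; function and variable names are assumptions.

```python
import numpy as np

def toeplitz_matvec_fft(c, r, x):
    """Multiply a Toeplitz matrix by a vector in O(N log N) via the FFT.

    c: first column (length N); r: first row (length N, r[0] == c[0]).
    The Toeplitz matrix is embedded in a 2N x 2N circulant matrix, whose
    eigenvalues are the FFT of its first column, so multiplication is a
    pointwise product in frequency space.
    """
    N = len(x)
    # First column of the 2N-point circulant embedding.
    col = np.concatenate([c, [0.0], r[1:][::-1]])
    y = np.fft.ifft(np.fft.fft(col) * np.fft.fft(np.concatenate([x, np.zeros(N)])))
    return y[:N].real

# Check against the dense O(N^2) product.
rng = np.random.default_rng(0)
N = 8
c, x = rng.normal(size=(2, N))
r = np.concatenate([[c[0]], rng.normal(size=N - 1)])
T = np.array([[c[i - j] if i >= j else r[j - i] for j in range(N)] for i in range(N)])
assert np.allclose(T @ x, toeplitz_matvec_fft(c, r, x))
```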
arXiv Detail & Related papers (2021-06-23T17:51:26Z) - Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel scalable Transformers, which naturally contain sub-Transformers of different scales and have shared parameters.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention [22.228028613802174]
Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, they are prohibitively slow for very long sequences.
We make use of the associativity property of matrix products to reduce the complexity from $\mathcal{O}\left(N^2\right)$ to $\mathcal{O}\left(N\right)$, where $N$ is the sequence length.
Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
arXiv Detail & Related papers (2020-06-29T17:55:38Z)
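The $\mathcal{O}(N^2)$-to-$\mathcal{O}(N)$ reduction in this last entry comes from replacing the softmax with a kernel feature map $\phi$ and reassociating the product: instead of forming the $N \times N$ matrix $\phi(Q)\phi(K)^\top$, one first computes the small $d \times d$ summary $\phi(K)^\top V$. A minimal non-causal NumPy sketch follows; the feature map and shapes are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def phi(x):
    # ELU(x) + 1: a common positive feature map for linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """O(N) attention: compute phi(Q) @ (phi(K)^T V) instead of (phi(Q) phi(K)^T) @ V."""
    Qf, Kf = phi(Q), phi(K)
    KV = Kf.T @ V                 # d x d_v summary, no N x N matrix is ever formed
    Z = Qf @ Kf.sum(axis=0)       # per-query normalizer, length N
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
N, d = 1024, 64
Q, K, V = rng.normal(size=(3, N, d))
out = linear_attention(Q, K, V)   # shape (N, d), computed in time linear in N
print(out.shape)
```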