Related papers: When Can Transformers Count to n?

When Can Transformers Count to n?

URL: http://arxiv.org/abs/2407.15160v2
Date: Mon, 7 Oct 2024 13:19:53 GMT
Title: When Can Transformers Count to n?
Authors: Gilad Yehudai, Haim Kaplan, Asma Ghandeharioun, Mor Geva, Amir Globerson,
Abstract summary: We show that if the dimension of the transformer state is linear in the context length, this task can be solved. We provide theoretical arguments for why it is likely impossible for a size limited transformer to implement this task. Our results demonstrate the importance of understanding how transformers can solve simple tasks.
Score: 48.32323039293186
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large language models based on the transformer architectures can solve highly complex tasks. But are there simple tasks that such models cannot solve? Here we focus on very simple counting tasks, that involve counting how many times a token in the vocabulary have appeared in a string. We show that if the dimension of the transformer state is linear in the context length, this task can be solved. However, the solution we propose does not scale beyond this limit, and we provide theoretical arguments for why it is likely impossible for a size limited transformer to implement this task. Our empirical results demonstrate the same phase-transition in performance, as anticipated by the theoretical argument. Our results demonstrate the importance of understanding how transformers can solve simple tasks.

Related papers

Concise One-Layer Transformers Can Do Function Evaluation (Sometimes) [1.157192696857674]
This paper contributes to the study of the expressive capacity of transformers. We focus on their ability to perform the fundamental computational task of evaluating an arbitrary function from $[n]$ to $[n]$ at a given argument.
arXiv Detail & Related papers (2025-03-28T01:40:23Z)
On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks. We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z)
Separations in the Representational Capabilities of Transformers and Recurrent Architectures [27.783705012503237]
We analyze the differences in the representational capabilities of Transformers and RNNs across several tasks of practical relevance. We show that a one-layer Transformer of logarithmic width can perform index lookup, whereas an RNN requires a hidden state of linear size. We also show that a log-size two-layer Transformer can implement the nearest neighbor algorithm in its forward pass.
arXiv Detail & Related papers (2024-06-13T17:31:30Z)
Does learning the right latent variables necessarily improve in-context learning? [13.828665019247444]
Large autoregressive models like Transformers can solve tasks through in-context learning (ICL) without learning new weights. In this paper, we investigate the effect of explicitly inferring task latents. We find little discernible difference between the two; biasing towards task-relevant latent variables does not lead to better out-of-distribution performance.
arXiv Detail & Related papers (2024-05-29T15:06:10Z)
The Expressive Power of Transformers with Chain of Thought [29.839710738657203]
In practice, transformers can be improved by allowing them to use a "chain of thought" or "scratchpad" We show that the answer is yes, but the amount of increase depends crucially on the amount of intermediate generation. Our results also imply that linear steps keep transformer decoders within context-sensitive languages.
arXiv Detail & Related papers (2023-10-11T22:35:18Z)
Towards Revealing the Mystery behind Chain of Thought: A Theoretical Perspective [39.47116013338394]
Chain-of-Thought prompting (CoT) can dramatically improve the performance of Large Language Models (LLMs) We show that CoT can handle a general class of decision-making problems known as Dynamic Programming.
arXiv Detail & Related papers (2023-05-24T17:59:21Z)
Perceiver-Actor: A Multi-Task Transformer for Robotic Manipulation [52.94101901600948]
We develop PerAct, a language-conditioned behavior-cloning agent for multi-task 6-DoF manipulation. PerAct encodes language goals and RGB-D voxel observations with a Perceiver Transformer, and outputs discretized actions by "detecting the next best voxel action" Our results show that PerAct significantly outperforms unstructured image-to-action agents and 3D ConvNet baselines for a wide range of tabletop tasks.
arXiv Detail & Related papers (2022-09-12T17:51:05Z)
The Parallelism Tradeoff: Limitations of Log-Precision Transformers [29.716269397142973]
We prove that transformers whose arithmetic precision is logarithmic in the number of input tokens can be simulated by constant-depth logspace-uniform threshold circuits. This provides insight on the power of transformers using known results in complexity theory.
arXiv Detail & Related papers (2022-07-02T03:49:34Z)
Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach to project the source sequence into lower dimension representation. MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation. We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
arXiv Detail & Related papers (2022-03-23T18:10:18Z)
On the Power of Saturated Transformers: A View from Circuit Complexity [87.20342701232869]
We show that saturated transformers transcend the limitations of hard-attention transformers. The jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(log n)$.
arXiv Detail & Related papers (2021-06-30T17:09:47Z)
Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks. We propose the Feedback Transformer architecture that exposes all previous representations to all future representations. We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.