Transformers Can Represent $n$-gram Language Models
- URL: http://arxiv.org/abs/2404.14994v3
- Date: Thu, 20 Jun 2024 15:21:23 GMT
- Title: Transformers Can Represent $n$-gram Language Models
- Authors: Anej Svete, Ryan Cotterell,
- Abstract summary: We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM.
- Score: 56.06361029539347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing work has analyzed the representational capacity of the transformer architecture by means of formal models of computation. However, the focus so far has been on analyzing the architecture in terms of language \emph{acceptance}. We contend that this is an ill-suited problem in the study of \emph{language models} (LMs), which are definitionally \emph{probability distributions} over strings. In this paper, we focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models. We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM, giving us a concrete lower bound on their probabilistic representational capacity. This provides a first step towards understanding the mechanisms that transformer LMs can use to represent probability distributions over strings.
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Extracting Finite State Machines from Transformers [0.3069335774032178]
We investigate the trainability of transformers trained on regular languages from a mechanistic interpretability perspective.
We empirically find tighter lower bounds on the trainability of transformers, when a finite number of symbols determine the state.
Our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation.
arXiv Detail & Related papers (2024-10-08T13:43:50Z) - Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$lambda$ smoothing outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z) - On the Representational Capacity of Neural Language Models with Chain-of-Thought Reasoning [87.73401758641089]
Chain-of-thought (CoT) reasoning has improved the performance of modern language models (LMs)
We show that LMs can represent the same family of distributions over strings as probabilistic Turing machines.
arXiv Detail & Related papers (2024-06-20T10:59:02Z) - Transformers need glasses! Information over-squashing in language tasks [18.81066657470662]
We study how information propagates in decoder-only Transformers.
We show that certain sequences of inputs to the Transformer can yield arbitrarily close representations in the final token.
We also show that decoder-only Transformer language models can lose sensitivity to specific tokens in the input.
arXiv Detail & Related papers (2024-06-06T17:14:44Z) - Toward a Theory of Tokenization in LLMs [26.516041872337887]
We study tokenization from a theoretical point of view by studying the behavior of transformers on simple data generating processes.
We show that even the simplest unigram models (over tokens) learnt by transformers are able to model the probability of sequences drawn from $ktextth$-order Markov sources near optimally.
arXiv Detail & Related papers (2024-04-12T09:01:14Z) - Linearity of Relation Decoding in Transformer Language Models [82.47019600662874]
Much of the knowledge encoded in transformer language models (LMs) may be expressed in terms of relations.
We show that, for a subset of relations, this computation is well-approximated by a single linear transformation on the subject representation.
arXiv Detail & Related papers (2023-08-17T17:59:19Z) - Learning Bounded Context-Free-Grammar via LSTM and the
Transformer:Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.