Can Transformers Learn $n$-gram Language Models?
- URL: http://arxiv.org/abs/2410.03001v1
- Date: Thu, 3 Oct 2024 21:21:02 GMT
- Title: Can Transformers Learn $n$-gram Language Models?
- Authors: Anej Svete, Nadav Borenstein, Mike Zhou, Isabelle Augenstein, Ryan Cotterell,
- Abstract summary: We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs such as add-$lambda$ smoothing outperform transformers.
- Score: 77.35809823602307
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much theoretical work has described the ability of transformers to represent formal languages. However, linking theoretical results to empirical performance is not straightforward due to the complex interplay between the architecture, the learning algorithm, and training data. To test whether theoretical lower bounds imply \emph{learnability} of formal languages, we turn to recent work relating transformers to $n$-gram language models (LMs). We study transformers' ability to learn random $n$-gram LMs of two kinds: ones with arbitrary next-symbol probabilities and ones where those are defined with shared parameters. We find that classic estimation techniques for $n$-gram LMs such as add-$\lambda$ smoothing outperform transformers on the former, while transformers perform better on the latter, outperforming methods specifically designed to learn $n$-gram LMs.
Related papers
- On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks.
We show that multilayer Transformers are not robust to even distributional shifts as small as $O(e-L)$ in Wasserstein distance.
arXiv Detail & Related papers (2024-10-29T03:27:56Z) - Bypassing the Exponential Dependency: Looped Transformers Efficiently Learn In-context by Multi-step Gradient Descent [26.764893400499354]
We show that linear looped Transformers can implement multi-step gradient descent efficiently for in-context learning.
Our results demonstrate that as long as the input data has a constant condition number, $n = O(d)$, the linear looped Transformers can achieve a small error.
arXiv Detail & Related papers (2024-10-15T04:44:23Z) - Transformers Can Represent $n$-gram Language Models [56.06361029539347]
We focus on the relationship between transformer LMs and $n$-gram LMs, a simple and historically relevant class of language models.
We show that transformer LMs using the hard or sparse attention mechanisms can exactly represent any $n$-gram LM.
arXiv Detail & Related papers (2024-04-23T12:51:37Z) - Counting Like Transformers: Compiling Temporal Counting Logic Into Softmax Transformers [8.908747084128397]
We introduce the temporal counting logic $textbfK_textt$[#] alongside the RASP variant $textbfC-RASP$.
We show they are equivalent to each other, and that together they are the best-known lower bound on the formal expressivity of future-masked soft attention transformers.
arXiv Detail & Related papers (2024-04-05T20:36:30Z) - How do Transformers perform In-Context Autoregressive Learning? [76.18489638049545]
We train a Transformer model on a simple next token prediction task.
We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping.
arXiv Detail & Related papers (2024-02-08T16:24:44Z) - Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models [23.944430707096103]
We show that Transformers learn to approximate higher-order optimization methods for in-context linear regression.
For in-context linear regression, Transformers share a similar convergence rate as Iterative Newton's Method.
We also show that Transformers can learn in-context on ill-conditioned data, a setting where Gradient Descent struggles but Iterative Newton succeeds.
arXiv Detail & Related papers (2023-10-26T01:08:47Z) - Transformers Learn Shortcuts to Automata [52.015990420075944]
We find that a low-depth Transformer can represent the computations of any finite-state automaton.
We show that a Transformer with $O(log T)$ layers can exactly replicate the computation of an automaton on an input sequence of length $T$.
We further investigate the brittleness of these solutions and propose potential mitigations.
arXiv Detail & Related papers (2022-10-19T17:45:48Z) - Learning Bounded Context-Free-Grammar via LSTM and the
Transformer:Difference and Explanations [51.77000472945441]
Long Short-Term Memory (LSTM) and Transformers are two popular neural architectures used for natural language processing tasks.
In practice, it is often observed that Transformer models have better representation power than LSTM.
We study such practical differences between LSTM and Transformer and propose an explanation based on their latent space decomposition patterns.
arXiv Detail & Related papers (2021-12-16T19:56:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.