Why transformers are obviously good models of language
- URL: http://arxiv.org/abs/2408.03855v1
- Date: Wed, 7 Aug 2024 15:52:46 GMT
- Title: Why transformers are obviously good models of language
- Authors: Felix Hill
- Abstract summary: Transformers are neural networks that process language automatically with more success than alternatives.
I highlight direct connections between the transformer architecture and certain theoretical perspectives on language.
- Score: 19.214362677861025
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Nobody knows how language works, but many theories abound. Transformers are a class of neural networks that process language automatically with more success than alternatives, both those based on neural computations and those that rely on other (e.g. more symbolic) mechanisms. Here, I highlight direct connections between the transformer architecture and certain theoretical perspectives on language. The empirical success of transformers relative to alternative models provides circumstantial evidence that the linguistic approaches that transformers embody should be, at least, evaluated with greater scrutiny by the linguistics community and, at best, considered to be the currently best available theories.
Related papers
- Extracting Finite State Machines from Transformers [0.3069335774032178]
We investigate the trainability of transformers on regular languages from a mechanistic interpretability perspective.
We empirically find tighter lower bounds on the trainability of transformers when a finite number of symbols determine the state.
Our mechanistic insight allows us to characterise the regular languages a one-layer transformer can learn with good length generalisation.
arXiv Detail & Related papers (2024-10-08T13:43:50Z)
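As a concrete reference point for the entry above, the ground-truth object in such experiments is a finite state machine over a regular language. The sketch below, a DFA for the regular language (ab)*, is only an illustration of that setup, not code from the paper.

```python
# A minimal deterministic finite automaton (DFA) for the regular language
# (ab)* over the alphabet {a, b}. Work in this line trains transformers on
# strings labeled by such machines; this is an illustration of the
# ground-truth object, not code from the paper.

DFA = {
    "start": "q0",
    "accept": {"q0"},
    # transitions[state][symbol] -> next state ("dead" absorbs rejects)
    "transitions": {
        "q0": {"a": "q1", "b": "dead"},
        "q1": {"a": "dead", "b": "q0"},
        "dead": {"a": "dead", "b": "dead"},
    },
}

def accepts(dfa, string):
    state = dfa["start"]
    for symbol in string:
        state = dfa["transitions"][state][symbol]
    return state in dfa["accept"]

assert accepts(DFA, "abab")
assert not accepts(DFA, "aba")
```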
- Can Transformers Learn $n$-gram Language Models? [77.35809823602307]
We study transformers' ability to learn random $n$-gram LMs of two kinds.
We find that classic estimation techniques for $n$-gram LMs, such as add-$\lambda$ smoothing, outperform transformers.
arXiv Detail & Related papers (2024-10-03T21:21:02Z)
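For context on the classical baseline in the entry above: add-$\lambda$ (Lidstone) smoothing estimates $P(w \mid h) = (c(h,w) + \lambda) / (c(h) + \lambda |V|)$. A minimal bigram sketch of this textbook estimator, not code from the paper:

```python
from collections import Counter

def add_lambda_bigram(corpus, vocab, lam=0.1):
    """Add-lambda (Lidstone) smoothed bigram model:
    P(w | h) = (count(h, w) + lam) / (count(h) + lam * |V|).
    Textbook estimator, shown as the classical baseline the entry
    above compares transformers against."""
    bigrams = Counter(zip(corpus, corpus[1:]))
    unigrams = Counter(corpus[:-1])
    size = len(vocab)

    def prob(h, w):
        return (bigrams[(h, w)] + lam) / (unigrams[h] + lam * size)

    return prob

corpus = "the cat sat on the mat".split()
p = add_lambda_bigram(corpus, vocab=set(corpus))
print(p("the", "cat"))  # seen bigram: high probability
print(p("the", "sat"))  # unseen bigram: small but nonzero mass
```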
- A Transformer with Stack Attention [84.18399019794036]
We propose augmenting transformer-based language models with a differentiable, stack-based attention mechanism.
Our stack-based attention mechanism can be incorporated into any transformer-based language model and adds a level of interpretability to the model.
We show that the addition of our stack-based attention mechanism enables the transformer to model some, but not all, deterministic context-free languages.
arXiv Detail & Related papers (2024-05-07T17:47:57Z)
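As rough background for the stack-attention entry above (my construction, not the paper's mechanism), a stack becomes differentiable when every cell is updated as a convex mixture of push, pop, and no-op outcomes, in the style of superposition stacks:

```python
import numpy as np

def stack_step(stack, push, pop, value):
    """One soft stack update. `stack` is (depth, dim); `push` and `pop`
    are scalars in [0, 1] whose remainder is an implicit no-op weight;
    `value` is the vector to push. Each cell becomes a convex mixture of
    the pushed, popped, and unchanged stack states. A rough intuition
    for differentiable stacks, not the paper's design."""
    noop = 1.0 - push - pop
    pushed = np.vstack([value, stack[:-1]])                     # cells move down
    popped = np.vstack([stack[1:], np.zeros_like(stack[:1])])   # cells move up
    return push * pushed + pop * popped + noop * stack

depth, dim = 4, 3
stack = np.zeros((depth, dim))
stack = stack_step(stack, push=1.0, pop=0.0, value=np.ones(dim))      # hard push
stack = stack_step(stack, push=0.6, pop=0.1, value=2 * np.ones(dim))  # soft push
print(stack[0])  # soft top of stack
```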
- Revenge of the Fallen? Recurrent Models Match Transformers at Predicting Human Language Comprehension Metrics [3.3932293160775298]
We show that contemporary recurrent models are able to match, and in some cases exceed, the performance of comparably sized transformers at modeling online human language comprehension.
This suggests that transformer language models are not uniquely suited to this task, and opens up new directions for debates about the extent to which architectural features of language models make them better or worse models of human language comprehension.
arXiv Detail & Related papers (2024-04-30T01:02:15Z)
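The comprehension metrics in the entry above are typically per-word surprisal, $-\log_2 P(w_t \mid w_{<t})$, compared against human reading measures. A small sketch that works with any next-word distribution, recurrent or transformer; the uniform model below is only a placeholder:

```python
import math

def surprisals(tokens, next_word_prob):
    """Per-token surprisal, -log2 P(w_t | w_<t), the standard linking
    quantity between language models and human reading measures such as
    reading times. `next_word_prob(context, word)` can wrap any trained
    LM; the entry above compares recurrent and transformer variants."""
    return [
        -math.log2(next_word_prob(tokens[:t], tokens[t]))
        for t in range(1, len(tokens))
    ]

# Toy uniform model over a 10-word vocabulary, just to run the function;
# in practice next_word_prob would query a trained language model.
uniform = lambda context, word: 1 / 10
print(surprisals("the cat sat".split(), uniform))  # [~3.32, ~3.32]
```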
- Transformers are Expressive, But Are They Expressive Enough for Regression? [38.369337945109855]
We show that Transformers struggle to reliably approximate smooth functions, relying on piecewise constant approximations with sizable intervals.
By shedding light on these challenges, we advocate a refined understanding of Transformers' capabilities.
arXiv Detail & Related papers (2024-02-23T18:12:53Z)
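To make the piecewise-constant claim above concrete, the worked example below (mine, not the paper's) measures how well the best piecewise-constant fit with n equal intervals approximates a smooth target; the error shrinks only as the number of pieces grows:

```python
import numpy as np

def piecewise_constant_error(f, n_pieces, grid=10_000):
    """Max error of a piecewise-constant approximation of f on [0, 1]
    with n_pieces equal intervals, each piece set to the interval
    midpoint value. Illustrates why coarse piecewise-constant fits,
    which the entry above attributes to transformers, approximate
    smooth functions poorly until n_pieces grows large."""
    x = np.linspace(0.0, 1.0, grid)
    edges = np.linspace(0.0, 1.0, n_pieces + 1)
    mids = f((edges[:-1] + edges[1:]) / 2.0)
    idx = np.clip(np.searchsorted(edges, x, side="right") - 1, 0, n_pieces - 1)
    return np.max(np.abs(f(x) - mids[idx]))

f = lambda x: np.sin(2 * np.pi * x)
for n in (4, 16, 64):
    print(n, piecewise_constant_error(f, n))  # error shrinks roughly like 1/n
```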
- Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation [52.270712965271656]
We propose a new model of contextual word representation, not from a neural perspective, but from a purely syntactic and probabilistic perspective.
We find that the graph of our model resembles transformers, with correspondences between dependencies and self-attention.
Experiments show that our model performs competitively to transformers on small to medium sized datasets.
arXiv Detail & Related papers (2023-11-26T06:56:02Z)
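The correspondence reported above can be pictured directly: row-normalized pairwise head-dependent scores have exactly the shape of a self-attention matrix. A schematic analogy (my sketch, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8                      # 5 words, 8-dim features
X = rng.normal(size=(n, d))      # word representations

# Pairwise head-dependent scores, scored bilinearly as in graph-based
# dependency parsing; the same matrix shape arises as QK^T in self-attention.
W = rng.normal(size=(d, d))
scores = X @ W @ X.T             # scores[i, j]: word j as head of word i

# A row-softmax turns scores into a distribution over candidate heads for
# each word, structurally identical to a row of attention weights.
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)
print(attn.round(2))             # each row sums to 1, like attention weights
```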
- Characterizing Intrinsic Compositionality in Transformers with Tree Projections [72.45375959893218]
Neural models like transformers can route information arbitrarily between different parts of their input.
We show that transformers for three different tasks become more treelike over the course of training.
These trees are predictive of model behavior, with more tree-like models generalizing better on tests of compositional generalization.
arXiv Detail & Related papers (2022-11-02T17:10:07Z)
- Thinking Like Transformers [64.96770952820691]
We propose RASP (Restricted Access Sequence Processing), a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
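RASP's warm-up task from the entry above is the histogram: each position outputs how often its token occurs in the sequence. In RASP this is a same-token selector followed by selector_width; the numpy sketch below emulates the same computation with an explicit 0/1 selector matrix (an illustration of the idea, not RASP itself):

```python
import numpy as np

def histogram(tokens):
    """RASP-style histogram: position i outputs the number of positions
    whose token equals tokens[i]. In RASP this is select(tokens, tokens, ==)
    followed by selector_width; here the selector is an explicit 0/1
    attention-like matrix. A sketch of the idea, not RASP itself."""
    t = np.array(tokens)
    selector = (t[:, None] == t[None, :]).astype(int)  # who attends to whom
    return selector.sum(axis=1)                        # width of each selector row

print(histogram(list("hello")))  # [1 1 2 2 1]
```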
- Transformer visualization via dictionary learning: contextualized embedding as a linear superposition of transformer factors [15.348047288817478]
We propose to use dictionary learning to open up "black boxes" as linear superpositions of transformer factors.
Through visualization, we demonstrate the hierarchical semantic structures captured by the transformer factors.
We hope this visualization tool can bring further knowledge and a better understanding of how transformer networks work.
arXiv Detail & Related papers (2021-03-29T20:51:33Z)
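A minimal version of the recipe above can be run with off-the-shelf sparse coding. The synthetic matrix below stands in for hidden states collected from one transformer layer, and the hyperparameters are illustrative assumptions rather than the authors' configuration:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

# Synthetic stand-in for contextual embeddings; in the visualization
# setting these rows would be hidden states gathered from one transformer
# layer over a corpus. A sketch of the recipe, not the authors' setup.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 32))   # 200 "token embeddings", 32 dims

dl = DictionaryLearning(n_components=16, alpha=1.0, max_iter=50, random_state=0)
codes = dl.fit_transform(X)      # sparse coefficients per token

# Each embedding is approximated as a sparse linear superposition of the
# learned "transformer factors" (the dictionary atoms in dl.components_).
recon = codes @ dl.components_
print("active factors per token:", (np.abs(codes) > 1e-8).sum(axis=1).mean())
print("relative reconstruction error:", np.linalg.norm(X - recon) / np.linalg.norm(X))
```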
- On the Ability and Limitations of Transformers to Recognize Formal Languages [9.12267978757844]
We provide a construction of Transformers for a subclass of counter languages.
We find that Transformers do well on this subclass, and their learned mechanism strongly correlates with our construction.
Perhaps surprisingly, and in contrast to LSTMs, Transformers do well only on a subset of regular languages, with performance degrading as the languages become more complex.
arXiv Detail & Related papers (2020-09-23T17:21:33Z)
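Counter languages, the subclass in the entry above, are recognizable with a few integer counters; the canonical member is Dyck-1 (balanced parentheses). A one-counter recognizer gives a feel for the target of the construction, though the paper's actual construction is in terms of self-attention, not this sketch:

```python
def dyck1(string):
    """One-counter recognizer for Dyck-1 (balanced parentheses), the
    canonical counter language. Increment on '(', decrement on ')',
    reject if the counter goes negative, accept iff it ends at zero.
    Background for the entry above, not the paper's construction."""
    count = 0
    for ch in string:
        count += 1 if ch == "(" else -1
        if count < 0:
            return False
    return count == 0

assert dyck1("(()())")
assert not dyck1("())(")
```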
- Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism that generates phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)
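As background for the phrase-representation mechanism above, a generic attentive-pooling sketch (a simplification under my own assumptions, not the paper's exact mechanism) builds one phrase vector from its token vectors:

```python
import numpy as np

def attentive_phrase_vector(token_vecs, query):
    """Pool a phrase's token representations into one vector with
    attention: score each token against a pooling query, softmax the
    scores, and return the weighted sum. A generic simplification of
    attentive phrase representation, not the paper's exact mechanism."""
    scores = token_vecs @ query                    # (num_tokens,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ token_vecs                    # (dim,)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 16))   # e.g. token vectors for "the red car"
query = rng.normal(size=16)         # hypothetical learned pooling query
print(attentive_phrase_vector(tokens, query).shape)  # (16,)
```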