Tighter Bounds on the Expressivity of Transformer Encoders
- URL: http://arxiv.org/abs/2301.10743v3
- Date: Mon, 13 Nov 2023 15:19:02 GMT
- Title: Tighter Bounds on the Expressivity of Transformer Encoders
- Authors: David Chiang and Peter Cholak and Anand Pillay
- Abstract summary: We identify a variant of first-order logic with counting quantifiers that is simultaneously an upper bound for fixed-precision transformer encoders and a lower bound for transformer encoders.
This brings us much closer than before to an exact characterization of the languages that transformer encoders recognize.
- Score: 9.974865253097127
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Characterizing neural networks in terms of better-understood formal systems
has the potential to yield new insights into the power and limitations of these
networks. Doing so for transformers remains an active area of research.
Bhattamishra and others have shown that transformer encoders are at least as
expressive as a certain kind of counter machine, while Merrill and Sabharwal
have shown that fixed-precision transformer encoders recognize only languages
in uniform $TC^0$. We connect and strengthen these results by identifying a
variant of first-order logic with counting quantifiers that is simultaneously
an upper bound for fixed-precision transformer encoders and a lower bound for
transformer encoders. This brings us much closer than before to an exact
characterization of the languages that transformer encoders recognize.
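As a quick illustration of what counting quantifiers can express (our own example, not drawn from the paper): reading $\exists^{=y} x\, \varphi(x)$ as "exactly $y$ positions $x$ satisfy $\varphi$" and $Q_a(x)$ as "position $x$ holds the symbol $a$", the language of strings with equally many $a$'s and $b$'s is defined by the sentence
$$\exists y\, \bigl( \exists^{=y} x\, Q_a(x) \;\wedge\; \exists^{=y} x\, Q_b(x) \bigr).$$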
Related papers
- Transformers are Efficient Compilers, Provably [11.459397066286822]
Transformer-based large language models (LLMs) have demonstrated surprisingly robust performance across a wide range of language-related tasks.
In this paper, we take the first steps towards a formal investigation of using transformers as compilers from an expressive power perspective.
We introduce a representative programming language, Mini-Husky, which encapsulates key features of modern C-like languages.
arXiv Detail & Related papers (2024-10-07T20:31:13Z)
- Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study [52.91899050612153]
We study how well transformers within pre-trained language models (PLMs) perform when repurposed as encoders for Automatic Speech Recognition (ASR).
Our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated.
This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems' capabilities.
arXiv Detail & Related papers (2024-09-26T11:31:18Z)
- On the Expressive Power of a Variant of the Looped Transformer [83.30272757948829]
We design a novel transformer block, dubbed AlgoFormer, to empower transformers with algorithmic capabilities.
The proposed AlgoFormer can achieve significantly higher expressiveness in algorithm representation when using the same number of parameters.
Some theoretical and empirical results are presented to show that the designed transformer has the potential to be smarter than human-designed algorithms.
arXiv Detail & Related papers (2024-02-21T07:07:54Z)
- DecoderLens: Layerwise Interpretation of Encoder-Decoder Transformers [6.405360669408265]
We propose a simple, new method to analyze encoder-decoder Transformers: DecoderLens.
Inspired by the LogitLens (for decoder-only Transformers), this method allows the decoder to cross-attend to representations of intermediate encoder layers.
We report results from the DecoderLens applied to models trained on question answering, logical reasoning, speech recognition and machine translation.
arXiv Detail & Related papers (2023-10-05T17:04:59Z)
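To make the DecoderLens idea above concrete, here is a minimal PyTorch sketch of the mechanism as we read it from the abstract: instead of decoding only from the final encoder layer, the decoder is run against each intermediate encoder layer's output. The model sizes and variable names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

# Toy encoder-decoder Transformer (illustrative sizes only, not the authors' setup).
d_model, nhead, num_layers = 64, 4, 6
enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
encoder = nn.TransformerEncoder(enc_layer, num_layers=num_layers)
dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=num_layers)

src = torch.randn(1, 10, d_model)  # dummy source embeddings
tgt = torch.randn(1, 5, d_model)   # dummy target-prefix embeddings

# Run the encoder layer by layer, keeping every intermediate representation.
intermediates = []
x = src
for layer in encoder.layers:
    x = layer(x)
    intermediates.append(x)

# Decode once per encoder layer: the decoder cross-attends to that layer's output,
# so each element of per_layer_outputs reflects what that encoder depth represents.
per_layer_outputs = [decoder(tgt, memory) for memory in intermediates]
```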
- Error Correction Code Transformer [92.10654749898927]
We propose, for the first time, to extend the Transformer architecture to the soft decoding of linear codes at arbitrary block lengths.
We encode each channel output dimension into a high-dimensional representation so that the bit information can be processed separately.
The proposed approach demonstrates the extreme power and flexibility of Transformers and outperforms existing state-of-the-art neural decoders by large margins at a fraction of their time complexity.
arXiv Detail & Related papers (2022-03-27T15:25:58Z)
- Redesigning the Transformer Architecture with Insights from Multi-particle Dynamical Systems [32.86421107987556]
We build upon recent developments in analyzing deep neural networks as numerical solvers of ordinary differential equations.
We formulate a temporal evolution scheme, TransEvolve, to bypass costly dot-product attention over multiple stacked layers.
We perform exhaustive experiments with TransEvolve on well-known encoder-decoder as well as encoder-only tasks.
arXiv Detail & Related papers (2021-09-30T14:01:06Z)
- Multi-Stream Transformers [10.633944207679274]
Transformer-based encoder-decoder models produce a fused token-wise representation after every encoder layer.
We investigate the effects of allowing the encoder to preserve and explore alternative hypotheses, combined at the end of the encoding process.
arXiv Detail & Related papers (2021-07-21T20:16:57Z)
- Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language.
We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer.
We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
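To connect the "Thinking Like Transformers" entry above to something concrete, the following plain-Python sketch spells out the computation behind a RASP histogram program as we understand it from the abstract (our own paraphrase, not RASP code): every position counts how many positions in the input carry the same token.

```python
# A plain-Python paraphrase of the RASP "histogram" task (our own sketch, not RASP
# code): for each position i, count how many positions hold the same token as i.

def histogram(tokens: list[str]) -> list[int]:
    return [sum(1 for t in tokens if t == tokens[i]) for i in range(len(tokens))]

print(histogram(list("hello")))  # -> [1, 1, 2, 2, 1]
```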
- On the Sub-Layer Functionalities of Transformer Decoder [74.83087937309266]
We study how Transformer-based decoders leverage information from the source and target languages.
Based on these insights, we demonstrate that the residual feed-forward module in each Transformer decoder layer can be dropped with minimal loss of performance.
arXiv Detail & Related papers (2020-10-06T11:50:54Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
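As a rough sketch of what a fixed, non-learnable attentive pattern from the entry above might look like (our own illustration under assumed shapes, not the authors' code), the head below always attends to the previous token via a constant 0/1 attention matrix, with no learned query or key parameters.

```python
import torch

def previous_token_attention(values: torch.Tensor) -> torch.Tensor:
    """One hard-coded 'attend to the previous token' head (illustrative only).

    values: (seq_len, d) token representations.
    Returns a tensor of the same shape in which position i receives position i-1's
    value (position 0 attends to itself). The attention weights form a fixed 0/1
    matrix, so nothing here is learned.
    """
    seq_len = values.size(0)
    attn = torch.zeros(seq_len, seq_len)
    attn[0, 0] = 1.0
    for i in range(1, seq_len):
        attn[i, i - 1] = 1.0
    return attn @ values

x = torch.randn(6, 64)             # toy sequence of 6 token vectors
out = previous_token_attention(x)  # row i of out is x[i-1] (row 0 is x[0])
```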
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.