Related papers: Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding

Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding

URL: http://arxiv.org/abs/2410.12413v1
Date: Wed, 16 Oct 2024 09:56:01 GMT
Title: Theoretical Analysis of Hierarchical Language Recognition and Generation by Transformers without Positional Encoding
Authors: Daichi Hayakawa, Issei Sato,
Abstract summary: We show that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures. We demonstrate that Transformers without positional encoding can generate hierarchical languages.
Score: 32.01426831450348
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In this study, we provide constructive proof that Transformers can recognize and generate hierarchical language efficiently with respect to model size, even without the need for a specific positional encoding. Specifically, we show that causal masking and a starting token enable Transformers to compute positional information and depth within hierarchical structures. We demonstrate that Transformers without positional encoding can generate hierarchical languages. Furthermore, we suggest that explicit positional encoding might have a detrimental effect on generalization with respect to sequence length.

Related papers

Enhancing Transformers for Generalizable First-Order Logical Entailment [51.04944136538266]
This paper investigates the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge. The first-order reasoning capability of transformers is assessed through their ability to perform first-order logical entailment. We propose a more sophisticated, logic-aware architecture, TEGA, to enhance the capability for generalizable first-order logical entailment in transformers.
arXiv Detail & Related papers (2025-01-01T07:05:32Z)
Position Information Emerges in Causal Transformers Without Positional Encodings via Similarity of Nearby Embeddings [3.0559252110342703]
We propose and investigate a new hypothesis about how positional information can be stored without using explicit positional encoding. We observe that nearby embeddings are more similar to each other than faraway embeddings, allowing the transformer to potentially reconstruct the positions of tokens.
arXiv Detail & Related papers (2024-12-30T03:35:41Z)
Improving Transformers using Faithful Positional Encoding [55.30212768657544]
We propose a new positional encoding method for a neural network architecture called the Transformer. Unlike the standard sinusoidal positional encoding, our approach has a guarantee of not losing information about the positional order of the input sequence.
arXiv Detail & Related papers (2024-05-15T03:17:30Z)
Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone. We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
Systematic Generalization and Emergent Structures in Transformers Trained on Structured Tasks [6.525090891505941]
We show how a causal transformer can perform a set of algorithmic tasks, including copying, sorting, and hierarchical compositions. We show that two-layer transformers learn generalizable solutions to multi-level problems and develop signs of systematic task decomposition. These results provide key insights into how transformer models may be capable of decomposing complex decisions into reusable, multi-level policies.
arXiv Detail & Related papers (2022-10-02T00:46:36Z)
Structural Biases for Improving Transformers on Translation into Morphologically Rich Languages [120.74406230847904]
TP-Transformer augments the traditional Transformer architecture to include an additional component to represent structure. The second method imbues structure at the data level by segmenting the data with morphological tokenization. We find that each of these two approaches allows the network to achieve better performance, but this improvement is dependent on the size of the dataset.
arXiv Detail & Related papers (2022-08-11T22:42:24Z)
Transformer with Tree-order Encoding for Neural Program Generation [8.173517923612426]
We introduce a tree-based positional encoding and a shared natural-language subword vocabulary for Transformers. Our findings suggest that employing a tree-based positional encoding in combination with a shared natural-language subword vocabulary improves generation performance over sequential positional encodings.
arXiv Detail & Related papers (2022-05-30T12:27:48Z)
Transformer Language Models without Positional Encodings Still Learn Positional Information [45.42248458957122]
We find that transformer language models without any explicit positional encoding are still competitive with standard models. We conjecture that causal attention enables the model to infer the number of predecessors that each token can attend to, thereby approximating its absolute position.
arXiv Detail & Related papers (2022-03-30T19:37:07Z)
Thinking Like Transformers [64.96770952820691]
We propose a computational model for the transformer-encoder in the form of a programming language. We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer. We provide RASP programs for histograms, sorting, and Dyck-languages.
arXiv Detail & Related papers (2021-06-13T13:04:46Z)
Do We Really Need Explicit Position Encodings for Vision Transformers? [29.7662570764424]
We propose a conditional position encoding scheme, which is conditioned on the local neighborhood of the input token. Our new model with PEG is named Visual Transformer (CPVT) and can naturally process the input sequences of arbitrary length. We demonstrate that CPVT can result in visually similar attention maps and even better performance than those with predefined positional encodings.
arXiv Detail & Related papers (2021-02-22T10:29:55Z)
Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens. We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model. We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset.
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
On Sparsifying Encoder Outputs in Sequence-to-Sequence Models [90.58793284654692]
We take Transformer as the testbed and introduce a layer of gates in-between the encoder and the decoder. The gates are regularized using the expected value of the sparsity-inducing L0penalty. We investigate the effects of this sparsification on two machine translation and two summarization tasks.
arXiv Detail & Related papers (2020-04-24T16:57:52Z)

This list is automatically generated from the titles and abstracts of the papers in this site.