The Case for Translation-Invariant Self-Attention in Transformer-Based
Language Models
- URL: http://arxiv.org/abs/2106.01950v1
- Date: Thu, 3 Jun 2021 15:56:26 GMT
- Title: The Case for Translation-Invariant Self-Attention in Transformer-Based
Language Models
- Authors: Ulme Wennberg, Gustav Eje Henter
- Abstract summary: We analyze the position embeddings of existing language models and find strong evidence of translation invariance.
We propose translation-invariant self-attention (TISA), which accounts for the relative position between tokens in an interpretable fashion.
- Score: 11.148662334602639
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mechanisms for encoding positional information are central for
transformer-based language models. In this paper, we analyze the position
embeddings of existing language models, finding strong evidence of translation
invariance, both for the embeddings themselves and for their effect on
self-attention. The degree of translation invariance increases during training
and correlates positively with model performance. Our findings lead us to
propose translation-invariant self-attention (TISA), which accounts for the
relative position between tokens in an interpretable fashion without needing
conventional position embeddings. Our proposal has several theoretical
advantages over existing position-representation approaches. Experiments show
that it improves on regular ALBERT on GLUE tasks while adding orders of
magnitude fewer positional parameters.
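To make the idea concrete, below is a minimal sketch of self-attention whose positional term depends only on the relative offset j - i, so it is unchanged when the whole sequence is shifted. This is not the paper's exact parameterization (TISA uses interpretable kernel functions for the positional scores); the module name, the clipping distance `max_offset`, and the learned per-offset bias here are illustrative assumptions.

```python
# Sketch of translation-invariant self-attention: the positional contribution
# to each attention logit is a function of the relative offset j - i alone,
# never of absolute positions. The learned per-offset bias below is a
# simplified stand-in for the paper's kernel-based positional scores.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TranslationInvariantSelfAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int, max_offset: int = 16):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.max_offset = max_offset
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # One positional score per head and clipped relative offset:
        # 2 * max_offset + 1 parameters per head, independent of both the
        # sequence length and the embedding dimension.
        self.rel_bias = nn.Parameter(torch.zeros(n_heads, 2 * max_offset + 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        # Content term: standard scaled dot-product logits.
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5

        # Positional term: bias[i, j] depends only on j - i (clipped to
        # +/- max_offset), so translating the sequence leaves it unchanged.
        pos = torch.arange(t, device=x.device)
        offsets = (pos[None, :] - pos[:, None]).clamp(
            -self.max_offset, self.max_offset) + self.max_offset
        logits = logits + self.rel_bias[:, offsets]  # broadcasts over batch

        attn = F.softmax(logits, dim=-1)
        y = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(y)
```

Translation invariance here means the matrix of positional scores is Toeplitz (constant along each diagonal); the paper's analysis of existing models essentially measures how close learned absolute position embeddings come to inducing such structure in the attention.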
Related papers
- Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue in modern language models (LMs).
Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings.
By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z)
- Context-Aware Machine Translation with Source Coreference Explanation [26.336947440529713]
We propose a model that explains the decisions made for translation by predicting coreference features in the input.
We evaluate our method on the WMT English-German document-level translation task, the English-Russian dataset, and the multilingual TED talk dataset.
arXiv Detail & Related papers (2024-04-30T12:41:00Z)
- Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance; a toy illustration of this effect appears in the sketch after this list.
Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
- Multiplicative Position-aware Transformer Models for Language Understanding [17.476450946279037]
Transformer models, which leverage architectural improvements like self-attention, perform remarkably well on Natural Language Processing (NLP) tasks.
In this paper, we review major existing position embedding methods and compare their accuracy on downstream NLP tasks.
We also propose a novel multiplicative embedding method which leads to superior accuracy when compared to existing methods.
arXiv Detail & Related papers (2021-09-27T04:18:32Z)
- Improving Multilingual Translation by Representation and Gradient Regularization [82.42760103045083]
We propose a joint approach that regularizes NMT models at both the representation level and the gradient level.
Our results demonstrate that our approach is highly effective in both reducing off-target translation occurrences and improving zero-shot translation performance.
arXiv Detail & Related papers (2021-09-10T10:52:21Z)
- Improving Zero-Shot Translation by Disentangling Positional Information [24.02434897109097]
We show that a main factor causing the language-specific representations is the positional correspondence to input tokens.
We gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions.
arXiv Detail & Related papers (2020-12-30T12:20:41Z)
- Relative Positional Encoding for Speech Recognition and Direct Translation [72.64499573561922]
We adapt the relative position encoding scheme to the Speech Transformer.
As a result, the network can better adapt to the variable distributions present in speech data.
arXiv Detail & Related papers (2020-05-20T09:53:06Z)
- Self-Attention with Cross-Lingual Position Representation [112.05807284056337]
Position encoding (PE) is used to preserve the word order information for natural language processing tasks, generating fixed position indices for input sequences.
Due to word order divergences between languages, modeling cross-lingual positional relationships might help SANs handle such divergence.
We augment SANs with cross-lingual position representations to model the bilingually aware latent structure of the input sentence.
arXiv Detail & Related papers (2020-04-28T05:23:43Z)
- Explicit Reordering for Neural Machine Translation [50.70683739103066]
In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2020-04-08T05:28:46Z)
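The variance-shrinkage finding referenced above ("Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings") can be illustrated with a toy simulation. The sketch below assumes uniform causal attention over i.i.d. value vectors, a deliberate simplification of the paper's analysis of trained models: averaging over the first t + 1 tokens makes the output variance at position t shrink roughly like 1/(t + 1), so position is recoverable from the variance alone.

```python
# Toy illustration (an assumption-laden simplification, not the paper's
# analysis): with causal attention and no position embeddings, position t
# attends over t + 1 tokens. Under uniform attention over i.i.d. value
# vectors, the output at position t is an average of t + 1 samples, so its
# variance shrinks like 1/(t + 1) -- later positions become distinguishable
# from earlier ones through this shrinkage alone.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, n_trials = 64, 256, 200

variances = np.zeros(seq_len)
for _ in range(n_trials):
    values = rng.standard_normal((seq_len, d_model))  # i.i.d. "value" vectors
    # Uniform causal attention: output[t] = mean(values[: t + 1])
    outputs = np.cumsum(values, axis=0) / np.arange(1, seq_len + 1)[:, None]
    variances += outputs.var(axis=1)
variances /= n_trials

for t in (0, 3, 15, 63):
    print(f"position {t:2d}: output variance ~ {variances[t]:.3f}"
          f"  (1 / (t + 1) = {1 / (t + 1):.3f})")
```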