Dynamic Position Encoding for Transformers
- URL: http://arxiv.org/abs/2204.08142v1
- Date: Mon, 18 Apr 2022 03:08:48 GMT
- Title: Dynamic Position Encoding for Transformers
- Authors: Joyce Zheng, Mehdi Rezagholizadeh, Peyman Passban
- Abstract summary: Recurrent models have been dominating the field of neural machine translation (NMT) for the past few years.
Transformers could fail to properly encode sequential/positional information due to their non-recurrent nature.
We propose a novel architecture with new position embeddings depending on the input text to address this shortcoming.
- Score: 18.315954297959617
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recurrent models have been dominating the field of neural machine translation
(NMT) for the past few years. Transformers (Vaswani et al., 2017) have
radically changed it by proposing a novel architecture that relies on a
feed-forward backbone and self-attention mechanism. Although Transformers are
powerful, they could fail to properly encode sequential/positional information
due to their non-recurrent nature. To solve this problem, position embeddings
are defined exclusively for each time step to enrich word information. However,
such embeddings are fixed after training regardless of the task and the word
ordering system of the source or target language.
In this paper, we propose a novel architecture with new position embeddings
depending on the input text to address this shortcoming by taking the order of
target words into consideration. Instead of using predefined position
embeddings, our solution *generates* new embeddings to refine each
word's position information. Since we do not dictate the position of source
tokens and learn them in an end-to-end fashion, we refer to our method as
*dynamic* position encoding (DPE). We evaluated the impact of our model
on multiple datasets to translate from English into German, French, and Italian
and observed meaningful improvements in comparison to the original Transformer.
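To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of input-dependent position information: a fixed sinusoidal signal is refined by a small learned generator conditioned on the word embeddings. The layer sizes, the single feed-forward generator, and the residual combination are illustrative assumptions, not the architecture reported in the paper.
```python
import math
import torch
import torch.nn as nn

class DynamicPositionEncoding(nn.Module):
    """Illustrative sketch: position information is *generated* from the token
    embeddings instead of being read from a fixed table.
    (Hypothetical layer sizes; not the architecture from the paper.)"""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        # A conventional fixed sinusoidal table, used only as a starting point.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Learned generator that refines the position signal from the words themselves.
        self.generator = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, word_emb: torch.Tensor) -> torch.Tensor:
        # word_emb: (batch, seq_len, d_model)
        seq_len = word_emb.size(1)
        fixed = self.pe[:seq_len].unsqueeze(0)      # (1, seq_len, d_model)
        dynamic = self.generator(word_emb + fixed)  # input-dependent refinement
        return word_emb + fixed + dynamic           # enriched word representation

# Usage sketch: emb = nn.Embedding(32000, 512); x = emb(token_ids)
# x = DynamicPositionEncoding(512)(x)  # then fed to the Transformer encoder
```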
Related papers
- A Frustratingly Easy Improvement for Position Embeddings via Random Padding [68.75670223005716]
In this paper, we propose a simple but effective strategy, Random Padding, which requires no modifications to existing pre-trained language models.
Experiments show that Random Padding can significantly improve model performance on instances whose answers are located at rear positions (a hedged sketch of the idea follows this entry).
arXiv Detail & Related papers (2023-05-08T17:08:14Z)
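A hedged sketch of the Random Padding entry above, assuming the strategy amounts to splitting the [PAD] tokens randomly between the front and the back of the sequence during fine-tuning, so that answers near the end are seen at a wider range of absolute positions. The function name and padding scheme are interpretations of the abstract, not the paper's code.
```python
import random
from typing import List

def random_padding(input_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Hedged sketch of a Random Padding-style strategy: instead of always
    appending all padding at the end, a random portion is moved to the front,
    so the real tokens appear at varying absolute positions during fine-tuning."""
    n_pad = max_len - len(input_ids)
    if n_pad <= 0:
        return input_ids[:max_len]
    front = random.randint(0, n_pad)  # random split of the padding budget
    return [pad_id] * front + input_ids + [pad_id] * (n_pad - front)

# Example: a 5-token instance padded to length 10 may now start anywhere from
# position 0 to position 5, rather than always at position 0.
print(random_padding([101, 7592, 2088, 999, 102], max_len=10))
```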
- P-Transformer: Towards Better Document-to-Document Neural Machine Translation [34.19199123088232]
We propose a position-aware Transformer (P-Transformer) to enhance both the absolute and relative position information.
P-Transformer can be applied to seq2seq-based document-to-sentence (Doc2Sent) and sentence-to-sentence (Sent2Sent) translation.
arXiv Detail & Related papers (2022-12-12T11:19:05Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel, but surprisingly simple alternative to content reconstruction -- that of predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods (a sketch of the objective follows this entry).
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
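A hedged sketch of the position-prediction pretraining objective summarized above: the encoder receives content embeddings with no positional information and is trained to classify each item's original position. The plain nn.TransformerEncoder, the layer sizes, and the cross-entropy setup are assumptions for illustration, not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class PositionPredictionPretrainer(nn.Module):
    """Hedged sketch: the encoder sees token/patch content *without* position
    encodings and must classify each item's original position."""

    def __init__(self, d_model: int = 256, seq_len: int = 64, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.position_head = nn.Linear(d_model, seq_len)  # one class per position

    def forward(self, content_emb: torch.Tensor) -> torch.Tensor:
        # content_emb: (batch, seq_len, d_model), deliberately with NO position info.
        hidden = self.encoder(content_emb)
        return self.position_head(hidden)  # (batch, seq_len, seq_len) logits

# Training-step sketch: predict the index each item came from.
model = PositionPredictionPretrainer()
x = torch.randn(2, 64, 256)
targets = torch.arange(64).expand(2, 64)  # true positions
loss = nn.CrossEntropyLoss()(model(x).reshape(-1, 64), targets.reshape(-1))
```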
- Rewriter-Evaluator Architecture for Neural Machine Translation [17.45780516143211]
We present a novel architecture, Rewriter-Evaluator, for improving neural machine translation (NMT) models.
It consists of a rewriter and an evaluator. At every pass, the rewriter produces a new translation to improve the past translation and the evaluator estimates the translation quality to decide whether to terminate the rewriting process.
We conduct extensive experiments on two translation tasks, Chinese-English and English-German, and show that the proposed architecture notably improves the performance of NMT models (the rewrite loop is sketched after this entry).
arXiv Detail & Related papers (2020-12-10T02:21:34Z)
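A minimal sketch of the Rewriter-Evaluator control flow described above, assuming a quality threshold and a maximum number of passes as the stopping rule; the callables and hyper-parameters are placeholders, not the paper's models or its actual termination criterion.
```python
from typing import Callable

def rewrite_until_good_enough(
    source: str,
    rewriter: Callable[[str, str], str],     # (source, previous translation) -> new translation
    evaluator: Callable[[str, str], float],  # (source, translation) -> quality estimate
    threshold: float = 0.9,
    max_passes: int = 5,
) -> str:
    """Hedged sketch: iteratively rewrite, score, and stop once the estimated
    quality is judged sufficient."""
    translation = ""  # start from an empty draft
    for _ in range(max_passes):
        translation = rewriter(source, translation)
        if evaluator(source, translation) >= threshold:
            break  # the evaluator decides to terminate the rewriting process
    return translation
```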
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE); the untied attention scores are sketched after this entry.
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
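A hedged, single-head sketch of the "untied" attention scores behind TUPE: content and position keep separate query/key projections and their logits are summed, so no mixed word-to-position correlations enter the score. The scaling constant and the omission of the paper's special [CLS] treatment are simplifications for illustration.
```python
import math
import torch
import torch.nn as nn

class UntiedAttentionScores(nn.Module):
    """Hedged sketch: instead of adding position embeddings to word embeddings
    and projecting the sum, content and position get their own query/key
    projections and their attention logits are summed."""

    def __init__(self, d_model: int):
        super().__init__()
        self.wq, self.wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.pq, self.pk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(2 * d_model)

    def forward(self, words: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # words: (batch, seq, d_model); positions: (seq, d_model)
        content = self.wq(words) @ self.wk(words).transpose(-2, -1)      # word-to-word
        pos = self.pq(positions) @ self.pk(positions).transpose(-2, -1)  # position-to-position
        return (content + pos) * self.scale  # no mixed word-to-position terms
```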
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset (a sketch of the segment-aware position signal follows this entry).
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
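A hedged sketch of a segment-aware position signal in the spirit of Segatron, assuming simple absolute embeddings for paragraph, sentence, and token indices rather than Transformer-XL's relative scheme; the table sizes are illustrative.
```python
import torch
import torch.nn as nn

class SegmentAwarePositionEmbedding(nn.Module):
    """Hedged sketch: combine paragraph-, sentence-, and token-level position
    embeddings instead of relying on a single token index."""

    def __init__(self, d_model: int, max_para: int = 64, max_sent: int = 128, max_tok: int = 512):
        super().__init__()
        self.para = nn.Embedding(max_para, d_model)
        self.sent = nn.Embedding(max_sent, d_model)
        self.tok = nn.Embedding(max_tok, d_model)

    def forward(self, para_idx: torch.Tensor, sent_idx: torch.Tensor, tok_idx: torch.Tensor) -> torch.Tensor:
        # Each index tensor has shape (batch, seq_len).
        return self.para(para_idx) + self.sent(sent_idx) + self.tok(tok_idx)

# Usage: token 7, in sentence 2 of paragraph 0, receives all three signals combined.
```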
- Explicit Reordering for Neural Machine Translation [50.70683739103066]
In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2020-04-08T05:28:46Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns (one such pattern is sketched after this entry).
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
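A minimal sketch of one family of fixed, non-learnable attentive patterns of the kind the summary above describes: each position places all of its attention mass on the token at a fixed relative offset. The concrete pattern set used in the paper may differ; this only illustrates replacing a learned head with a hard-coded distribution.
```python
import torch

def fixed_attention_pattern(seq_len: int, offset: int) -> torch.Tensor:
    """Hedged sketch: every position attends fully to the token at a given
    relative offset (e.g. -1 for the previous token, +1 for the next one)."""
    attn = torch.zeros(seq_len, seq_len)
    for i in range(seq_len):
        j = min(max(i + offset, 0), seq_len - 1)  # clamp at sequence boundaries
        attn[i, j] = 1.0                          # all probability mass on one position
    return attn

# Such matrices would stand in for softmax(QK^T / sqrt(d)) in all but one head
# per encoder layer, e.g. previous-token and next-token heads:
prev_head = fixed_attention_pattern(6, offset=-1)
next_head = fixed_attention_pattern(6, offset=+1)
```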
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.