Dynamic Position Encoding for Transformers
- URL: http://arxiv.org/abs/2204.08142v1
- Date: Mon, 18 Apr 2022 03:08:48 GMT
- Title: Dynamic Position Encoding for Transformers
- Authors: Joyce Zheng, Mehdi Rezagholizadeh, Peyman Passban
- Abstract summary: Recurrent models have been dominating the field of neural machine translation (NMT) for the past few years.
Transformers could fail to properly encode sequential/positional information due to their non-recurrent nature.
We propose a novel architecture with new position embeddings depending on the input text to address this shortcoming.
- Score: 18.315954297959617
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recurrent models have been dominating the field of neural machine translation
(NMT) for the past few years. Transformers (Vaswani et al., 2017) have
radically changed it by proposing a novel architecture that relies on a
feed-forward backbone and self-attention mechanism. Although Transformers are
powerful, they could fail to properly encode sequential/positional information
due to their non-recurrent nature. To solve this problem, position embeddings
are defined exclusively for each time step to enrich word information. However,
such embeddings are fixed after training regardless of the task and the word
ordering system of the source or target language.
In this paper, we propose a novel architecture with new position embeddings
depending on the input text to address this shortcoming by taking the order of
target words into consideration. Instead of using predefined position
embeddings, our solution *generates* new embeddings to refine each
word's position information. Since we do not dictate the position of source
tokens and learn them in an end-to-end fashion, we refer to our method as
*dynamic* position encoding (DPE). We evaluated the impact of our model
on multiple datasets to translate from English into German, French, and Italian
and observed meaningful improvements in comparison to the original Transformer.
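To make the idea concrete, here is a minimal, hypothetical PyTorch sketch of input-dependent position information: a fixed sinusoidal signal is refined by a small learned generator conditioned on the word embeddings. The layer sizes, the single feed-forward generator, and the residual combination are illustrative assumptions, not the architecture reported in the paper.
```python
import math
import torch
import torch.nn as nn

class DynamicPositionEncoding(nn.Module):
    """Illustrative sketch: position information is *generated* from the token
    embeddings instead of being read from a fixed table.
    (Hypothetical layer sizes; not the architecture from the paper.)"""

    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        # A conventional fixed sinusoidal table, used only as a starting point.
        pe = torch.zeros(max_len, d_model)
        pos = torch.arange(max_len, dtype=torch.float).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)
        # Learned generator that refines the position signal from the words themselves.
        self.generator = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )

    def forward(self, word_emb: torch.Tensor) -> torch.Tensor:
        # word_emb: (batch, seq_len, d_model)
        seq_len = word_emb.size(1)
        fixed = self.pe[:seq_len].unsqueeze(0)      # (1, seq_len, d_model)
        dynamic = self.generator(word_emb + fixed)  # input-dependent refinement
        return word_emb + fixed + dynamic           # enriched word representation

# Usage sketch: emb = nn.Embedding(32000, 512); x = emb(token_ids)
# x = DynamicPositionEncoding(512)(x)  # then fed to the Transformer encoder
```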
Related papers
- A Frustratingly Easy Improvement for Position Embeddings via Random Padding [68.75670223005716]
In this paper, we propose a simple but effective strategy, Random Padding, which requires no modifications to existing pre-trained language models.
Experiments show that Random Padding can significantly improve model performance on instances whose answers are located at rear positions (a hedged sketch of the idea follows this entry).
arXiv Detail & Related papers (2023-05-08T17:08:14Z)
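A hedged sketch of the Random Padding entry above, assuming the strategy amounts to splitting the [PAD] tokens randomly between the front and the back of the sequence during fine-tuning, so that answers near the end are seen at a wider range of absolute positions. The function name and padding scheme are interpretations of the abstract, not the paper's code.
```python
import random
from typing import List

def random_padding(input_ids: List[int], max_len: int, pad_id: int = 0) -> List[int]:
    """Hedged sketch of a Random Padding-style strategy: instead of always
    appending all padding at the end, a random portion is moved to the front,
    so the real tokens appear at varying absolute positions during fine-tuning."""
    n_pad = max_len - len(input_ids)
    if n_pad <= 0:
        return input_ids[:max_len]
    front = random.randint(0, n_pad)  # random split of the padding budget
    return [pad_id] * front + input_ids + [pad_id] * (n_pad - front)

# Example: a 5-token instance padded to length 10 may now start anywhere from
# position 0 to position 5, rather than always at position 0.
print(random_padding([101, 7592, 2088, 999, 102], max_len=10))
```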
- P-Transformer: Towards Better Document-to-Document Neural Machine Translation [34.19199123088232]
We propose a position-aware Transformer (P-Transformer) to enhance both the absolute and relative position information.
P-Transformer can be applied to seq2seq-based document-to-sentence (Doc2Sent) and sentence-to-sentence (Sent2Sent) translation.
arXiv Detail & Related papers (2022-12-12T11:19:05Z)
- Word Order Matters when you Increase Masking [70.29624135819884]
We study the effect of removing position encodings on the pre-training objective itself, to test whether models can reconstruct position information from co-occurrences alone.
We find that the necessity of position information increases with the amount of masking, and that masked language models without position encodings are not able to reconstruct this information on the task.
arXiv Detail & Related papers (2022-11-08T18:14:04Z)
- Position Prediction as an Effective Pretraining Strategy [20.925906203643883]
We propose a novel, but surprisingly simple alternative to content reconstruction -- that of predicting locations from content, without providing positional information for it.
Our approach brings improvements over strong supervised training baselines and is comparable to modern unsupervised/self-supervised pretraining methods (a sketch of the objective follows this entry).
arXiv Detail & Related papers (2022-07-15T17:10:48Z)
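A hedged sketch of the position-prediction pretraining objective summarized above: the encoder receives content embeddings with no positional information and is trained to classify each item's original position. The plain nn.TransformerEncoder, the layer sizes, and the cross-entropy setup are assumptions for illustration, not the paper's exact configuration.
```python
import torch
import torch.nn as nn

class PositionPredictionPretrainer(nn.Module):
    """Hedged sketch: the encoder sees token/patch content *without* position
    encodings and must classify each item's original position."""

    def __init__(self, d_model: int = 256, seq_len: int = 64, n_layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.position_head = nn.Linear(d_model, seq_len)  # one class per position

    def forward(self, content_emb: torch.Tensor) -> torch.Tensor:
        # content_emb: (batch, seq_len, d_model), deliberately with NO position info.
        hidden = self.encoder(content_emb)
        return self.position_head(hidden)  # (batch, seq_len, seq_len) logits

# Training-step sketch: predict the index each item came from.
model = PositionPredictionPretrainer()
x = torch.randn(2, 64, 256)
targets = torch.arange(64).expand(2, 64)  # true positions
loss = nn.CrossEntropyLoss()(model(x).reshape(-1, 64), targets.reshape(-1))
```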
- Rewriter-Evaluator Architecture for Neural Machine Translation [17.45780516143211]
We present a novel architecture, Rewriter-Evaluator, for improving neural machine translation (NMT) models.
It consists of a rewriter and an evaluator. At every pass, the rewriter produces a new translation to improve the past translation and the evaluator estimates the translation quality to decide whether to terminate the rewriting process.
We conduct extensive experiments on two translation tasks, Chinese-English and English-German, and show that the proposed architecture notably improves the performance of NMT models (the rewrite loop is sketched after this entry).
arXiv Detail & Related papers (2020-12-10T02:21:34Z)
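A minimal sketch of the Rewriter-Evaluator control flow described above, assuming a quality threshold and a maximum number of passes as the stopping rule; the callables and hyper-parameters are placeholders, not the paper's models or its actual termination criterion.
```python
from typing import Callable

def rewrite_until_good_enough(
    source: str,
    rewriter: Callable[[str, str], str],     # (source, previous translation) -> new translation
    evaluator: Callable[[str, str], float],  # (source, translation) -> quality estimate
    threshold: float = 0.9,
    max_passes: int = 5,
) -> str:
    """Hedged sketch: iteratively rewrite, score, and stop once the estimated
    quality is judged sufficient."""
    translation = ""  # start from an empty draft
    for _ in range(max_passes):
        translation = rewriter(source, translation)
        if evaluator(source, translation) >= threshold:
            break  # the evaluator decides to terminate the rewriting process
    return translation
```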
- Rethinking Positional Encoding in Language Pre-training [111.2320727291926]
We show that in absolute positional encoding, the addition operation applied on positional embeddings and word embeddings brings mixed correlations.
We propose a new positional encoding method called Transformer with Untied Positional Encoding (TUPE); the untied attention scores are sketched after this entry.
arXiv Detail & Related papers (2020-06-28T13:11:02Z)
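A hedged, single-head sketch of the "untied" attention scores behind TUPE: content and position keep separate query/key projections and their logits are summed, so no mixed word-to-position correlations enter the score. The scaling constant and the omission of the paper's special [CLS] treatment are simplifications for illustration.
```python
import math
import torch
import torch.nn as nn

class UntiedAttentionScores(nn.Module):
    """Hedged sketch: instead of adding position embeddings to word embeddings
    and projecting the sum, content and position get their own query/key
    projections and their attention logits are summed."""

    def __init__(self, d_model: int):
        super().__init__()
        self.wq, self.wk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.pq, self.pk = nn.Linear(d_model, d_model), nn.Linear(d_model, d_model)
        self.scale = 1.0 / math.sqrt(2 * d_model)

    def forward(self, words: torch.Tensor, positions: torch.Tensor) -> torch.Tensor:
        # words: (batch, seq, d_model); positions: (seq, d_model)
        content = self.wq(words) @ self.wk(words).transpose(-2, -1)      # word-to-word
        pos = self.pq(positions) @ self.pk(positions).transpose(-2, -1)  # position-to-position
        return (content + pos) * self.scale  # no mixed word-to-position terms
```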
- Segatron: Segment-Aware Transformer for Language Modeling and Understanding [79.84562707201323]
We propose a segment-aware Transformer (Segatron) to generate better contextual representations from sequential tokens.
We first introduce the segment-aware mechanism to Transformer-XL, which is a popular Transformer-based language model.
We find that our method can further improve the Transformer-XL base model and large model, achieving 17.1 perplexity on the WikiText-103 dataset (a sketch of the segment-aware position signal follows this entry).
arXiv Detail & Related papers (2020-04-30T17:38:27Z)
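A hedged sketch of a segment-aware position signal in the spirit of Segatron, assuming simple absolute embeddings for paragraph, sentence, and token indices rather than Transformer-XL's relative scheme; the table sizes are illustrative.
```python
import torch
import torch.nn as nn

class SegmentAwarePositionEmbedding(nn.Module):
    """Hedged sketch: combine paragraph-, sentence-, and token-level position
    embeddings instead of relying on a single token index."""

    def __init__(self, d_model: int, max_para: int = 64, max_sent: int = 128, max_tok: int = 512):
        super().__init__()
        self.para = nn.Embedding(max_para, d_model)
        self.sent = nn.Embedding(max_sent, d_model)
        self.tok = nn.Embedding(max_tok, d_model)

    def forward(self, para_idx: torch.Tensor, sent_idx: torch.Tensor, tok_idx: torch.Tensor) -> torch.Tensor:
        # Each index tensor has shape (batch, seq_len).
        return self.para(para_idx) + self.sent(sent_idx) + self.tok(tok_idx)

# Usage: token 7, in sentence 2 of paragraph 0, receives all three signals combined.
```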
- Explicit Reordering for Neural Machine Translation [50.70683739103066]
In Transformer-based neural machine translation (NMT), the positional encoding mechanism helps the self-attention networks to learn the source representation with order dependency.
We propose a novel reordering method to explicitly model this reordering information for the Transformer-based NMT.
The empirical results on the WMT14 English-to-German, WAT ASPEC Japanese-to-English, and WMT17 Chinese-to-English translation tasks show the effectiveness of the proposed approach.
arXiv Detail & Related papers (2020-04-08T05:28:46Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns (one such pattern is sketched after this entry).
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
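A minimal sketch of one family of fixed, non-learnable attentive patterns of the kind the summary above describes: each position places all of its attention mass on the token at a fixed relative offset. The concrete pattern set used in the paper may differ; this only illustrates replacing a learned head with a hard-coded distribution.
```python
import torch

def fixed_attention_pattern(seq_len: int, offset: int) -> torch.Tensor:
    """Hedged sketch: every position attends fully to the token at a given
    relative offset (e.g. -1 for the previous token, +1 for the next one)."""
    attn = torch.zeros(seq_len, seq_len)
    for i in range(seq_len):
        j = min(max(i + offset, 0), seq_len - 1)  # clamp at sequence boundaries
        attn[i, j] = 1.0                          # all probability mass on one position
    return attn

# Such matrices would stand in for softmax(QK^T / sqrt(d)) in all but one head
# per encoder layer, e.g. previous-token and next-token heads:
prev_head = fixed_attention_pattern(6, offset=-1)
next_head = fixed_attention_pattern(6, offset=+1)
```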
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.