Long-Short Term Masking Transformer: A Simple but Effective Baseline for
Document-level Neural Machine Translation
- URL: http://arxiv.org/abs/2009.09127v1
- Date: Sat, 19 Sep 2020 00:29:51 GMT
- Title: Long-Short Term Masking Transformer: A Simple but Effective Baseline for
Document-level Neural Machine Translation
- Authors: Pei Zhang, Boxing Chen, Niyu Ge, Kai Fan
- Abstract summary: We study the pros and cons of the standard transformer in document-level translation.
We propose a surprisingly simple long-short term masking self-attention on top of the standard transformer.
The approach achieves strong BLEU results and captures discourse phenomena.
- Score: 28.94748226472447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Many document-level neural machine translation (NMT) systems have
explored the utility of context-aware architectures, usually at the cost of an
increasing number of parameters and higher computational complexity. However,
little attention has been paid to the baseline model. In this paper, we
extensively study the pros and cons of the standard transformer in
document-level translation, and find that its auto-regressive property
simultaneously brings the advantage of consistency and the disadvantage of
error accumulation. We therefore propose a surprisingly simple long-short term
masking self-attention on top of the standard transformer that both
effectively captures long-range dependencies and reduces the propagation of
errors. We examine our approach on two publicly available document-level
datasets, achieving strong BLEU results and capturing discourse phenomena.
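The masking idea is easy to visualize in code. The sketch below is a minimal PyTorch reading of a long-short term masking self-attention layer, assuming the heads are split evenly into "long-term" heads that attend over the whole concatenated context and "short-term" heads whose attention is masked to the current sentence. The even head split, the `sent_ids` interface, and the mask construction are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of a long-short term masking self-attention layer.
# Assumption: half of the heads ("long-term") attend over the full
# concatenated context, while the other half ("short-term") are masked
# to the current sentence only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LongShortMaskedSelfAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        assert d_model % n_heads == 0 and n_heads % 2 == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor, sent_ids: torch.Tensor) -> torch.Tensor:
        # x:        (batch, seq_len, d_model), previous context + current sentence
        # sent_ids: (batch, seq_len), integer sentence index for every token
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (z.reshape(b, t, self.n_heads, self.d_head).transpose(1, 2)
                   for z in (q, k, v))                         # (b, h, t, d_head)

        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (b, h, t, t)

        # Short-term mask: a token may only attend to tokens of its own sentence.
        same_sent = sent_ids.unsqueeze(2) == sent_ids.unsqueeze(1)  # (b, t, t)
        short = scores.masked_fill(~same_sent.unsqueeze(1), float("-inf"))

        # First half of the heads see the full context, second half are masked.
        h_long = self.n_heads // 2
        scores = torch.cat([scores[:, :h_long], short[:, h_long:]], dim=1)

        attn = F.softmax(scores, dim=-1)
        ctx = (attn @ v).transpose(1, 2).reshape(b, t, -1)
        return self.out(ctx)


# Example: a previous sentence of 6 tokens concatenated with a current
# sentence of 10 tokens, batch of 2.
x = torch.randn(2, 16, 512)
sent_ids = torch.tensor([[0] * 6 + [1] * 10] * 2)
out = LongShortMaskedSelfAttention()(x, sent_ids)  # -> (2, 16, 512)
```

In this reading, the short-term heads limit how far an error can propagate across sentence boundaries, while the long-term heads preserve access to cross-sentence context.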
Related papers
- Towards Inducing Document-Level Abilities in Standard Multilingual Neural Machine Translation Models [4.625277907331917]
This work addresses the challenge of transitioning pre-trained NMT models from absolute sinusoidal PEs to relative PEs.
We demonstrate that parameter-efficient fine-tuning, using only a small amount of high-quality data, can successfully facilitate this transition.
We find that a small amount of long-context data in a few languages is sufficient for cross-lingual length generalization.
arXiv Detail & Related papers (2024-08-21T07:23:34Z)
- Long-Range Transformer Architectures for Document Understanding [1.9331361036118608]
Document Understanding (DU) was not left behind, with the first Transformer-based models for DU dating from late 2019.
We introduce 2 new multi-modal (text + layout) long-range models for DU based on efficient implementations of Transformers for long sequences.
Relative 2D attention revealed to be effective on dense text for both normal and long-range models.
arXiv Detail & Related papers (2023-09-11T14:45:24Z)
- Attention over pre-trained Sentence Embeddings for Long Document Classification [4.38566347001872]
Transformers are often limited to short sequences due to their attention complexity, which is quadratic in the number of tokens.
We suggest taking advantage of pre-trained sentence transformers to start from semantically meaningful embeddings of the individual sentences.
We report the results obtained by this simple architecture on three standard document classification datasets.
arXiv Detail & Related papers (2023-07-18T09:06:35Z)
- Improving Long Context Document-Level Machine Translation [51.359400776242786]
Document-level context for neural machine translation (NMT) is crucial to improve translation consistency and cohesion.
Many works have been published on the topic of document-level NMT, but most restrict the system to just local context.
We propose a constrained attention variant that focuses the attention on the most relevant parts of the sequence, while simultaneously reducing the memory consumption.
arXiv Detail & Related papers (2023-06-08T13:28:48Z)
- Modeling Context With Linear Attention for Scalable Document-Level Translation [72.41955536834702]
We investigate the efficacy of a recent linear attention model on document translation and augment it with a sentential gate to promote a recency inductive bias.
We show that sentential gating further improves translation quality on IWSLT.
arXiv Detail & Related papers (2022-10-16T03:41:50Z)
- Learn To Remember: Transformer with Recurrent Memory for Document-Level Machine Translation [14.135048254120615]
We introduce a recurrent memory unit into the vanilla Transformer, which supports information exchange between the current sentence and the previous context.
We conduct experiments on three popular datasets for document-level machine translation and our model has an average improvement of 0.91 s-BLEU over the sentence-level baseline.
arXiv Detail & Related papers (2022-05-03T14:55:53Z)
- HETFORMER: Heterogeneous Transformer with Sparse Attention for Long-Text Extractive Summarization [57.798070356553936]
HETFORMER is a Transformer-based pre-trained model with multi-granularity sparse attentions for extractive summarization.
Experiments on both single- and multi-document summarization tasks show that HETFORMER achieves state-of-the-art performance in ROUGE F1.
arXiv Detail & Related papers (2021-10-12T22:42:31Z)
- Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer.
arXiv Detail & Related papers (2021-04-08T02:18:47Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- Rethinking Document-level Neural Machine Translation [73.42052953710605]
We try to answer the question: Is the capacity of current models strong enough for document-level translation?
We observe that the original Transformer with appropriate training techniques can achieve strong results for document translation, even with a length of 2000 words.
arXiv Detail & Related papers (2020-10-18T11:18:29Z)
- Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z)