Transformer++
- URL: http://arxiv.org/abs/2003.04974v1
- Date: Mon, 2 Mar 2020 13:00:16 GMT
- Title: Transformer++
- Authors: Prakhar Thapak and Prodip Hore
- Abstract summary: The Transformer, which relies solely on the attention mechanism, achieved state-of-the-art results in sequence modeling.
We propose a new way of learning dependencies in multi-head attention through an intermediate context computed with convolution.
This new form of multi-head attention, used alongside the traditional form, achieves better results than the Transformer on the WMT 2014 English-to-German and English-to-French translation tasks.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in attention mechanisms have replaced recurrent neural
networks and their variants for machine translation tasks. The Transformer,
using the attention mechanism alone, achieved state-of-the-art results in
sequence modeling. Neural machine translation based on the attention mechanism is
parallelizable and addresses the problem of handling long-range dependencies
among words in sentences more effectively than recurrent neural networks. One
of the key concepts in attention is to learn three matrices, query, key, and
value, where global dependencies among words are learned through linearly
projecting word embeddings through these matrices. Multiple sets of query, key,
and value matrices can be learned simultaneously, each focusing on a different
subspace of the embedding dimension; this is called multi-head attention in the Transformer. We argue that
certain dependencies among words could be learned better through an
intermediate context than directly modeling word-word dependencies. This could
happen due to the nature of certain dependencies, or a lack of patterns that
makes them difficult to model globally using multi-head self-attention. In this
work, we propose a new way of learning dependencies in multi-head attention
through an intermediate context computed with convolution. This new form of
multi-head attention, along with the traditional form, achieves better results than the Transformer on the WMT 2014
English-to-German and English-to-French translation tasks. We also introduce a
framework to learn POS tagging and NER information during the training of the
encoder, which further improves results, achieving a new state-of-the-art of 32.1
BLEU (better than the existing best by 1.4 BLEU) on the WMT 2014 English-to-German
and 44.6 BLEU (better than the existing best by 1.1 BLEU) on the WMT 2014
English-to-French translation tasks. We call this Transformer++.
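The abstract does not give the exact formulation of the convolutional context, so the following is only a minimal PyTorch sketch of how a layer mixing traditional scaled dot-product heads with convolutional-context heads might look. The class names, kernel size, and the choice to derive keys from a convolved context are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch: alongside standard query/key/value attention heads, some heads
# first derive an intermediate "context" via 1-D convolution over the word
# embeddings and attend through that context instead of modeling word-word
# dependencies directly. All details below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvContextAttentionHead(nn.Module):
    """One head that attends through a convolutional context (assumed form)."""

    def __init__(self, d_model: int, d_head: int, kernel_size: int = 3):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        # Keys come from a locally convolved "context" of the input embeddings.
        self.context_conv = nn.Conv1d(
            d_model, d_head, kernel_size, padding=kernel_size // 2
        )
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        q = self.q_proj(x)                                            # (B, T, d_head)
        v = self.v_proj(x)                                            # (B, T, d_head)
        ctx = self.context_conv(x.transpose(1, 2)).transpose(1, 2)    # (B, T, d_head)
        scores = torch.matmul(q, ctx.transpose(1, 2)) * self.scale    # (B, T, T)
        return torch.matmul(F.softmax(scores, dim=-1), v)             # (B, T, d_head)


class StandardAttentionHead(nn.Module):
    """Ordinary scaled dot-product head with query, key, and value projections."""

    def __init__(self, d_model: int, d_head: int):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head)
        self.k_proj = nn.Linear(d_model, d_head)
        self.v_proj = nn.Linear(d_model, d_head)
        self.scale = d_head ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        attn = F.softmax(torch.matmul(q, k.transpose(1, 2)) * self.scale, dim=-1)
        return torch.matmul(attn, v)


class MixedMultiHeadAttention(nn.Module):
    """Multi-head layer combining traditional heads with convolutional-context
    heads, since the abstract says both forms are used together."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_conv_heads: int = 4):
        super().__init__()
        d_head = d_model // n_heads
        self.heads = nn.ModuleList(
            [ConvContextAttentionHead(d_model, d_head) for _ in range(n_conv_heads)]
            + [StandardAttentionHead(d_model, d_head) for _ in range(n_heads - n_conv_heads)]
        )
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out_proj(torch.cat([h(x) for h in self.heads], dim=-1))


if __name__ == "__main__":
    x = torch.randn(2, 10, 512)   # (batch, seq_len, d_model)
    print(MixedMultiHeadAttention()(x).shape)   # torch.Size([2, 10, 512])
```

The split between conventional and convolutional-context heads (here 4 and 4) is arbitrary; the point is only that the two head types can be concatenated and projected exactly like ordinary multi-head attention.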
Related papers
- Pointer-Generator Networks for Low-Resource Machine Translation: Don't Copy That! [13.120825574589437]
We show that Transformer-based neural machine translation (NMT) is very effective in high-resource settings.
We show that the model does not show greater improvements for closely-related vs. more distant language pairs.
Our discussion of the reasons for this behaviour highlights several general challenges for LR NMT.
arXiv Detail & Related papers (2024-03-16T16:17:47Z) - Extending Multilingual Machine Translation through Imitation Learning [60.15671816513614]
Imit-MNMT treats the task as an imitation learning process, which mimics the behavior of an expert.
We show that our approach significantly improves the translation performance between the new and the original languages.
We also demonstrate that our approach is capable of solving copy and off-target problems.
arXiv Detail & Related papers (2023-11-14T21:04:03Z) - Pre-Training a Graph Recurrent Network for Language Representation [34.4554387894105]
We consider a graph recurrent network for language model pre-training, which builds a graph structure for each sequence with local token-level communications.
We find that our model can generate more diverse outputs with less contextualized feature redundancy than existing attention-based models.
arXiv Detail & Related papers (2022-09-08T14:12:15Z) - Language Modeling, Lexical Translation, Reordering: The Training Process
of NMT through the Lens of Classical SMT [64.1841519527504]
Neural machine translation uses a single neural network to model the entire translation process.
Despite neural machine translation being the de facto standard, it is still not clear how NMT models acquire different competences over the course of training.
arXiv Detail & Related papers (2021-09-03T09:38:50Z) - Revisiting Simple Neural Probabilistic Language Models [27.957834093475686]
This paper revisits the neural probabilistic language model (NPLM) of Bengio et al. (2003).
When scaled up to modern hardware, this model performs much better than expected on word-level language model benchmarks.
Inspired by this result, we modify the Transformer by replacing its first self-attention layer with the NPLM's local concatenation layer (a minimal sketch of such a layer appears after this list).
arXiv Detail & Related papers (2021-04-08T02:18:47Z) - VECO: Variable and Flexible Cross-lingual Pre-training for Language
Understanding and Generation [77.82373082024934]
We plug a cross-attention module into the Transformer encoder to explicitly build the interdependence between languages.
It can effectively avoid the degeneration of predicting masked words only conditioned on the context in its own language.
The proposed cross-lingual model delivers new state-of-the-art results on various cross-lingual understanding tasks of the XTREME benchmark.
arXiv Detail & Related papers (2020-10-30T03:41:38Z) - Dynamic Context-guided Capsule Network for Multimodal Machine
Translation [131.37130887834667]
Multimodal machine translation (MMT) mainly focuses on enhancing text-only translation with visual features.
We propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT.
Experimental results on the Multi30K dataset of English-to-German and English-to-French translation demonstrate the superiority of DCCN.
arXiv Detail & Related papers (2020-09-04T06:18:24Z) - Learning Source Phrase Representations for Neural Machine Translation [65.94387047871648]
We propose an attentive phrase representation generation mechanism which is able to generate phrase representations from corresponding token representations.
In our experiments, we obtain significant improvements on the WMT 14 English-German and English-French tasks on top of the strong Transformer baseline.
arXiv Detail & Related papers (2020-06-25T13:43:11Z) - Fixed Encoder Self-Attention Patterns in Transformer-Based Machine
Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple, fixed (non-learnable) attentive patterns; a minimal sketch of this idea also appears after this list.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)
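As flagged above under "Revisiting Simple Neural Probabilistic Language Models", here is a hedged sketch of what a local concatenation layer in the spirit of the NPLM could look like: each position concatenates the embeddings of the current and a few preceding tokens and projects them back to the model dimension. The window size, padding scheme, and class name are assumptions, not that paper's code.

```python
# Hypothetical local concatenation layer: a strictly local, attention-free
# replacement for a Transformer's first self-attention layer.
import torch
import torch.nn as nn


class LocalConcatLayer(nn.Module):
    def __init__(self, d_model: int = 512, window: int = 4):
        super().__init__()
        self.window = window
        self.proj = nn.Linear(window * d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); left-pad so position t sees tokens
        # t - window + 1 .. t only.
        pad = x.new_zeros(x.size(0), self.window - 1, x.size(2))
        padded = torch.cat([pad, x], dim=1)
        windows = padded.unfold(1, self.window, 1)      # (B, T, d_model, window)
        windows = windows.transpose(2, 3).flatten(2)    # (B, T, window * d_model)
        return self.proj(windows)


if __name__ == "__main__":
    layer = LocalConcatLayer()
    print(layer(torch.randn(2, 10, 512)).shape)         # torch.Size([2, 10, 512])
```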
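Similarly, for "Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation", the sketch below shows one possible fixed, non-learnable attention pattern (each position attending to its previous token). The concrete patterns used in that paper are not given in the summary above, so this is illustrative only.

```python
# Hypothetical fixed attention head: the attention matrix is a hard-coded,
# non-learnable pattern rather than a learned query-key interaction.
import torch


def fixed_previous_token_attention(values: torch.Tensor) -> torch.Tensor:
    """values: (batch, seq_len, d_head). Each position copies the representation
    of the token to its left (position 0 attends to itself)."""
    batch, seq_len, _ = values.shape
    attn = torch.zeros(seq_len, seq_len, device=values.device)
    idx = torch.arange(seq_len, device=values.device)
    attn[idx, torch.clamp(idx - 1, min=0)] = 1.0        # shifted identity, no weights
    return torch.einsum("st,btd->bsd", attn, values)


if __name__ == "__main__":
    v = torch.randn(2, 5, 64)
    out = fixed_previous_token_attention(v)
    print(torch.allclose(out[:, 1], v[:, 0]))           # True: position 1 copies position 0
```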