Multi-Unit Transformers for Neural Machine Translation
- URL: http://arxiv.org/abs/2010.10743v2
- Date: Fri, 23 Oct 2020 11:33:45 GMT
- Title: Multi-Unit Transformers for Neural Machine Translation
- Authors: Jianhao Yan, Fandong Meng, Jie Zhou
- Abstract summary: We propose the Multi-Unit Transformers (MUTE) to promote the expressiveness of the Transformer.
Specifically, we use several parallel units and show that modeling with multiple units improves model performance and introduces diversity.
- Score: 51.418245676894465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models achieve remarkable success in Neural Machine Translation.
Many efforts have been devoted to deepening the Transformer by stacking several
units (i.e., a combination of Multihead Attentions and FFN) in a cascade, while
the investigation over multiple parallel units draws little attention. In this
paper, we propose the Multi-Unit Transformers (MUTE), which aim to promote the
expressiveness of the Transformer by introducing diverse and complementary
units. Specifically, we use several parallel units and show that modeling with
multiple units improves model performance and introduces diversity. Further, to
better leverage the advantage of the multi-unit setting, we design a biased
module and a sequential dependency that guide and encourage complementariness
among different units. Experimental results on three machine translation tasks,
the NIST Chinese-to-English, WMT'14 English-to-German and WMT'18
Chinese-to-English, show that the MUTE models significantly outperform the
Transformer-Base, by up to +1.52, +1.90 and +1.10 BLEU points, with only a mild
drop in inference speed (about 3.1%). In addition, our methods also surpass the
Transformer-Big model, with only 54% of its parameters. These results
demonstrate the effectiveness of the MUTE, as well as its efficiency in both
the inference process and parameter usage.
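As a rough illustration of the multi-unit idea described in the abstract, the PyTorch sketch below runs several parallel units (each a multi-head self-attention block followed by an FFN) over the same input and averages their outputs. This is a minimal sketch under simplifying assumptions: all class and parameter names are hypothetical, and the paper's biased module and sequential dependency mechanisms are not reproduced.
```python
# Minimal sketch of a multi-unit Transformer layer (hypothetical names).
# Each "unit" is a standard self-attention + FFN block; several units run
# in parallel over the same input and their outputs are averaged. The
# paper's biased module and sequential dependency are NOT modeled here.
import torch
import torch.nn as nn


class TransformerUnit(nn.Module):
    """One unit: multi-head self-attention followed by a feed-forward network."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)              # self-attention over the sequence
        x = self.norm1(x + self.dropout(attn_out))    # residual + layer norm
        x = self.norm2(x + self.dropout(self.ffn(x))) # FFN sub-layer
        return x


class MultiUnitLayer(nn.Module):
    """Runs several parallel units and averages their outputs (simplified view)."""

    def __init__(self, n_units: int = 3, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.units = nn.ModuleList(
            [TransformerUnit(d_model, n_heads, d_ff) for _ in range(n_units)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every unit sees the same input; a simple mean combines the unit outputs.
        return torch.stack([unit(x) for unit in self.units], dim=0).mean(dim=0)


if __name__ == "__main__":
    layer = MultiUnitLayer()
    tokens = torch.randn(2, 10, 512)  # (batch, sequence, d_model)
    print(layer(tokens).shape)        # torch.Size([2, 10, 512])
```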
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of its parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Flash STU: Fast Spectral Transform Units [19.889367504937177]
This paper describes an efficient, open source PyTorch implementation of the Spectral Transform Unit.
We investigate sequence prediction tasks over several modalities including language, robotics, and simulated dynamical systems.
arXiv Detail & Related papers (2024-09-16T17:22:34Z)
- Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation [47.82947878753809]
We investigate the effectiveness of integrating an increasing number of heterogeneous methods.
Based on a simple combination strategy and performance-driven synergy criteria, we designed the Multi-Encoder Transformer.
Results showcased that our approach can improve the quality of the translation across a variety of languages and dataset sizes.
arXiv Detail & Related papers (2023-12-26T03:39:08Z)
- Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision [45.69716658698776]
In this paper, we trace the difficulty of low-bit quantization-aware training for transformers to their unique variation behaviors.
We propose a variation-aware quantization scheme for both vision and language transformers.
Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% and 1.4%, respectively.
arXiv Detail & Related papers (2023-07-01T13:01:39Z)
- GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation [107.2752114891855]
The Transformer structure, built by stacking a sequence of encoder and decoder layers, has driven significant progress in neural machine translation.
We propose the Group-Transformer model (GTrans), which flexibly divides the multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words (a minimal sketch of this group-and-fuse idea appears after this list).
arXiv Detail & Related papers (2022-07-29T04:10:36Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences, which allows them to produce long, coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales and share parameters among them.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
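For the GTrans entry above, the group-and-fuse idea can be sketched as follows: the per-layer hidden states of a stack are split into contiguous groups, each group is pooled, and the group features are mixed with learned weights before generating target words. This is a hedged sketch rather than the authors' implementation; the class name GroupFusion, the contiguous grouping, the mean pooling, and the softmax-weighted fusion are all assumptions.
```python
# Rough sketch of "group multi-layer representations, then fuse" (hypothetical
# names; contiguous chunking and softmax-weighted fusion are assumptions, not
# the paper's exact design).
import torch
import torch.nn as nn


class GroupFusion(nn.Module):
    def __init__(self, n_layers: int, n_groups: int):
        super().__init__()
        assert n_layers % n_groups == 0, "assume layers divide evenly into groups"
        self.n_groups = n_groups
        self.group_size = n_layers // n_groups
        self.fusion_weights = nn.Parameter(torch.zeros(n_groups))  # learned mixing weights

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, d_model) tensors, one per layer.
        stacked = torch.stack(layer_outputs, dim=0)                           # (L, B, S, D)
        groups = stacked.view(self.n_groups, self.group_size, *stacked.shape[1:])
        group_feats = groups.mean(dim=1)                                      # (G, B, S, D)
        weights = torch.softmax(self.fusion_weights, dim=0)                   # (G,)
        return torch.einsum("g,gbsd->bsd", weights, group_feats)              # fused (B, S, D)


if __name__ == "__main__":
    fusion = GroupFusion(n_layers=6, n_groups=3)
    layer_states = [torch.randn(2, 10, 512) for _ in range(6)]
    print(fusion(layer_states).shape)  # torch.Size([2, 10, 512])
```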