Multi-Unit Transformers for Neural Machine Translation
- URL: http://arxiv.org/abs/2010.10743v2
- Date: Fri, 23 Oct 2020 11:33:45 GMT
- Title: Multi-Unit Transformers for Neural Machine Translation
- Authors: Jianhao Yan, Fandong Meng, Jie Zhou
- Abstract summary: We propose the Multi-Unit Transformers (MUTE) to promote the expressiveness of the Transformer.
Specifically, we use several parallel units and show that modeling with multiple units improves model performance and introduces diversity.
- Score: 51.418245676894465
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer models achieve remarkable success in Neural Machine Translation.
Many efforts have been devoted to deepening the Transformer by stacking several
units (i.e., a combination of Multihead Attentions and FFN) in a cascade, while
the investigation over multiple parallel units draws little attention. In this
paper, we propose the Multi-Unit Transformers (MUTE), which aim to promote the
expressiveness of the Transformer by introducing diverse and complementary
units. Specifically, we use several parallel units and show that modeling with
multiple units improves model performance and introduces diversity. Further, to
better leverage the advantage of the multi-unit setting, we design a biased
module and a sequential dependency that guide and encourage complementariness
among different units. Experimental results on three machine translation tasks,
the NIST Chinese-to-English, WMT'14 English-to-German and WMT'18
Chinese-to-English, show that the MUTE models significantly outperform the
Transformer-Base, by up to +1.52, +1.90 and +1.10 BLEU points, with only a mild
drop in inference speed (about 3.1%). In addition, our methods also surpass the
Transformer-Big model, with only 54% of its parameters. These results
demonstrate the effectiveness of the MUTE, as well as its efficiency in both
the inference process and parameter usage.
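As a rough illustration of the multi-unit idea described in the abstract, the PyTorch sketch below runs several parallel units (each a multi-head self-attention block followed by an FFN) over the same input and averages their outputs. This is a minimal sketch under simplifying assumptions: all class and parameter names are hypothetical, and the paper's biased module and sequential dependency mechanisms are not reproduced.
```python
# Minimal sketch of a multi-unit Transformer layer (hypothetical names).
# Each "unit" is a standard self-attention + FFN block; several units run
# in parallel over the same input and their outputs are averaged. The
# paper's biased module and sequential dependency are NOT modeled here.
import torch
import torch.nn as nn


class TransformerUnit(nn.Module):
    """One unit: multi-head self-attention followed by a feed-forward network."""

    def __init__(self, d_model: int, n_heads: int, d_ff: int, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn_out, _ = self.attn(x, x, x)              # self-attention over the sequence
        x = self.norm1(x + self.dropout(attn_out))    # residual + layer norm
        x = self.norm2(x + self.dropout(self.ffn(x))) # FFN sub-layer
        return x


class MultiUnitLayer(nn.Module):
    """Runs several parallel units and averages their outputs (simplified view)."""

    def __init__(self, n_units: int = 3, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.units = nn.ModuleList(
            [TransformerUnit(d_model, n_heads, d_ff) for _ in range(n_units)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Every unit sees the same input; a simple mean combines the unit outputs.
        return torch.stack([unit(x) for unit in self.units], dim=0).mean(dim=0)


if __name__ == "__main__":
    layer = MultiUnitLayer()
    tokens = torch.randn(2, 10, 512)  # (batch, sequence, d_model)
    print(layer(tokens).shape)        # torch.Size([2, 10, 512])
```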
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of its parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z)
- Flash STU: Fast Spectral Transform Units [19.889367504937177]
This paper describes an efficient, open source PyTorch implementation of the Spectral Transform Unit.
We investigate sequence prediction tasks over several modalities including language, robotics, and simulated dynamical systems.
arXiv Detail & Related papers (2024-09-16T17:22:34Z)
- Heterogeneous Encoders Scaling In The Transformer For Neural Machine Translation [47.82947878753809]
We investigate the effectiveness of integrating an increasing number of heterogeneous methods.
Based on a simple combination strategy and performance-driven synergy criteria, we designed the Multi-Encoder Transformer.
Results showcased that our approach can improve the quality of the translation across a variety of languages and dataset sizes.
arXiv Detail & Related papers (2023-12-26T03:39:08Z)
- Quantization Variation: A New Perspective on Training Transformers with Low-Bit Precision [45.69716658698776]
In this paper, we trace the difficulty of low-bit quantization-aware training for transformers to their unique variation behaviors.
We propose a variation-aware quantization scheme for both vision and language transformers.
Our solution substantially improves the 2-bit Swin-T and binary BERT-base, achieving accuracy improvements of 3.35% and 1.4%, respectively.
arXiv Detail & Related papers (2023-07-01T13:01:39Z)
- GTrans: Grouping and Fusing Transformer Layers for Neural Machine Translation [107.2752114891855]
The Transformer structure, built by stacking a sequence of encoder and decoder layers, has driven significant progress in neural machine translation.
We propose the Group-Transformer model (GTrans), which flexibly divides the multi-layer representations of both encoder and decoder into different groups and then fuses these group features to generate target words (a minimal sketch of this group-and-fuse idea appears after this list).
arXiv Detail & Related papers (2022-07-29T04:10:36Z)
- Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks [126.33843752332139]
We introduce Group-wise Transformation towards a universal yet lightweight Transformer for vision-and-language tasks, termed LW-Transformer.
We apply LW-Transformer to a set of Transformer-based networks, and quantitatively measure them on three vision-and-language tasks and six benchmark datasets.
Experimental results show that while saving a large number of parameters and computations, LW-Transformer achieves very competitive performance against the original Transformer networks for vision-and-language tasks.
arXiv Detail & Related papers (2022-04-16T11:30:26Z)
- Hierarchical Transformers Are More Efficient Language Models [19.061388006885686]
Transformer models yield impressive results on many NLP and sequence modeling tasks.
Remarkably, Transformers can handle long sequences, which allows them to produce long, coherent outputs.
We postulate that having an explicit hierarchical architecture is the key to Transformers that efficiently handle long sequences.
arXiv Detail & Related papers (2021-10-26T14:00:49Z)
- Scalable Transformers for Neural Machine Translation [86.4530299266897]
Transformer has been widely adopted in Neural Machine Translation (NMT) because of its large capacity and parallel training of sequence generation.
We propose novel Scalable Transformers, which naturally contain sub-Transformers of different scales and share parameters among them.
A three-stage training scheme is proposed to tackle the difficulty of training the scalable Transformers.
arXiv Detail & Related papers (2021-06-04T04:04:10Z)
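For the GTrans entry above, the group-and-fuse idea can be sketched as follows: the per-layer hidden states of a stack are split into contiguous groups, each group is pooled, and the group features are mixed with learned weights before generating target words. This is a hedged sketch rather than the authors' implementation; the class name GroupFusion, the contiguous grouping, the mean pooling, and the softmax-weighted fusion are all assumptions.
```python
# Rough sketch of "group multi-layer representations, then fuse" (hypothetical
# names; contiguous chunking and softmax-weighted fusion are assumptions, not
# the paper's exact design).
import torch
import torch.nn as nn


class GroupFusion(nn.Module):
    def __init__(self, n_layers: int, n_groups: int):
        super().__init__()
        assert n_layers % n_groups == 0, "assume layers divide evenly into groups"
        self.n_groups = n_groups
        self.group_size = n_layers // n_groups
        self.fusion_weights = nn.Parameter(torch.zeros(n_groups))  # learned mixing weights

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, seq, d_model) tensors, one per layer.
        stacked = torch.stack(layer_outputs, dim=0)                           # (L, B, S, D)
        groups = stacked.view(self.n_groups, self.group_size, *stacked.shape[1:])
        group_feats = groups.mean(dim=1)                                      # (G, B, S, D)
        weights = torch.softmax(self.fusion_weights, dim=0)                   # (G,)
        return torch.einsum("g,gbsd->bsd", weights, group_feats)              # fused (B, S, D)


if __name__ == "__main__":
    fusion = GroupFusion(n_layers=6, n_groups=3)
    layer_states = [torch.randn(2, 10, 512) for _ in range(6)]
    print(fusion(layer_states).shape)  # torch.Size([2, 10, 512])
```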