Fast Training of NMT Model with Data Sorting
- URL: http://arxiv.org/abs/2308.08153v1
- Date: Wed, 16 Aug 2023 05:48:50 GMT
- Title: Fast Training of NMT Model with Data Sorting
- Authors: Daniela N. Rim, Kimera Richard, Heeyoul Choi
- Abstract summary: The Transformer model has revolutionized Natural Language Processing tasks such as Neural Machine Translation.
One potential area for improvement is to address the computation of empty (padding) tokens that the Transformer processes only to discard later.
We propose an algorithm that sorts sentence pairs based on their length before batching, minimizing the waste of computing power.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Transformer model has revolutionized Natural Language Processing tasks
such as Neural Machine Translation, and many efforts have been made to study
the Transformer architecture, which increased its efficiency and accuracy. One
potential area for improvement is to address the computation of empty tokens
that the Transformer computes only to discard them later, leading to an
unnecessary computational burden. To tackle this, we propose an algorithm that
sorts translation sentence pairs based on their length before batching,
minimizing the waste of computing power. Since too much sorting could
violate the independent and identically distributed (i.i.d.) data assumption, we
sort the data partially. In experiments, we apply the proposed method to
English-Korean and English-Luganda language pairs for machine translation and
show that there are gains in computational time while maintaining the
performance. Our method is independent of the model architecture, so it can be
easily integrated into any training process with variable-length data.
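The core idea lends itself to a short sketch. The snippet below is a minimal illustration of length-based partial sorting before batching, assuming sentence pairs are given as token lists; the chunk and batch sizes are placeholders, not the settings used in the paper.

```python
import random

def partial_sort_batches(pairs, chunk_size=4096, batch_size=64, shuffle=True):
    """Group sentence pairs of similar length to reduce padding.

    pairs: list of (src_tokens, tgt_tokens) tuples.
    Sorting is done only inside fixed-size chunks ("partial" sorting),
    so the data stream stays close to i.i.d. while most padding (empty)
    tokens disappear from each batch.
    """
    if shuffle:
        random.shuffle(pairs)                      # keep the global order random
    batches = []
    for start in range(0, len(pairs), chunk_size):
        chunk = pairs[start:start + chunk_size]
        # sort only within the chunk, by source and target length
        chunk.sort(key=lambda p: (len(p[0]), len(p[1])))
        for b in range(0, len(chunk), batch_size):
            batches.append(chunk[b:b + batch_size])
    if shuffle:
        random.shuffle(batches)                    # avoid feeding batches shortest-first
    return batches

# Toy usage: each batch contains pairs of similar length, so the maximum
# length per batch (and hence the number of <pad> tokens) stays small.
toy = [(["hi"], ["annyeong"]), (["how", "are", "you"], ["jal", "jinae", "yo"])]
print(partial_sort_batches(toy, chunk_size=2, batch_size=1, shuffle=False))
```

Because sorting happens only inside each shuffled chunk and the resulting batches are shuffled again, training order remains roughly random while each batch wastes little computation on padding.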
Related papers
- Exploring Linguistic Similarity and Zero-Shot Learning for Multilingual
Translation of Dravidian Languages [0.34998703934432673]
We build a single-decoder neural machine translation system for Dravidian-Dravidian multilingual translation.
Our model achieves scores within 3 BLEU of large-scale pivot-based models when it is trained on 50% of the language directions.
arXiv Detail & Related papers (2023-08-10T13:38:09Z) - No Train No Gain: Revisiting Efficient Training Algorithms For
Transformer-based Language Models [31.080446886440757]
In this work, we revisit three categories of such algorithms: dynamic architectures (layer stacking, layer dropping), batch selection (selective backprop, RHO loss), and efficient optimizers (Lion, Sophia).
We find that their training, validation, and downstream gains vanish compared to a baseline with a fully-decayed learning rate.
We define an evaluation protocol that enables computation to be done on arbitrary machines by mapping all computation time to a reference machine, which we call reference system time.
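As a rough illustration of the idea behind such a mapping (not the paper's actual protocol), one can calibrate a machine against a fixed workload whose runtime on the reference machine is known, and rescale measured training time accordingly; the workload and numbers below are assumptions.

```python
import time

def calibrate_speed_ratio(workload, reference_seconds):
    """Estimate how much slower/faster this machine is than the reference.

    `workload` is any fixed, deterministic function; `reference_seconds`
    is how long the same workload took on the chosen reference machine
    (an assumed, pre-measured constant).
    """
    start = time.perf_counter()
    workload()
    local_seconds = time.perf_counter() - start
    return local_seconds / reference_seconds

def to_reference_system_time(local_training_seconds, speed_ratio):
    # Map time measured on this machine onto the reference machine's clock.
    return local_training_seconds / speed_ratio

# Usage: if this machine runs the calibration workload 2x slower than the
# reference, 10 hours of local training count as 5 reference-system hours.
ratio = 2.0  # e.g. calibrate_speed_ratio(lambda: sum(i * i for i in range(10**7)), 0.35)
print(to_reference_system_time(36000.0, ratio))  # -> 18000.0 seconds
```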
arXiv Detail & Related papers (2023-07-12T20:10:14Z) - On the Pareto Front of Multilingual Neural Machine Translation [123.94355117635293]
We study how the performance of a given direction changes with its sampling ratio in Multilingual Neural Machine Translation (MNMT).
We propose the Double Power Law to predict the unique performance trade-off front in MNMT.
In our experiments, it achieves better performance than temperature searching and gradient manipulation methods with only 1/5 to 1/2 of the total training budget.
arXiv Detail & Related papers (2023-04-06T16:49:19Z) - Transkimmer: Transformer Learns to Layer-wise Skim [17.188613474427054]
One of the major computational inefficiencies of Transformer-based models is that they spend an identical amount of computation throughout all layers.
We propose Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer.
The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers.
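The following is a schematic sketch of layer-wise skimming in the spirit of this summary, not the Transkimmer implementation: a per-layer gate picks the tokens that continue, and the skimmed ones are copied straight to the final output so later layers never touch them.

```python
import numpy as np

def skim_forward(hidden, layers, gates):
    """Schematic layer-wise skimming.

    hidden: (seq_len, d) token states.
    layers: list of functions, each mapping (n, d) -> (n, d).
    gates:  list of functions, each mapping (n, d) -> boolean keep-mask (n,).
    Tokens dropped by a gate skip all later layers and are forwarded
    unchanged to the final output, saving their per-layer computation.
    """
    output = hidden.copy()
    active = np.arange(hidden.shape[0])          # indices still being processed
    states = hidden
    for layer, gate in zip(layers, gates):
        keep = gate(states)                      # boolean mask over active tokens
        output[active[~keep]] = states[~keep]    # skimmed tokens go straight to output
        active, states = active[keep], states[keep]
        if active.size == 0:
            break
        states = layer(states)
    output[active] = states                      # tokens that survived every layer
    return output

# Toy usage: simple layers and a gate that skims tokens with a small norm.
layers = [lambda h: h + 1.0] * 2
gates = [lambda h: np.linalg.norm(h, axis=-1) > 0.5] * 2
print(skim_forward(np.random.randn(5, 4), layers, gates).shape)  # (5, 4)
```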
arXiv Detail & Related papers (2022-05-15T16:23:30Z) - Linearizing Transformer with Key-Value Memory Bank [54.83663647680612]
We propose MemSizer, an approach that projects the source sequence into a lower-dimensional representation.
MemSizer not only achieves the same linear time complexity but also enjoys efficient recurrent-style autoregressive generation.
We demonstrate that MemSizer provides an improved tradeoff between efficiency and accuracy over the vanilla transformer.
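A very rough sketch of the general memory-bank idea (the projection below is a random stand-in, not MemSizer's learned parameterization): compressing a variable-length source sequence into a fixed number of memory slots makes attention over the source linear in its length.

```python
import numpy as np

def compress_to_memory(source_states, num_slots=8, rng=None):
    """Project a (T, d) source sequence into a fixed-size (num_slots, d) memory.

    Attention over the compressed memory costs O(T * num_slots) per target
    sequence instead of O(T^2), which is where the linear-time behaviour
    of memory-bank approaches comes from.
    """
    rng = rng or np.random.default_rng(0)
    T, d = source_states.shape
    # Stand-in for a learned projection: mixing weights over source positions.
    mix = rng.standard_normal((num_slots, T))
    mix = np.exp(mix) / np.exp(mix).sum(axis=-1, keepdims=True)   # rows sum to 1
    return mix @ source_states                                    # (num_slots, d)

source = np.random.randn(100, 512)           # 100 source tokens, d = 512
memory = compress_to_memory(source, num_slots=8)
print(memory.shape)                          # (8, 512)
```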
arXiv Detail & Related papers (2022-03-23T18:10:18Z) - Discovering Non-monotonic Autoregressive Orderings with Variational
Inference [67.27561153666211]
We develop an unsupervised parallelizable learner that discovers high-quality generation orders purely from training data.
We implement the encoder as a Transformer with non-causal attention that outputs permutations in one forward pass.
Empirical results in language modeling tasks demonstrate that our method is context-aware and discovers orderings that are competitive with or even better than fixed orders.
arXiv Detail & Related papers (2021-10-27T16:08:09Z) - Improving Neural Machine Translation by Bidirectional Training [85.64797317290349]
We present a simple and effective pretraining strategy -- bidirectional training (BiT) for neural machine translation.
Specifically, we bidirectionally update the model parameters at the early stage and then tune the model normally.
Experimental results show that BiT pushes the SOTA neural machine translation performance across 15 translation tasks on 8 language pairs significantly higher.
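The data side of such bidirectional pretraining can be sketched as follows; the stage split and step counts are illustrative assumptions rather than BiT's exact recipe.

```python
def bidirectional_stage_data(pairs):
    """Build training data for an early bidirectional stage.

    pairs: list of (source, target) sentence strings.
    Returns the original direction plus the reversed direction, so the
    model receives both src->tgt and tgt->src updates at the early stage.
    """
    forward = [(s, t) for s, t in pairs]
    backward = [(t, s) for s, t in pairs]
    return forward + backward

def training_schedule(pairs, total_steps, early_fraction=1/3):
    """Yield (stage, data, steps): bidirectional warm-up, then normal tuning."""
    early_steps = int(total_steps * early_fraction)   # illustrative split, not BiT's
    yield "bidirectional", bidirectional_stage_data(pairs), early_steps
    yield "normal", pairs, total_steps - early_steps

pairs = [("hello", "annyeonghaseyo"), ("thank you", "gamsahamnida")]
for stage, data, steps in training_schedule(pairs, total_steps=90000):
    print(stage, len(data), steps)
```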
arXiv Detail & Related papers (2021-09-16T07:58:33Z) - GroupBERT: Enhanced Transformer Architecture with Efficient Grouped
Structures [57.46093180685175]
We demonstrate a set of modifications to the structure of a Transformer layer, producing a more efficient architecture.
We add a convolutional module to complement the self-attention module, decoupling the learning of local and global interactions.
We apply the resulting architecture to language representation learning and demonstrate its superior performance compared to BERT models of different scales.
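As a generic illustration of pairing convolution with self-attention (not the actual GroupBERT layer), the sketch below uses a depthwise convolution sub-layer for local interactions next to a standard attention sub-layer for global ones.

```python
import torch
import torch.nn as nn

class ConvPlusAttentionBlock(nn.Module):
    """Schematic block: depthwise convolution for local patterns,
    self-attention for global context, applied as separate residual sub-layers."""

    def __init__(self, d_model=256, n_heads=4, kernel_size=7):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Conv1d(d_model, d_model, kernel_size,
                              padding=kernel_size // 2, groups=d_model)  # depthwise
        self.norm_attn = nn.LayerNorm(d_model)
        self.norm_conv = nn.LayerNorm(d_model)

    def forward(self, x):                       # x: (batch, seq_len, d_model)
        h = self.norm_attn(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out                        # global interactions
        h = self.norm_conv(x).transpose(1, 2)   # Conv1d expects (batch, d, seq)
        x = x + self.conv(h).transpose(1, 2)    # local interactions
        return x

block = ConvPlusAttentionBlock()
print(block(torch.randn(2, 10, 256)).shape)    # torch.Size([2, 10, 256])
```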
arXiv Detail & Related papers (2021-06-10T15:41:53Z) - Finetuning Pretrained Transformers into RNNs [81.72974646901136]
Transformers have outperformed recurrent neural networks (RNNs) in natural language generation.
A linear-complexity recurrent variant has proven well suited for autoregressive generation.
This work aims to convert a pretrained transformer into its efficient recurrent counterpart.
arXiv Detail & Related papers (2021-03-24T10:50:43Z) - Lexically Constrained Neural Machine Translation with Levenshtein
Transformer [8.831954614241234]
This paper proposes a simple and effective algorithm for incorporating lexical constraints in neural machine translation.
Our method injects terminology constraints at inference time without any impact on decoding speed.
arXiv Detail & Related papers (2020-04-27T09:59:27Z) - Controlling Computation versus Quality for Neural Sequence Models [42.525463454120256]
Conditional computation makes neural sequence models (Transformers) more efficient and computation-aware during inference.
We evaluate our approach on two tasks: (i) WMT English-French Translation and (ii) Unsupervised representation learning (BERT)
arXiv Detail & Related papers (2020-02-17T17:54:27Z)