Scaling Laws for Neural Machine Translation
- URL: http://arxiv.org/abs/2109.07740v1
- Date: Thu, 16 Sep 2021 06:15:20 GMT
- Title: Scaling Laws for Neural Machine Translation
- Authors: Behrooz Ghorbani, Orhan Firat, Markus Freitag, Ankur Bapna, Maxim
Krikun, Xavier Garcia, Ciprian Chelba, Colin Cherry
- Abstract summary: We show that cross-entropy loss as a function of model size follows a certain scaling law.
We also investigate the relationship between the cross-entropy loss and the quality of the translations generated.
- Score: 21.76567580425173
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present an empirical study of scaling properties of encoder-decoder
Transformer models used in neural machine translation (NMT). We show that
cross-entropy loss as a function of model size follows a certain scaling law.
Specifically, (i) We propose a formula which describes the scaling behavior of
cross-entropy loss as a bivariate function of encoder and decoder size, and
show that it gives accurate predictions under a variety of scaling approaches
and languages; we show that the total number of parameters alone is not
sufficient for such purposes. (ii) We observe different power law exponents
when scaling the decoder vs scaling the encoder, and provide recommendations
for optimal allocation of encoder/decoder capacity based on this observation.
(iii) We also report that the scaling behavior of the model is acutely
influenced by composition bias of the train/test sets, which we define as any
deviation from naturally generated text (whether via machine-generated or
human-translated text). We observe that natural text on the target side enjoys
scaling, which manifests as successful reduction of the cross-entropy loss.
(iv) Finally, we investigate the relationship between the cross-entropy loss
and the quality of the generated translations. We find two different behaviors,
depending on the nature of the test data. For test sets which were originally
translated from the target language to the source language, both loss and BLEU
score improve as model size increases. In contrast, for test sets originally
translated from the source language to the target language, the loss improves, but the
BLEU score stops improving after a certain threshold. We release generated text
from all models used in this study.
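The abstract does not reproduce the proposed bivariate formula here. As a minimal sketch, assuming a power-law-plus-constant form L(N_e, N_d) = alpha * (N̄_e/N_e)^{p_e} * (N̄_d/N_d)^{p_d} + L_inf in the encoder and decoder parameter counts, such a law could be fit to measured dev losses with ordinary least squares; the functional form, names, and all numbers below are illustrative assumptions, not values taken from the paper.

```python
# Hedged sketch: fitting a bivariate power-law-plus-constant scaling law
# to (encoder params, decoder params, dev cross-entropy) measurements.
import numpy as np
from scipy.optimize import curve_fit

NE_REF, ND_REF = 1e8, 1e8  # reference sizes (assumed); they only rescale alpha

def scaling_law(sizes, alpha, p_e, p_d, l_inf):
    """Power-law-plus-constant in encoder size n_e and decoder size n_d."""
    n_e, n_d = sizes
    return alpha * (NE_REF / n_e) ** p_e * (ND_REF / n_d) ** p_d + l_inf

# Toy measurements (illustrative only, NOT numbers from the paper).
n_e = np.array([5e7, 1e8, 2e8, 4e8, 1e8, 1e8, 2e8, 4e8])
n_d = np.array([1e8, 1e8, 1e8, 1e8, 2e8, 4e8, 2e8, 4e8])
rng = np.random.default_rng(0)
loss = scaling_law((n_e, n_d), 1.2, 0.38, 0.26, 1.0) + rng.normal(0, 0.005, n_e.shape)

# Fit the four free parameters (alpha, p_e, p_d, L_inf) to the measurements.
popt, _ = curve_fit(scaling_law, (n_e, n_d), loss,
                    p0=[1.0, 0.3, 0.3, 0.5], maxfev=20000)
alpha, p_e, p_d, l_inf = popt
print(f"alpha={alpha:.3f}  p_e={p_e:.3f}  p_d={p_d:.3f}  L_inf={l_inf:.3f}")
# Different fitted exponents p_e and p_d would mirror the paper's observation
# that encoder and decoder capacity do not trade off symmetrically.
```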
Related papers
- Predictor-Corrector Enhanced Transformers with Exponential Moving Average Coefficient Learning [73.73967342609603]
We introduce a predictor-corrector learning framework to minimize truncation errors.
We also propose an exponential moving average-based coefficient learning method to strengthen our higher-order predictor.
Our model surpasses a robust 3.8B DeepNet by an average of 2.9 SacreBLEU, using only 1/3 of its parameters.
arXiv Detail & Related papers (2024-11-05T12:26:25Z) - Autoregressive Speech Synthesis without Vector Quantization [135.4776759536272]
We present MELLE, a novel continuous-valued token-based language modeling approach for text-to-speech synthesis (TTS).
MELLE autoregressively generates continuous mel-spectrogram frames directly from the text condition.
arXiv Detail & Related papers (2024-07-11T14:36:53Z) - Translatotron-V(ison): An End-to-End Model for In-Image Machine Translation [81.45400849638347]
In-image machine translation (IIMT) aims to translate an image containing text in the source language into an image containing the translation in the target language.
In this paper, we propose an end-to-end IIMT model consisting of four modules.
Our model achieves performance competitive with cascaded models while using only 70.9% of their parameters, and significantly outperforms the pixel-level end-to-end IIMT model.
arXiv Detail & Related papers (2024-07-03T08:15:39Z) - Tilt your Head: Activating the Hidden Spatial-Invariance of Classifiers [0.7704032792820767]
Deep neural networks are applied in ever more areas of everyday life, yet they still lack essential abilities, such as robustly dealing with spatially transformed input signals.
We propose a novel technique to emulate such an inference process for neural nets.
arXiv Detail & Related papers (2024-05-06T09:47:29Z) - The Languini Kitchen: Enabling Language Modelling Research at Different
Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
arXiv Detail & Related papers (2023-09-20T10:31:17Z) - Towards Fine-Grained Information: Identifying the Type and Location of
Translation Errors [80.22825549235556]
Existing approaches cannot simultaneously consider the position and type of translation errors.
We build an FG-TED model to predict both addition and omission errors.
Experiments show that our model can identify both error type and position concurrently, and gives state-of-the-art results.
arXiv Detail & Related papers (2023-02-17T16:20:33Z) - Why Does Surprisal From Larger Transformer-Based Language Models Provide
a Poorer Fit to Human Reading Times? [9.909170013118775]
These results suggest that the propensity of larger Transformer-based models to 'memorize' sequences during training makes their surprisal estimates diverge from humanlike expectations.
arXiv Detail & Related papers (2022-12-23T03:57:54Z) - Broken Neural Scaling Laws [9.020652910657931]
Broken Neural Scaling Law (BNSL) accurately models and extrapolates the scaling behaviors of deep neural networks.
The evaluated set of tasks includes large-scale vision, language, audio, video, diffusion, generative modeling, multimodal learning, contrastive learning, AI alignment, robotics, and out-of-distribution (OOD) generalization.
arXiv Detail & Related papers (2022-10-26T17:45:01Z) - Scaling Laws for Autoregressive Generative Modeling [30.051804305320424]
We identify empirical scaling laws for the cross-entropy loss in four domains: generative image modeling, video modeling, multimodal image$\leftrightarrow$text models, and mathematical problem solving.
In all cases autoregressive Transformers smoothly improve in performance as model size and compute budgets increase, following a power-law plus constant scaling law (a generic form of this law is spelled out after this list).
arXiv Detail & Related papers (2020-10-28T02:17:24Z) - Understanding Neural Abstractive Summarization Models via Uncertainty [54.37665950633147]
Seq2seq abstractive summarization models generate text in a free-form manner.
We study the entropy, or uncertainty, of the model's token-level predictions.
We show that uncertainty is a useful perspective for analyzing summarization and text generation models more broadly.
arXiv Detail & Related papers (2020-10-15T16:57:27Z)
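For reference, the "power-law plus constant" form named in the Scaling Laws for Autoregressive Generative Modeling entry above is conventionally written as $L(N) = (N_c / N)^{\alpha_N} + L_\infty$, where $N$ is model size (or compute), $\alpha_N$ is the fitted exponent, and $L_\infty$ is the irreducible loss; this is a standard rendering of such laws, not notation copied from that paper's abstract.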