Data Scaling Laws in NMT: The Effect of Noise and Architecture
- URL: http://arxiv.org/abs/2202.01994v1
- Date: Fri, 4 Feb 2022 06:53:49 GMT
- Title: Data Scaling Laws in NMT: The Effect of Noise and Architecture
- Authors: Yamini Bansal, Behrooz Ghorbani, Ankush Garg, Biao Zhang, Maxim
Krikun, Colin Cherry, Behnam Neyshabur, Orhan Firat
- Abstract summary: We study the effect of varying the architecture and training data quality on the data scaling properties of Neural Machine Translation (NMT).
We find that the data scaling exponents are minimally impacted, suggesting that marginally worse architectures or training data can be compensated for by adding more data.
- Score: 59.767899982937756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this work, we study the effect of varying the architecture and training
data quality on the data scaling properties of Neural Machine Translation
(NMT). First, we establish that the test loss of encoder-decoder transformer
models scales as a power law in the number of training samples, with a
dependence on the model size. Then, we systematically vary aspects of the
training setup to understand how they impact the data scaling laws. In
particular, we change the following: (1) Architecture and task setup: we compare
to a transformer-LSTM hybrid, and to a decoder-only transformer with a language
modeling loss; (2) Noise level in the training distribution: we experiment with
filtering, and with adding iid synthetic noise. In all the above cases, we find that
the data scaling exponents are minimally impacted, suggesting that marginally
worse architectures or training data can be compensated for by adding more
data. Lastly, we find that using back-translated data instead of parallel data
can significantly degrade the scaling exponent.
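The power-law claim above can be made concrete with a small curve-fitting sketch. The snippet below is a minimal Python illustration, assuming a generic saturating form L(D) ~= L_inf + beta * D^(-alpha), where D is the number of training samples; the loss measurements are hypothetical and the paper's exact parameterization and fitting procedure may differ.

    # Minimal sketch of fitting a data scaling law of the assumed form
    #   L(D) ~= L_inf + beta * D**(-alpha)
    # where D is the number of training samples and L is the test loss.
    # Illustration only: the data points and the parameterization are hypothetical.
    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_law(D, L_inf, beta, alpha):
        """Saturating power law: irreducible loss plus a power-law term in data size."""
        return L_inf + beta * D ** (-alpha)

    # Hypothetical (dataset size, test loss) measurements from a series of training runs.
    D = np.array([1e5, 3e5, 1e6, 3e6, 1e7, 3e7])
    L = np.array([4.10, 3.56, 3.15, 2.88, 2.68, 2.54])

    # Fit the three parameters; reasonable initial guesses help convergence.
    (L_inf, beta, alpha), _ = curve_fit(scaling_law, D, L, p0=[2.0, 100.0, 0.3], maxfev=10000)
    print(f"L_inf={L_inf:.3f}, beta={beta:.3f}, data scaling exponent alpha={alpha:.3f}")

Under this form, the observation that marginally worse architectures or noisier training data can be compensated for by adding more data corresponds to changes in beta (or L_inf) with the exponent alpha roughly unchanged, whereas the back-translation result corresponds to a smaller alpha.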
Related papers
- Understanding Scaling Laws with Statistical and Approximation Theory for Transformer Neural Networks on Intrinsically Low-dimensional Data [4.481230230086981]
In deep neural networks, a model's generalization error is often observed to follow a power scaling law dependent both on the model size and the data size.
We show that our theory predicts a power law between the generalization error and both the training data size and the network size for transformers.
By leveraging low-dimensional data structures under a manifold hypothesis, we are able to explain transformer scaling laws in a way which respects the data geometry.
arXiv Detail & Related papers (2024-11-11T01:05:28Z)
- How Does Data Diversity Shape the Weight Landscape of Neural Networks? [2.89287673224661]
We investigate the impact of dropout, weight decay, and noise augmentation on the parameter space of neural networks.
We observe that diverse data influences the weight landscape in a similar fashion as dropout.
We conclude that synthetic data can bring more diversity into real input data, resulting in better performance on out-of-distribution test instances.
arXiv Detail & Related papers (2024-10-18T16:57:05Z)
- When More Data Hurts: Optimizing Data Coverage While Mitigating Diversity Induced Underfitting in an Ultra-Fast Machine-Learned Potential [0.0]
This study investigates how training data diversity affects the performance of machine-learned interatomic potentials (MLIPs).
We employ expert and autonomously generated data to create the training data and fit four force-field variants to subsets of the data.
Our findings reveal a critical balance in training data diversity: insufficient diversity hinders generalization, while excessive diversity can exceed the MLIP's learning capacity.
arXiv Detail & Related papers (2024-09-11T20:45:44Z)
- Scaling Retrieval-Based Language Models with a Trillion-Token Datastore [85.4310806466002]
We find that increasing the size of the datastore used by a retrieval-based LM monotonically improves language modeling and several downstream tasks without obvious saturation.
By plotting compute-optimal scaling curves with varied datastore, model, and pretraining data sizes, we show that using larger datastores can significantly improve model performance for the same training compute budget.
arXiv Detail & Related papers (2024-07-09T08:27:27Z)
- Learning with Noisy Foundation Models [95.50968225050012]
This paper is the first work to comprehensively understand and analyze the nature of noise in pre-training datasets.
We propose a tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise and improve generalization.
arXiv Detail & Related papers (2024-03-11T16:22:41Z)
- Understanding and Mitigating the Label Noise in Pre-training on Downstream Tasks [91.15120211190519]
This paper aims to understand the nature of noise in pre-training datasets and to mitigate its impact on downstream tasks.
We propose a light-weight black-box tuning method (NMTune) to affine the feature space to mitigate the malignant effect of noise.
arXiv Detail & Related papers (2023-09-29T06:18:15Z)
- Dataset Quantization [72.61936019738076]
We present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets.
DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio.
arXiv Detail & Related papers (2023-08-21T07:24:29Z)
- The Lie Derivative for Measuring Learned Equivariance [84.29366874540217]
We study the equivariance properties of hundreds of pretrained models, spanning CNNs, transformers, and Mixer architectures.
We find that many violations of equivariance can be linked to spatial aliasing in ubiquitous network layers, such as pointwise non-linearities.
Surprisingly, transformers can be more equivariant than convolutional neural networks after training.
arXiv Detail & Related papers (2022-10-06T15:20:55Z)
- An Empirical Evaluation of the t-SNE Algorithm for Data Visualization in Structural Engineering [2.4493299476776773]
The t-Distributed Stochastic Neighbor Embedding (t-SNE) algorithm is used to reduce the dimensions of an earthquake-related data set for visualization purposes.
The Synthetic Minority Oversampling Technique (SMOTE) is used to tackle the imbalanced nature of the data set.
We show that, by applying t-SNE to the imbalanced data and SMOTE to the training data set, neural network classifiers achieve promising results without sacrificing accuracy.
arXiv Detail & Related papers (2021-09-18T01:24:39Z)
- Scaling Laws for Transfer [0.5432984841650929]
We study scaling laws for transfer learning between distributions in an unsupervised, fine-tuning setting.
We find that the effective data transferred is described well in the low data regime by a power-law of parameter count and fine-tuning dataset size; a hedged sketch of this functional form follows at the end of this list.
arXiv Detail & Related papers (2021-02-02T04:07:38Z)
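As referenced in the Scaling Laws for Transfer entry above, the effective data transferred is modeled as a power law in parameter count and fine-tuning dataset size. Below is a minimal Python sketch of that functional form; the constant, the exponents, and the example numbers are hypothetical placeholders, not values reported in that paper.

    # Sketch of the functional form from the "Scaling Laws for Transfer" entry:
    # effective data transferred D_T modeled as a multiplicative power law in
    # fine-tuning dataset size D_F and parameter count N,
    #   D_T ~= k * D_F**alpha * N**beta.
    # All constants below are hypothetical placeholders for illustration.

    def effective_data_transferred(D_F: float, N: float,
                                   k: float = 1.0,
                                   alpha: float = 0.3,
                                   beta: float = 0.4) -> float:
        """Pretraining-equivalent data attributed to transfer, under the assumed form."""
        return k * (D_F ** alpha) * (N ** beta)

    # Hypothetical usage: a 100M-parameter model fine-tuned on 1M examples.
    print(f"effective data transferred ~ {effective_data_transferred(D_F=1e6, N=1e8):.3e}")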