SOTASTREAM: A Streaming Approach to Machine Translation Training
- URL: http://arxiv.org/abs/2308.07489v1
- Date: Mon, 14 Aug 2023 22:47:19 GMT
- Title: SOTASTREAM: A Streaming Approach to Machine Translation Training
- Authors: Matt Post and Thamme Gowda and Roman Grundkiewicz and Huda Khayrallah
and Rohit Jain and Marcin Junczys-Dowmunt
- Abstract summary: Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer.
We propose an alternative approach that separates the generation of data from the consumption of that data.
In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data.
- Score: 13.39347756245191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine translation toolkits make use of a data preparation step wherein
raw data is transformed into a tensor format that can be used directly by the
trainer. This preparation step is increasingly at odds with modern research and
development practices because this process produces a static, unchangeable
version of the training data, making common training-time needs difficult
(e.g., subword sampling), time-consuming (preprocessing with large data can
take days), expensive (e.g., disk space), and cumbersome (managing experiment
combinatorics). We propose an alternative approach that separates the
generation of data from the consumption of that data. In this approach, there
is no separate pre-processing step; data generation produces an infinite stream
of permutations of the raw training data, which the trainer tensorizes and
batches as it is consumed. Additionally, this data stream can be manipulated by
a set of user-definable operators that provide on-the-fly modifications, such
as data normalization, augmentation or filtering. We release an open-source
toolkit, SOTASTREAM, that implements this approach:
https://github.com/marian-nmt/sotastream. We show that it cuts training time,
adds flexibility, reduces experiment management complexity, and reduces disk
space, all without affecting the accuracy of the trained models.
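To make the data flow concrete, here is a minimal Python sketch of the approach the abstract describes, under the assumption of a generator-based pipeline. It is not SOTASTREAM's actual API, and all names (infinite_permutations, length_filter, lowercase, batches) are hypothetical: a generator re-permutes the raw parallel data forever, composable operators modify the stream on the fly, and the consumer batches it as it is read.

```python
# Illustrative sketch of the streaming idea described in the abstract.
# This is NOT SOTASTREAM's actual API; all names here are hypothetical.
import random
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def infinite_permutations(pairs: List[Pair], seed: int = 1) -> Iterator[Pair]:
    """Yield the raw training pairs forever, reshuffled on every pass."""
    rng = random.Random(seed)
    while True:
        order = list(range(len(pairs)))
        rng.shuffle(order)
        for i in order:
            yield pairs[i]


def length_filter(stream: Iterable[Pair], max_len: int = 100) -> Iterator[Pair]:
    """On-the-fly filtering operator: drop overly long pairs."""
    for src, tgt in stream:
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            yield (src, tgt)


def lowercase(stream: Iterable[Pair]) -> Iterator[Pair]:
    """On-the-fly normalization operator."""
    for src, tgt in stream:
        yield (src.lower(), tgt.lower())


def batches(stream: Iterable[Pair], batch_size: int = 4) -> Iterator[List[Pair]]:
    """Consumer side: group the stream into batches as it is read.
    (A real trainer would also tensorize here, e.g. apply subword
    sampling and numericalize tokens.)"""
    batch: List[Pair] = []
    for pair in stream:
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []


if __name__ == "__main__":
    raw = [("Hello world .", "Hallo Welt ."),
           ("Good morning .", "Guten Morgen ."),
           ("How are you ?", "Wie geht es dir ?")]
    stream = lowercase(length_filter(infinite_permutations(raw)))
    for step, batch in enumerate(batches(stream, batch_size=2)):
        print(step, batch)
        if step == 2:  # the stream itself never ends
            break
```

In a setup like this, the generator side could just as well write lines to standard output for a trainer to read from standard input, which is one way generation and consumption can be fully decoupled without ever materializing a preprocessed copy of the corpus on disk.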
Related papers
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable us to learn a communication-efficient distributed logistic regression model.
In our experiments, we demonstrate a large improvement in accuracy over distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - RINAS: Training with Dataset Shuffling Can Be General and Fast [2.485503195398027]
RINAS is a data loading framework that addresses the performance bottleneck of loading global shuffled datasets.
We implement RINAS under the PyTorch framework for the common dataset libraries HuggingFace and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
arXiv Detail & Related papers (2023-12-04T21:50:08Z) - DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning
over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the models to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z) - Data Aggregation for Reducing Training Data in Symbolic Regression [0.0]
This work discusses methods to reduce the training data and thereby also the runtime of genetic programming.
K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method.
The performance of genetic programming is compared with random forests and linear regression.
arXiv Detail & Related papers (2021-08-24T11:58:17Z) - How Well Self-Supervised Pre-Training Performs with Streaming Data? [73.5362286533602]
In real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming.
It is unclear how well sequential self-supervised pre-training performs with streaming data.
We find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild.
arXiv Detail & Related papers (2021-04-25T06:56:48Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To tackle the size of the resulting dataset, we propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)