SOTASTREAM: A Streaming Approach to Machine Translation Training
- URL: http://arxiv.org/abs/2308.07489v1
- Date: Mon, 14 Aug 2023 22:47:19 GMT
- Title: SOTASTREAM: A Streaming Approach to Machine Translation Training
- Authors: Matt Post and Thamme Gowda and Roman Grundkiewicz and Huda Khayrallah
and Rohit Jain and Marcin Junczys-Dowmunt
- Abstract summary: Many machine translation toolkits make use of a data preparation step wherein raw data is transformed into a tensor format that can be used directly by the trainer.
We propose an alternative approach that separates the generation of data from the consumption of that data.
In this approach, there is no separate pre-processing step; data generation produces an infinite stream of permutations of the raw training data.
- Score: 13.39347756245191
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Many machine translation toolkits make use of a data preparation step wherein
raw data is transformed into a tensor format that can be used directly by the
trainer. This preparation step is increasingly at odds with modern research and
development practices because this process produces a static, unchangeable
version of the training data, making common training-time needs difficult
(e.g., subword sampling), time-consuming (preprocessing with large data can
take days), expensive (e.g., disk space), and cumbersome (managing experiment
combinatorics). We propose an alternative approach that separates the
generation of data from the consumption of that data. In this approach, there
is no separate pre-processing step; data generation produces an infinite stream
of permutations of the raw training data, which the trainer tensorizes and
batches as it is consumed. Additionally, this data stream can be manipulated by
a set of user-definable operators that provide on-the-fly modifications, such
as data normalization, augmentation or filtering. We release an open-source
toolkit, SOTASTREAM, that implements this approach:
https://github.com/marian-nmt/sotastream. We show that it cuts training time,
adds flexibility, reduces experiment management complexity, and reduces disk
space, all without affecting the accuracy of the trained models.
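To make the data flow concrete, here is a minimal Python sketch of the approach the abstract describes, under the assumption of a generator-based pipeline. It is not SOTASTREAM's actual API, and all names (infinite_permutations, length_filter, lowercase, batches) are hypothetical: a generator re-permutes the raw parallel data forever, composable operators modify the stream on the fly, and the consumer batches it as it is read.

```python
# Illustrative sketch of the streaming idea described in the abstract.
# This is NOT SOTASTREAM's actual API; all names here are hypothetical.
import random
from typing import Iterable, Iterator, List, Tuple

Pair = Tuple[str, str]  # (source sentence, target sentence)


def infinite_permutations(pairs: List[Pair], seed: int = 1) -> Iterator[Pair]:
    """Yield the raw training pairs forever, reshuffled on every pass."""
    rng = random.Random(seed)
    while True:
        order = list(range(len(pairs)))
        rng.shuffle(order)
        for i in order:
            yield pairs[i]


def length_filter(stream: Iterable[Pair], max_len: int = 100) -> Iterator[Pair]:
    """On-the-fly filtering operator: drop overly long pairs."""
    for src, tgt in stream:
        if len(src.split()) <= max_len and len(tgt.split()) <= max_len:
            yield (src, tgt)


def lowercase(stream: Iterable[Pair]) -> Iterator[Pair]:
    """On-the-fly normalization operator."""
    for src, tgt in stream:
        yield (src.lower(), tgt.lower())


def batches(stream: Iterable[Pair], batch_size: int = 4) -> Iterator[List[Pair]]:
    """Consumer side: group the stream into batches as it is read.
    (A real trainer would also tensorize here, e.g. apply subword
    sampling and numericalize tokens.)"""
    batch: List[Pair] = []
    for pair in stream:
        batch.append(pair)
        if len(batch) == batch_size:
            yield batch
            batch = []


if __name__ == "__main__":
    raw = [("Hello world .", "Hallo Welt ."),
           ("Good morning .", "Guten Morgen ."),
           ("How are you ?", "Wie geht es dir ?")]
    stream = lowercase(length_filter(infinite_permutations(raw)))
    for step, batch in enumerate(batches(stream, batch_size=2)):
        print(step, batch)
        if step == 2:  # the stream itself never ends
            break
```

In a setup like this, the generator side could just as well write lines to standard output for a trainer to read from standard input, which is one way generation and consumption can be fully decoupled without ever materializing a preprocessed copy of the corpus on disk.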
Related papers
- High-Dimensional Distributed Sparse Classification with Scalable Communication-Efficient Global Updates [50.406127962933915]
We develop solutions that enable us to learn a communication-efficient distributed logistic regression model.
In our experiments, we demonstrate a large improvement in accuracy over distributed algorithms, with only a few distributed update steps needed.
arXiv Detail & Related papers (2024-07-08T19:34:39Z) - Efficient Online Data Mixing For Language Model Pre-Training [101.45242332613944]
Existing data selection methods suffer from slow and computationally expensive processes.
Data mixing, on the other hand, reduces the complexity of data selection by grouping data points together.
We develop an efficient algorithm for Online Data Mixing (ODM) that combines elements from both data selection and data mixing.
arXiv Detail & Related papers (2023-12-05T00:42:35Z) - RINAS: Training with Dataset Shuffling Can Be General and Fast [2.485503195398027]
RINAS is a data loading framework that addresses the performance bottleneck of loading global shuffled datasets.
We implement RINAS under the PyTorch framework for the common dataset libraries HuggingFace and TorchVision.
Our experimental results show that RINAS improves the throughput of general language model training and vision model training by up to 59% and 89%, respectively.
arXiv Detail & Related papers (2023-12-04T21:50:08Z) - DiffPrep: Differentiable Data Preprocessing Pipeline Search for Learning
over Tabular Data [12.416345241511781]
We propose DiffPrep to automatically and efficiently search for a data preprocessing pipeline for a given dataset.
Our experiments show that DiffPrep achieves the best test accuracy on 15 out of the 18 real-world datasets evaluated.
arXiv Detail & Related papers (2023-08-20T23:40:26Z) - Scaling Data-Constrained Language Models [137.17302576977346]
We investigate scaling language models in data-constrained regimes.
We find that with constrained data for a fixed compute budget, training with up to 4 epochs of repeated data yields negligible changes to loss compared to having unique data.
We propose and empirically validate a scaling law for compute optimality that accounts for the decreasing value of repeated tokens and excess parameters.
arXiv Detail & Related papers (2023-05-25T17:18:55Z) - On the Transferability of Pre-trained Language Models: A Study from
Artificial Datasets [74.11825654535895]
Pre-training language models (LMs) on large-scale unlabeled text data makes it much easier for the models to achieve exceptional downstream performance.
We study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to its counterparts trained from scratch on downstream tasks.
arXiv Detail & Related papers (2021-09-08T10:39:57Z) - Data Aggregation for Reducing Training Data in Symbolic Regression [0.0]
This work discusses methods to reduce the training data and thereby also the runtime of genetic programming.
K-means clustering and data binning are used for data aggregation and compared with random sampling as the simplest data reduction method.
The performance of genetic programming is compared with random forests and linear regression.
arXiv Detail & Related papers (2021-08-24T11:58:17Z) - How Well Self-Supervised Pre-Training Performs with Streaming Data? [73.5362286533602]
In real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming.
It is unclear how well sequential self-supervised pre-training performs with streaming data.
We find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild.
arXiv Detail & Related papers (2021-04-25T06:56:48Z) - Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
To tackle the size of the resulting dataset, we propose to apply a dataset distillation strategy that compresses the created dataset into several informative class-wise images.
We experimentally verify that the new dataset can significantly improve the ability of the learned FER model.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)