DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing
- URL: http://arxiv.org/abs/2212.03597v3
- Date: Sun, 14 Jan 2024 22:14:26 GMT
- Title: DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing
- Authors: Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes,
Cheng Li, Yuxiong He
- Abstract summary: DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost while still maintaining 95% of the model quality of the baseline trained with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost.
- Score: 57.86954315102865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning models come at the price of formidable
training costs. The increasing model size is one of the root causes, but another,
less-emphasized factor is that data scale is growing at a similar pace as model
scale, and the training cost is proportional to both. Compared to the rapidly
evolving model architectures, how to efficiently use the training data (especially
for expensive foundation model pretraining) is both less explored and difficult to
realize due to the lack of a convenient framework that focuses on data efficiency
capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that
makes better use of data, increases training efficiency, and improves model quality.
Specifically, we propose and combine two data efficiency techniques: efficient data
sampling via a general curriculum learning library, and efficient data routing via a
novel random layerwise token dropping technique. For GPT-3 1.3B language model
pretraining, our work achieves 12.5x less data/time/cost ($3.7K if rented on Azure)
while still maintaining 95% of the model quality of the baseline with full data and
cost ($46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve
the same model quality with up to 2x less data/time/cost, or achieve better model
quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and
tune, enabling us to easily apply it and verify its benefits on additional tasks,
including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
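To make the data routing idea concrete, below is a minimal PyTorch sketch of random layerwise token dropping: during training, each transformer layer processes only a random subset of token positions, while the remaining tokens bypass the layer unchanged. This is an illustration under stated assumptions (a generic layer that maps (batch, seq, hidden) tensors, a fixed keep ratio), not DeepSpeed's actual implementation or API.

```python
# Minimal sketch of random layerwise token dropping (illustration only, not
# DeepSpeed's implementation; module name, keep ratio, and gather/scatter
# scheme are assumptions).
import torch
import torch.nn as nn


class RandomTokenDropLayer(nn.Module):
    """Wraps a transformer layer so that, during training, only a random
    subset of token positions is routed through it; dropped tokens bypass
    the layer unchanged."""

    def __init__(self, layer: nn.Module, keep_ratio: float = 0.5):
        super().__init__()
        self.layer = layer              # any module mapping (B, T, D) -> (B, T, D)
        self.keep_ratio = keep_ratio    # fraction of tokens processed by the layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if not self.training or self.keep_ratio >= 1.0:
            return self.layer(hidden_states)        # full sequence at eval time

        batch, seq_len, dim = hidden_states.shape
        num_keep = max(1, int(seq_len * self.keep_ratio))

        # Sample a random subset of positions independently per sequence.
        scores = torch.rand(batch, seq_len, device=hidden_states.device)
        keep_idx = scores.topk(num_keep, dim=1).indices             # (B, num_keep)
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)     # (B, num_keep, D)

        # Route only the kept tokens through the (expensive) layer.
        kept = torch.gather(hidden_states, 1, gather_idx)
        kept = self.layer(kept)

        # Scatter the updated tokens back; the rest skip this layer.
        output = hidden_states.clone()
        output.scatter_(1, gather_idx, kept)
        return output
```

In the framework itself, the fraction of kept tokens follows a schedule over training and full sequences are used at evaluation time; the sketch fixes the ratio for brevity.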
Related papers
- Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data [59.6985168241067]
Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources.
We propose a new FL framework, FedDUMAP, to leverage the shared insensitive data on the server and the distributed data in edge devices.
Our proposed FL model, FedDUMAP, combines these three original techniques and achieves significantly better performance than baseline approaches.
arXiv Detail & Related papers (2024-08-11T02:59:11Z)
- AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.
We introduce AutoScale, a novel, practical approach for optimizing data compositions at potentially large training data scales.
Our evaluation on GPT-2 Large and BERT pre-training demonstrates AutoScale's effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling [27.975832264345772]
We propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web.
We show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x.
At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answering accuracy across 13 tasks by more than 2%.
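As a rough sketch of this rephrase-augmentation recipe, the loop below pairs each raw web document with a paraphrase produced by an instruction-tuned model; the prompt wording and the `generate` callable are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch of WRAP-style rephrase augmentation. The prompt wording and
# the `generate` callable are illustrative assumptions, not the paper's setup.
from typing import Callable, Iterable, Iterator

PARAPHRASE_PROMPT = (
    "Rewrite the following web text in clear, high-quality English, "
    "preserving all factual content:\n\n{document}\n\nRewritten text:"
)


def rephrase_corpus(
    documents: Iterable[str],
    generate: Callable[[str], str],  # any instruction-tuned LLM, wrapped as prompt -> completion
) -> Iterator[str]:
    """Yield each raw web document together with its paraphrase, so the
    pretraining stream mixes real and rephrased text."""
    for doc in documents:
        yield doc                                                # original, noisy text
        yield generate(PARAPHRASE_PROMPT.format(document=doc))   # cleaner paraphrase
```

Pretraining then proceeds on the mixed stream of raw and rephrased text.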
arXiv Detail & Related papers (2024-01-29T18:19:08Z)
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
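One way to picture this kind of pruning is to cluster example embeddings into concepts and keep more samples from diverse clusters than from tight, redundant ones. The sketch below uses k-means and a mean-distance-to-centroid complexity proxy, which are assumptions for illustration rather than the paper's exact criterion.

```python
# Hedged sketch of pruning by concept-cluster complexity: cluster example
# embeddings, then keep more samples from diverse clusters and fewer from
# tight, redundant ones. The complexity proxy and keep-rate rule are
# assumptions for illustration, not the paper's exact criterion.
import numpy as np
from sklearn.cluster import KMeans


def prune_by_cluster_complexity(
    embeddings: np.ndarray,      # (N, D) CLIP-style embeddings of the examples
    n_clusters: int = 1000,
    base_keep: float = 0.3,      # minimum fraction kept in the simplest clusters
) -> np.ndarray:
    """Return sorted indices of the examples to keep."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    labels, centers = km.labels_, km.cluster_centers_
    members = [np.where(labels == c)[0] for c in range(n_clusters)]

    # Complexity proxy: mean distance of a cluster's members to its centroid.
    complexity = np.zeros(n_clusters)
    for c, idx in enumerate(members):
        if idx.size:
            complexity[c] = np.linalg.norm(embeddings[idx] - centers[c], axis=1).mean()
    complexity /= complexity.max() + 1e-12   # normalize to [0, 1]

    keep = []
    for c, idx in enumerate(members):
        if idx.size == 0:
            continue
        # Simple (low-complexity) clusters are pruned harder than diverse ones.
        keep_frac = base_keep + (1.0 - base_keep) * complexity[c]
        n_keep = max(1, int(round(idx.size * keep_frac)))
        dists = np.linalg.norm(embeddings[idx] - centers[c], axis=1)
        keep.extend(idx[np.argsort(-dists)[:n_keep]].tolist())  # keep the least redundant
    return np.asarray(sorted(keep))
```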
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Recommendation Unlearning via Influence Function [42.4931807753579]
We propose a new Influence Function-based Recommendation Unlearning (IFRU) framework, which efficiently updates the model without retraining.
IFRU achieves more than 250x acceleration over retraining-based methods, with recommendation performance comparable to full retraining.
arXiv Detail & Related papers (2023-07-05T09:42:51Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.