DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing
- URL: http://arxiv.org/abs/2212.03597v3
- Date: Sun, 14 Jan 2024 22:14:26 GMT
- Title: DeepSpeed Data Efficiency: Improving Deep Learning Model Quality and
Training Efficiency via Efficient Data Sampling and Routing
- Authors: Conglong Li, Zhewei Yao, Xiaoxia Wu, Minjia Zhang, Connor Holmes,
Cheng Li, Yuxiong He
- Abstract summary: DeepSpeed Data Efficiency is a framework that makes better use of data, increases training efficiency, and improves model quality.
For GPT-3 1.3B language model pretraining, our work achieves 12.5x less data/time/cost while still maintaining 95% of the model quality of the baseline trained with full data and cost.
For GPT-3 1.3B and BERT-large pretraining, our work can also achieve the same model quality with up to 2x less data/time/cost, or achieve better model quality under the same data/time/cost.
- Score: 57.86954315102865
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in deep learning models come at the price of formidable
training costs. The increasing model size is one of the root causes, but another,
less-emphasized factor is that data scale is growing at a similar pace as model
scale, and the training cost is proportional to both. Compared to the rapidly
evolving model architectures, how to efficiently use the training data (especially
for expensive foundation model pretraining) is both less explored and difficult to
realize due to the lack of a convenient framework that focuses on data efficiency
capabilities. To this end, we present DeepSpeed Data Efficiency, a framework that
makes better use of data, increases training efficiency, and improves model quality.
Specifically, we propose and combine two data efficiency techniques: efficient data
sampling via a general curriculum learning library, and efficient data routing via a
novel random layerwise token dropping technique. For GPT-3 1.3B language model
pretraining, our work achieves 12.5x less data/time/cost ($3.7K if rented on Azure)
while still maintaining 95% of the model quality of the baseline with full data and
cost ($46.3K). For GPT-3 1.3B and BERT-large pretraining, our work can also achieve
the same model quality with up to 2x less data/time/cost, or achieve better model
quality under the same data/time/cost. DeepSpeed Data Efficiency is easy to use and
tune, enabling us to easily apply it and verify its benefits on additional tasks,
including GPT-3 MoE model pretraining and small-scale GPT-2/ViT finetuning.
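To make the data routing idea concrete, below is a minimal PyTorch sketch of random layerwise token dropping: during training, each transformer layer processes only a random subset of token positions, while the remaining tokens bypass the layer unchanged. This is an illustration under stated assumptions (a generic layer that maps (batch, seq, hidden) tensors, a fixed keep ratio), not DeepSpeed's actual implementation or API.

```python
# Minimal sketch of random layerwise token dropping (illustration only, not
# DeepSpeed's implementation; module name, keep ratio, and gather/scatter
# scheme are assumptions).
import torch
import torch.nn as nn


class RandomTokenDropLayer(nn.Module):
    """Wraps a transformer layer so that, during training, only a random
    subset of token positions is routed through it; dropped tokens bypass
    the layer unchanged."""

    def __init__(self, layer: nn.Module, keep_ratio: float = 0.5):
        super().__init__()
        self.layer = layer              # any module mapping (B, T, D) -> (B, T, D)
        self.keep_ratio = keep_ratio    # fraction of tokens processed by the layer

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        if not self.training or self.keep_ratio >= 1.0:
            return self.layer(hidden_states)        # full sequence at eval time

        batch, seq_len, dim = hidden_states.shape
        num_keep = max(1, int(seq_len * self.keep_ratio))

        # Sample a random subset of positions independently per sequence.
        scores = torch.rand(batch, seq_len, device=hidden_states.device)
        keep_idx = scores.topk(num_keep, dim=1).indices             # (B, num_keep)
        gather_idx = keep_idx.unsqueeze(-1).expand(-1, -1, dim)     # (B, num_keep, D)

        # Route only the kept tokens through the (expensive) layer.
        kept = torch.gather(hidden_states, 1, gather_idx)
        kept = self.layer(kept)

        # Scatter the updated tokens back; the rest skip this layer.
        output = hidden_states.clone()
        output.scatter_(1, gather_idx, kept)
        return output
```

In the framework itself, the fraction of kept tokens follows a schedule over training and full sequences are used at evaluation time; the sketch fixes the ratio for brevity.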
Related papers
- Efficient Federated Learning Using Dynamic Update and Adaptive Pruning with Momentum on Shared Server Data [59.6985168241067]
Federated Learning (FL) encounters two important problems, i.e., low training efficiency and limited computational resources.
We propose a new FL framework, FedDUMAP, to leverage the shared insensitive data on the server and the distributed data in edge devices.
Our proposed FL model, FedDUMAP, combines these three original techniques and achieves significantly better performance than baseline approaches.
arXiv Detail & Related papers (2024-08-11T02:59:11Z)
- AutoScale: Automatic Prediction of Compute-optimal Data Composition for Training LLMs [61.13296177652599]
This paper demonstrates that the optimal composition of training data from different domains is scale-dependent.
We introduce AutoScale, a novel, practical approach for optimizing data compositions at potentially large training data scales.
Our evaluation on GPT-2 Large and BERT pre-training demonstrates AutoScale's effectiveness in improving training convergence and downstream performance.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
- Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling [27.975832264345772]
We propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web.
We show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x.
At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answering accuracy across 13 tasks by more than 2%.
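As a rough sketch of this rephrase-augmentation recipe, the loop below pairs each raw web document with a paraphrase produced by an instruction-tuned model; the prompt wording and the `generate` callable are illustrative assumptions, not the paper's exact setup.

```python
# Rough sketch of WRAP-style rephrase augmentation. The prompt wording and
# the `generate` callable are illustrative assumptions, not the paper's setup.
from typing import Callable, Iterable, Iterator

PARAPHRASE_PROMPT = (
    "Rewrite the following web text in clear, high-quality English, "
    "preserving all factual content:\n\n{document}\n\nRewritten text:"
)


def rephrase_corpus(
    documents: Iterable[str],
    generate: Callable[[str], str],  # any instruction-tuned LLM, wrapped as prompt -> completion
) -> Iterator[str]:
    """Yield each raw web document together with its paraphrase, so the
    pretraining stream mixes real and rephrased text."""
    for doc in documents:
        yield doc                                                # original, noisy text
        yield generate(PARAPHRASE_PROMPT.format(document=doc))   # cleaner paraphrase
```

Pretraining then proceeds on the mixed stream of raw and rephrased text.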
arXiv Detail & Related papers (2024-01-29T18:19:08Z)
- Effective pruning of web-scale datasets based on complexity of concept clusters [48.125618324485195]
We present a method for pruning large-scale multimodal datasets for training CLIP-style models on ImageNet.
We find that training on a smaller set of high-quality data can lead to higher performance with significantly lower training costs.
We achieve a new state-of-the-art ImageNet zero-shot accuracy and a competitive average zero-shot accuracy on 38 evaluation tasks.
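One way to picture this kind of pruning is to cluster example embeddings into concepts and keep more samples from diverse clusters than from tight, redundant ones. The sketch below uses k-means and a mean-distance-to-centroid complexity proxy, which are assumptions for illustration rather than the paper's exact criterion.

```python
# Hedged sketch of pruning by concept-cluster complexity: cluster example
# embeddings, then keep more samples from diverse clusters and fewer from
# tight, redundant ones. The complexity proxy and keep-rate rule are
# assumptions for illustration, not the paper's exact criterion.
import numpy as np
from sklearn.cluster import KMeans


def prune_by_cluster_complexity(
    embeddings: np.ndarray,      # (N, D) CLIP-style embeddings of the examples
    n_clusters: int = 1000,
    base_keep: float = 0.3,      # minimum fraction kept in the simplest clusters
) -> np.ndarray:
    """Return sorted indices of the examples to keep."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    labels, centers = km.labels_, km.cluster_centers_
    members = [np.where(labels == c)[0] for c in range(n_clusters)]

    # Complexity proxy: mean distance of a cluster's members to its centroid.
    complexity = np.zeros(n_clusters)
    for c, idx in enumerate(members):
        if idx.size:
            complexity[c] = np.linalg.norm(embeddings[idx] - centers[c], axis=1).mean()
    complexity /= complexity.max() + 1e-12   # normalize to [0, 1]

    keep = []
    for c, idx in enumerate(members):
        if idx.size == 0:
            continue
        # Simple (low-complexity) clusters are pruned harder than diverse ones.
        keep_frac = base_keep + (1.0 - base_keep) * complexity[c]
        n_keep = max(1, int(round(idx.size * keep_frac)))
        dists = np.linalg.norm(embeddings[idx] - centers[c], axis=1)
        keep.extend(idx[np.argsort(-dists)[:n_keep]].tolist())  # keep the least redundant
    return np.asarray(sorted(keep))
```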
arXiv Detail & Related papers (2024-01-09T14:32:24Z)
- Efficient Grammatical Error Correction Via Multi-Task Training and Optimized Training Schedule [55.08778142798106]
We propose auxiliary tasks that exploit the alignment between the original and corrected sentences.
We formulate each task as a sequence-to-sequence problem and perform multi-task training.
We find that the order of datasets used for training and even individual instances within a dataset may have important effects on the final performance.
arXiv Detail & Related papers (2023-11-20T14:50:12Z)
- D4: Improving LLM Pretraining via Document De-Duplication and Diversification [38.84592304799403]
We show that careful data selection via pre-trained model embeddings can speed up training.
We also show that intelligently repeating data consistently outperforms baseline training.
arXiv Detail & Related papers (2023-08-23T17:58:14Z)
- Recommendation Unlearning via Influence Function [42.4931807753579]
We propose a new Influence Function-based Recommendation Unlearning (IFRU) framework, which efficiently updates the model without retraining.
IFRU achieves more than 250x acceleration over retraining-based methods, with recommendation performance comparable to full retraining.
arXiv Detail & Related papers (2023-07-05T09:42:51Z)
- Knowledge Distillation as Efficient Pre-training: Faster Convergence, Higher Data-efficiency, and Better Transferability [53.27240222619834]
Knowledge Distillation as Efficient Pre-training aims to efficiently transfer the learned feature representation from pre-trained models to new student models for future downstream tasks.
Our method performs comparably with supervised pre-training counterparts on 3 downstream tasks and 9 downstream datasets, while requiring 10x less data and 5x less pre-training time.
arXiv Detail & Related papers (2022-03-10T06:23:41Z)
This list is automatically generated from the titles and abstracts of the papers on this site.