DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
- URL: http://arxiv.org/abs/2305.10429v4
- Date: Tue, 21 Nov 2023 02:01:53 GMT
- Title: DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
- Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng
Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu
- Abstract summary: We propose Domain Reweighting with Minimax Optimization (DoReMi).
DoReMi first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights.
We then resample a dataset with these domain weights and train a larger, full-sized model.
- Score: 148.90031913522648
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mixture proportions of pretraining data domains (e.g., Wikipedia, books,
web text) greatly affect language model (LM) performance. In this paper, we
propose Domain Reweighting with Minimax Optimization (DoReMi), which first
trains a small proxy model using group distributionally robust optimization
(Group DRO) over domains to produce domain weights (mixture proportions)
without knowledge of downstream tasks. We then resample a dataset with these
domain weights and train a larger, full-sized model. In our experiments, we use
DoReMi on a 280M-parameter proxy model to set the domain weights for training
an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi
improves perplexity across all domains, even when it downweights a domain.
DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a
baseline model trained using The Pile's default domain weights and reaches the
baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi,
which has no knowledge of downstream tasks, even matches the performance of
using domain weights tuned on downstream tasks.
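The reweighting-then-resampling idea in the abstract can be illustrated with a minimal sketch. It assumes an exponentiated-gradient update on each domain's clipped excess loss (proxy loss minus the loss of a separately trained reference model), mixed with a uniform distribution for stability. The function names, the `step_size` and `smoothing` values, and the fixed loss arrays are illustrative assumptions, not the authors' code or settings.

```python
import numpy as np


def update_domain_weights(weights, proxy_loss, reference_loss,
                          step_size=1.0, smoothing=1e-3):
    """One Group-DRO-style step: upweight domains with high excess loss.

    All arguments are 1-D arrays with one entry per domain. The default
    step_size and smoothing values are assumptions for illustration.
    """
    # Excess loss: how much worse the proxy is than the reference, clipped at 0.
    excess = np.maximum(proxy_loss - reference_loss, 0.0)
    # Exponentiated-gradient ascent, done in log space for numerical stability.
    logits = np.log(weights) + step_size * excess
    logits -= logits.max()
    new_weights = np.exp(logits)
    new_weights /= new_weights.sum()
    # Mix with the uniform distribution so no domain weight collapses to zero.
    uniform = np.full_like(new_weights, 1.0 / new_weights.size)
    return (1.0 - smoothing) * new_weights + smoothing * uniform


def resample_domains(weights, n_samples, rng):
    """Draw the domain index of each training example from the mixture."""
    return rng.choice(len(weights), size=n_samples, p=weights)


# Toy run with three hypothetical domains and fixed, made-up losses.
rng = np.random.default_rng(0)
weights = np.full(3, 1.0 / 3.0)
reference_loss = np.array([2.0, 3.0, 2.5])
proxy_loss = np.array([2.1, 3.8, 2.4])

history = []
for _ in range(100):
    weights = update_domain_weights(weights, proxy_loss, reference_loss,
                                    step_size=0.1)
    history.append(weights)

# Average the weights over steps to get the final mixture proportions.
final_weights = np.mean(history, axis=0)
domain_of_example = resample_domains(final_weights, 10_000, rng)
print(final_weights, np.bincount(domain_of_example, minlength=3))
```

In a real run the per-domain losses would come from the small proxy model as it trains, so they change at every step; here they are held fixed only to keep the example self-contained. The larger model would then be trained on data drawn according to `final_weights`.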
Related papers
- DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models.
We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations.
Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
- Does your data spark joy? Performance gains from domain upsampling at the end of training [16.572129046599937]
It is expensive to understand the impact of domain-specific datasets on model capabilities, because doing so requires training at large FLOP scales.
We use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks.
This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
arXiv Detail & Related papers (2024-06-05T17:29:15Z)
- ConvLoRA and AdaBN based Domain Adaptation via Self-Training [4.006331916849688]
We propose Convolutional Low-Rank Adaptation (ConvLoRA) for multi-target domain adaptation.
ConvLoRA freezes pre-trained model weights, adds trainable low-rank decomposition matrices to convolutional layers, and backpropagates the gradient.
Our method has fewer trainable parameters and performs better than or on par with large independent fine-tuned networks.
arXiv Detail & Related papers (2024-02-07T15:43:50Z)
- DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE).
In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture.
DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
arXiv Detail & Related papers (2023-10-23T22:51:58Z)
- AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models [127.04370753583261]
Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains.
A solution is to use a related-domain adapter for the novel domain at test time.
We introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains.
arXiv Detail & Related papers (2023-02-14T13:09:23Z)
- Evaluating Parameter Efficient Learning for Generation [32.52577462253145]
We present comparisons between parameter-efficient learning methods (PERMs) and finetuning from three new perspectives.
Our results show that for in-domain settings (a) there is a cross point of sample size below which PERMs perform better than finetuning, and (b) larger PLMs have larger cross points.
We also compare the faithfulness of generations and show that PERMs can achieve a better faithfulness score than finetuning, especially for small training sets, by as much as 6%.
arXiv Detail & Related papers (2022-10-25T00:14:48Z)
- Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization [63.28279815753543]
In Domain Generalization (DG) settings, models trained on a given set of training domains have notoriously chaotic performance on shifted test domains.
We first show that a simple protocol for averaging model parameters along the optimization path, starting early during training, significantly boosts domain generalization.
We show that an ensemble of independently trained models also exhibits chaotic behavior in the DG setting.
arXiv Detail & Related papers (2021-10-21T00:08:17Z)
- Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training [67.71228426496013]
We show that using target domain data during pre-training leads to large performance improvements across a variety of setups.
We find that pre-training on multiple domains improves generalization performance on domains not seen during training.
arXiv Detail & Related papers (2021-04-02T12:53:15Z)
- Hybrid Generative-Retrieval Transformers for Dialogue Domain Adaptation [77.62366712130196]
We present the winning entry at the fast domain adaptation task of DSTC8, a hybrid generative-retrieval model based on GPT-2 fine-tuned to the multi-domain MetaLWOz dataset.
Our model uses retrieval logic as a fallback, being SoTA on MetaLWOz in human evaluation (>4% improvement over the 2nd place system) and attaining competitive generalization performance in adaptation to the unseen MultiWOZ dataset.
arXiv Detail & Related papers (2020-03-03T18:07:42Z)