Related papers: DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

URL: http://arxiv.org/abs/2305.10429v4
Date: Tue, 21 Nov 2023 02:01:53 GMT
Title: DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining
Authors: Sang Michael Xie, Hieu Pham, Xuanyi Dong, Nan Du, Hanxiao Liu, Yifeng Lu, Percy Liang, Quoc V. Le, Tengyu Ma, Adams Wei Yu
Abstract summary: We propose Domain Reweighting with Minimax Optimization (DoReMi). DoReMi first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights. We then resample a dataset with these domain weights and train a larger, full-sized model.
Score: 148.90031913522648
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5 percentage points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.
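
The two steps named in the abstract (produce domain weights, then resample a dataset with them) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the domain names, the loss values, and the `step_size` and `smoothing` parameters are all hypothetical, and the weight update is a generic exponentiated-gradient step of the kind often used with Group DRO, which the abstract itself does not spell out.

```python
import math
import random


def update_domain_weights(weights, excess_losses, step_size=1.0, smoothing=1e-3):
    """One exponentiated-gradient step over domain weights (illustrative only).

    weights and excess_losses are dicts keyed by domain name. A domain with a
    larger positive excess loss receives a larger share of the mixture.
    """
    # Clip at zero so a domain is never upweighted for a negative excess loss.
    scaled = {
        d: w * math.exp(step_size * max(excess_losses[d], 0.0))
        for d, w in weights.items()
    }
    total = sum(scaled.values())
    uniform = 1.0 / len(weights)
    # Renormalize, then mix with the uniform distribution so that no weight
    # collapses to exactly zero.
    return {
        d: (1.0 - smoothing) * s / total + smoothing * uniform
        for d, s in scaled.items()
    }


def resample_by_domain(examples_by_domain, weights, num_samples, seed=0):
    """Draw (domain, example) pairs with domains chosen in proportion to weights."""
    rng = random.Random(seed)
    domains = list(examples_by_domain)
    picked = rng.choices(domains, weights=[weights[d] for d in domains], k=num_samples)
    return [(d, rng.choice(examples_by_domain[d])) for d in picked]


# Hypothetical starting mixture and per-domain excess losses.
initial = {"web": 0.5, "books": 0.3, "wiki": 0.2}
updated = update_domain_weights(initial, {"web": 0.0, "books": 0.4, "wiki": -0.1})
corpus = {"web": ["w1", "w2"], "books": ["b1"], "wiki": ["k1"]}
sample = resample_by_domain(corpus, updated, num_samples=10)
```

In this sketch the resampled list would feed the training of the larger model, while the weight update would be repeated over many proxy-model training steps; both the loop and the models are omitted here.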

Related papers

Commute Your Domains: Trajectory Optimality Criterion for Multi-Domain Learning [50.80758278865274]
In multi-domain learning, a single model is trained on diverse data domains to leverage shared knowledge and improve generalization. The order in which the data from these domains is used for training can significantly affect the model's performance on each domain. We investigate the influence of training order (or data mixing) in multi-domain learning using the concept of the Lie bracket of gradient vector fields.
arXiv Detail & Related papers (2025-01-26T15:12:06Z)
DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining [2.1534028009401713]
Large Language Models (LLMs) have shown an ability to generalize effectively across numerous industry domains. However, LLMs exhibit limitations when tasked with performing in specialized or low-resource industry domains. In this work, we propose an automated and scalable framework, DoPAMine: Domain-specific Pre-training Adaptation from seed-guided data Mining.
arXiv Detail & Related papers (2024-09-30T22:15:58Z)
Re-Mix: Optimizing Data Mixtures for Large Scale Imitation Learning [25.359270812682155]
We investigate how to weigh different subsets or "domains" of robotics datasets for robot foundation model pre-training. Our method, Re-Mix, addresses the wide range of challenges that arise when applying DRO to robotics datasets.
arXiv Detail & Related papers (2024-08-26T06:14:25Z)
AutoScale: Scale-Aware Data Mixing for Pre-Training LLMs [61.13296177652599]
We show that data mixtures that perform well at smaller scales may not retain their advantage at larger scales. We propose AutoScale, a two-stage, scale-aware data composition framework.
arXiv Detail & Related papers (2024-07-29T17:06:30Z)
DataComp-LM: In search of the next generation of training sets for language models [200.5293181577585]
DataComp for Language Models (DCLM) is a testbed for controlled dataset experiments with the goal of improving language models. We provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters.
arXiv Detail & Related papers (2024-06-17T17:42:57Z)
Does your data spark joy? Performance gains from domain upsampling at the end of training [16.572129046599937]
It is expensive to understand the impact of domain-specific datasets on training at large FLOP scales. We use domain upsampling to characterize at scale the utility of individual datasets for improving various benchmarks. This tool opens up the ability to experiment with the impact of different pretraining datasets at scale, but at an order of magnitude lower cost compared to full pretraining runs.
arXiv Detail & Related papers (2024-06-05T17:29:15Z)
DoGE: Domain Reweighting with Generalization Estimation [42.32000165235568]
We propose DOmain reweighting with Generalization Estimation (DoGE). In our experiments, we extensively show how DoGE improves the generalization of the base model to any target data mixture. DoGE can effectively identify inter-domain dependencies, and consistently achieves better test perplexity on the target domain.
arXiv Detail & Related papers (2023-10-23T22:51:58Z)
AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models [127.04370753583261]
Pretrained language models (PLMs) are trained on massive corpora, but often need to specialize to specific domains. A solution is to use a related-domain adapter for the novel domain at test time. We introduce AdapterSoup, an approach that performs weight-space averaging of adapters trained on different domains.
arXiv Detail & Related papers (2023-02-14T13:09:23Z)
Ensemble of Averages: Improving Model Selection and Boosting Performance in Domain Generalization [63.28279815753543]
In Domain Generalization (DG) settings, models trained on a given set of training domains have notoriously chaotic performance on shifted test domains. We first show that a simple protocol for averaging model parameters along the optimization path, starting early during training, significantly boosts domain generalization. We show that an ensemble of independently trained models also has chaotic behavior in the DG setting.
arXiv Detail & Related papers (2021-10-21T00:08:17Z)
Robust wav2vec 2.0: Analyzing Domain Shift in Self-Supervised Pre-Training [67.71228426496013]
We show that using target domain data during pre-training leads to large performance improvements across a variety of setups. We find that pre-training on multiple domains improves generalization performance on domains not seen during training.
arXiv Detail & Related papers (2021-04-02T12:53:15Z)
Hybrid Generative-Retrieval Transformers for Dialogue Domain Adaptation [77.62366712130196]
We present the winning entry at the fast domain adaptation task of DSTC8, a hybrid generative-retrieval model based on GPT-2 fine-tuned to the multi-domain MetaLWOz dataset. Our model uses retrieval logic as a fallback, being SoTA on MetaLWOz in human evaluation (>4% improvement over the 2nd place system) and attaining competitive generalization performance in adaptation to the unseen MultiWOZ dataset.
arXiv Detail & Related papers (2020-03-03T18:07:42Z)

This list is automatically generated from the titles and abstracts of the papers on this site.