Scaling Expert Language Models with Unsupervised Domain Discovery
- URL: http://arxiv.org/abs/2303.14177v1
- Date: Fri, 24 Mar 2023 17:38:58 GMT
- Title: Scaling Expert Language Models with Unsupervised Domain Discovery
- Authors: Suchin Gururangan, Margaret Li, Mike Lewis, Weijia Shi, Tim Althoff,
Noah A. Smith, Luke Zettlemoyer
- Abstract summary: We introduce a simple but effective method to asynchronously train large, sparse language models on arbitrary text corpora.
Our method clusters a corpus into sets of related documents, trains a separate expert language model on each cluster, and combines them in a sparse ensemble for inference.
- Score: 107.08940500543447
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models are typically trained densely: all parameters are
updated with respect to all inputs. This requires synchronization of billions
of parameters across thousands of GPUs. We introduce a simple but effective
method to asynchronously train large, sparse language models on arbitrary text
corpora. Our method clusters a corpus into sets of related documents, trains a
separate expert language model on each cluster, and combines them in a sparse
ensemble for inference. This approach generalizes embarrassingly parallel
training by automatically discovering the domains for each expert, and
eliminates nearly all the communication overhead of existing sparse language
models. Our technique outperforms dense baselines on multiple corpora and
few-shot tasks, and our analysis shows that specializing experts to meaningful
clusters is key to these gains. Performance also improves with the number of
experts and size of training data, suggesting this is a highly efficient and
accessible approach to training large language models.
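As a rough illustration of the pipeline the abstract describes, the sketch below clusters a toy corpus, marks where one expert would be trained per cluster, and computes sparse ensemble weights from distances to the cluster centers. The TF-IDF features, k-means clustering, softmax routing, and the `train_expert` placeholder are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of unsupervised domain discovery plus sparse expert routing.
# Assumptions (not from the paper): TF-IDF + k-means stand in for the paper's
# clustering pipeline, and `train_expert` is a hypothetical placeholder for
# training one language model per cluster.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

corpus = [
    "the stock market rallied after the earnings report",
    "the quarterback threw a touchdown in the final quarter",
    "interest rates were raised by the central bank",
    "the team clinched the playoff spot with a late goal",
]

# 1) Unsupervised domain discovery: embed documents and cluster them.
vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(corpus)
k = 2  # number of experts / discovered domains
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(doc_vectors)

# 2) Train one expert LM per cluster (each expert would be trained
#    independently, with no cross-expert synchronization).
clusters = {c: [corpus[i] for i in np.where(kmeans.labels_ == c)[0]] for c in range(k)}
# experts = {c: train_expert(docs) for c, docs in clusters.items()}  # hypothetical

# 3) Sparse ensemble at inference: weight experts by proximity of the query
#    context to each cluster center, keeping only the top experts.
def routing_weights(text, top_k=1):
    x = vectorizer.transform([text])
    dists = kmeans.transform(x)[0]        # distance to each cluster center
    scores = -dists
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    keep = np.argsort(weights)[-top_k:]   # sparse: zero out all but top_k experts
    sparse = np.zeros_like(weights)
    sparse[keep] = weights[keep] / weights[keep].sum()
    return sparse

print(routing_weights("the bank announced a new rate hike"))
```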
Related papers
- Layer Swapping for Zero-Shot Cross-Lingual Transfer in Large Language Models [12.424072830053445]
We present a model merging methodology that addresses the difficulty of fine-tuning Large Language Models (LLMs) for target tasks in non-English languages.
We fine-tune separate "experts" on math instruction data in English and on generic instruction data in the target language.
We replace the top and bottom transformer layers of the math expert directly with layers from the language expert, which consequently enhances math performance in the target language.
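A minimal sketch of the layer-swapping step described above, using a toy PyTorch encoder: the bottom and top layers of the "math expert" are overwritten with the corresponding layers of the "language expert". The toy architecture, layer counts, and the number of layers swapped are illustrative assumptions.

```python
# Minimal sketch of layer swapping between two fine-tuned copies of the same
# architecture. The toy encoder and swap depths are illustrative assumptions.
import copy
import torch.nn as nn

def make_encoder(num_layers=8, d_model=64, nhead=4):
    layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead, batch_first=True)
    return nn.TransformerEncoder(layer, num_layers=num_layers)

math_expert = make_encoder()      # stands in for the math-fine-tuned model
language_expert = make_encoder()  # stands in for the target-language model

def swap_layers(target, source, n_bottom=2, n_top=2):
    """Copy the first n_bottom and last n_top layers of `source` into `target`."""
    merged = copy.deepcopy(target)
    num_layers = len(merged.layers)
    for i in list(range(n_bottom)) + list(range(num_layers - n_top, num_layers)):
        merged.layers[i].load_state_dict(source.layers[i].state_dict())
    return merged

merged_expert = swap_layers(math_expert, language_expert)
```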
arXiv Detail & Related papers (2024-10-02T08:53:07Z)
- Task-Adaptive Pretrained Language Models via Clustered-Importance Sampling [21.762562172089236]
Instead of relying on the limited domain-specific data alone, we build specialist models from large generalist training sets.
We adjust the training distribution of the generalist data with guidance from the limited domain-specific data.
It is scalable, suitable for pretraining and continued pretraining, and works well in multi-task settings.
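A minimal sketch of the clustered-importance-sampling idea summarized above: cluster the generalist corpus, estimate the domain data's distribution over those clusters, and resample the generalist data accordingly. The TF-IDF features, k-means clustering, and smoothing are illustrative assumptions rather than the paper's exact recipe.

```python
# Minimal sketch: reweight a generalist corpus toward the clusters that the
# small domain-specific set occupies. Features, clustering, and smoothing are
# illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

generalist = [
    "parliament passed the new budget bill",
    "the recipe calls for two cups of flour",
    "researchers sequenced the bacterial genome",
    "the defendant was acquitted on all charges",
    "whisk the eggs before folding in the sugar",
    "the enzyme binds to the receptor site",
]
domain = [  # limited domain-specific data (here: biology-flavoured text)
    "the protein regulates gene expression",
    "mutations in the genome alter the phenotype",
]

# Cluster the generalist corpus.
vec = TfidfVectorizer().fit(generalist + domain)
k = 3
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vec.transform(generalist))

# Estimate the domain data's distribution over the generalist clusters.
domain_clusters = km.predict(vec.transform(domain))
domain_dist = np.bincount(domain_clusters, minlength=k) + 1e-6  # smooth empty clusters
domain_dist /= domain_dist.sum()

# Importance weights: upweight generalist documents whose cluster is
# over-represented in the domain data relative to the generalist data.
gen_dist = np.bincount(km.labels_, minlength=k) / len(generalist)
doc_weights = domain_dist[km.labels_] / gen_dist[km.labels_]
sampling_probs = doc_weights / doc_weights.sum()

# Draw a training sample from the generalist set under the adjusted distribution.
sample_idx = np.random.default_rng(0).choice(len(generalist), size=4, p=sampling_probs)
```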
arXiv Detail & Related papers (2024-09-30T20:49:54Z)
- Language Models for Text Classification: Is In-Context Learning Enough? [54.869097980761595]
Recent foundational language models have shown state-of-the-art performance in many NLP tasks in zero- and few-shot settings.
An advantage of these models over more standard approaches is their ability to understand instructions written in natural language (prompts).
This makes them suitable for addressing text classification problems for domains with limited amounts of annotated instances.
arXiv Detail & Related papers (2024-03-26T12:47:39Z)
- TAMS: Translation-Assisted Morphological Segmentation [3.666125285899499]
We present a sequence-to-sequence model for canonical morpheme segmentation.
Our model outperforms the baseline in a super-low resource setting but yields mixed results on training splits with more data.
While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.
arXiv Detail & Related papers (2024-03-21T21:23:35Z)
- Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval [109.62363167257664]
We propose a generative model for learning multilingual text embeddings.
Our model operates on parallel data in $N$ languages.
We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval.
arXiv Detail & Related papers (2022-12-21T02:41:40Z)
- Probing Structured Pruning on Multilingual Pre-trained Models: Settings, Algorithms, and Efficiency [62.0887259003594]
This work investigates three aspects of structured pruning on multilingual pre-trained language models: settings, algorithms, and efficiency.
Experiments on nine downstream tasks show several counter-intuitive phenomena.
We present Dynamic Sparsification, a simple approach that allows training the model once and adapting to different model sizes at inference.
arXiv Detail & Related papers (2022-04-06T06:29:52Z)
- Multitask Prompted Training Enables Zero-Shot Task Generalization [70.12770442071657]
We develop a system for mapping general natural language tasks into a human-readable prompted form.
We fine-tune a pretrained encoder-decoder model on this multitask mixture covering a wide variety of tasks.
The model attains strong zero-shot performance on several standard datasets, often outperforming models 16x its size.
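A minimal sketch of mapping task instances into human-readable prompts, in the spirit of the multitask prompted training summarized above; the template wording and toy examples are illustrative assumptions, not the paper's released prompt collection.

```python
# Minimal sketch of converting task instances into (prompt, target) pairs for
# a multitask mixture. Templates and examples are illustrative assumptions.
def nli_prompt(premise, hypothesis, label):
    source = f'Suppose "{premise}" Can we infer that "{hypothesis}"? Yes, no, or maybe?'
    return source, label

def summarization_prompt(article, summary):
    source = f"Summarize the following article in one sentence: {article}"
    return source, summary

# (source, target) pairs from many different tasks are pooled into one
# multitask mixture used to fine-tune a pretrained encoder-decoder model.
mixture = [
    nli_prompt("A dog runs through the snow.", "An animal is outside.", "yes"),
    summarization_prompt("The council voted to expand the city park by ten acres.",
                         "The city park will grow by ten acres."),
]
```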
arXiv Detail & Related papers (2021-10-15T17:08:57Z)
- Pre-training Universal Language Representation [46.51685959045527]
This work introduces universal language representation learning, i.e., embedding linguistic units of different levels, and texts of diverse lengths, in a uniform vector space.
We empirically verify that a well-designed pre-training scheme can effectively yield universal language representations.
arXiv Detail & Related papers (2021-05-30T09:29:01Z)
- Beyond English-Centric Multilingual Machine Translation [74.21727842163068]
We create a true Many-to-Many multilingual translation model that can translate directly between any pair of 100 languages.
We build and open source a training dataset that covers thousands of language directions with supervised data, created through large-scale mining.
Our focus on non-English-centric models brings gains of more than 10 BLEU when translating directly between non-English directions, while performing competitively with the best single systems of WMT.
arXiv Detail & Related papers (2020-10-21T17:01:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences.