No Need to Talk: Asynchronous Mixture of Language Models
- URL: http://arxiv.org/abs/2410.03529v1
- Date: Fri, 4 Oct 2024 15:50:10 GMT
- Title: No Need to Talk: Asynchronous Mixture of Language Models
- Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
- Abstract summary: SmallTalk LM is an innovative method for training a mixture of language models in an almost asynchronous manner.
We show that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost.
- Score: 25.3581396758015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on $75\%$ of the tasks.
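The abstract describes the inference path in words: a lightweight router reads only a short prefix and dispatches the full sequence to exactly one expert, so the active parameter count stays close to that of a single expert. The sketch below is a minimal illustration of that routing scheme under assumed Hugging Face-style tokenizer/generate interfaces; the `router`, `experts`, and `PREFIX_LEN` names are hypothetical stand-ins, not the paper's implementation.

```python
import torch

# Minimal sketch of prefix-based routing to a single expert LM.
# Assumptions: `router` is a trained classifier over expert indices,
# `experts` is a list of independently trained language models, and the
# tokenizer/generate interfaces follow Hugging Face conventions.
PREFIX_LEN = 16  # assumed length of the routing prefix

@torch.no_grad()
def generate_with_routing(router, experts, tokenizer, prompt, max_new_tokens=64):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    prefix = ids[:, :PREFIX_LEN]                     # the router sees only a short prefix
    expert_id = int(router(prefix).argmax(dim=-1))   # choose exactly one expert
    expert = experts[expert_id]                      # only this expert runs at inference
    out = expert.generate(ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Since only one expert executes per sequence, inference cost tracks a single expert rather than the whole mixture, which is the property the abstract emphasizes.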
Related papers
- ModelMix: A New Model-Mixup Strategy to Minimize Vicinal Risk across Tasks for Few-scribble based Cardiac Segmentation [32.19827368497988]
We introduce a new approach to few-scribble supervised segmentation based on mixing model parameters, termed ModelMix.
ModelMix constructs virtual models using convex combinations of convolutional parameters from separate encoders.
We then regularize the model set to minimize vicinal risk across tasks in both an unsupervised and a scribble-supervised way.
arXiv Detail & Related papers (2024-06-19T05:58:11Z)
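The ModelMix entry above centers on convex combinations of parameters from separate encoders. The snippet below is only a generic sketch of that idea, interpolating two same-architecture parameter sets with a uniformly drawn coefficient; the function name and the way the coefficient is sampled are assumptions, and the paper's vicinal-risk regularization is not shown.

```python
import copy
import random

def convex_combine(model_a, model_b, lam=None):
    """Build a 'virtual model' whose weights are lam*A + (1-lam)*B.

    Generic illustration of convex parameter combination between two
    same-architecture PyTorch models; not the paper's exact procedure.
    """
    lam = random.uniform(0.0, 1.0) if lam is None else lam
    virtual = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    mixed = {k: lam * state_a[k] + (1.0 - lam) * state_b[k] for k in state_a}
    virtual.load_state_dict(mixed)
    return virtual
```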
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Scaling Laws for Generative Mixed-Modal Language Models [103.25737824352949]
We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them.
Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws.
We also report four empirical phenomena observed during training, such as emergent coordinate-ascent style training that naturally alternates between modalities.
arXiv Detail & Related papers (2023-01-10T00:20:06Z)
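The entry above describes the modality interaction as an additive term on top of uni-modal scaling laws. The LaTeX below is only an illustrative template of that structure, using a generic Chinchilla-style uni-modal law; the symbols and the exact functional form are assumptions, not the paper's parameterization.

```latex
% Illustrative template only (not the paper's exact functional form).
% Generic Chinchilla-style uni-modal law for modality i:
\mathcal{L}_i(N, D_i) = E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{D_i^{\beta_i}}
% Mixed-modal loss sketched as uni-modal contributions plus an additive
% interaction term C_{i,j} capturing synergy (< 0) or competition (> 0):
\mathcal{L}_{i,j}(N, D_i, D_j) \approx
  \tfrac{1}{2}\big(\mathcal{L}_i(N, D_i) + \mathcal{L}_j(N, D_j)\big)
  + C_{i,j}(N, D_i + D_j)
```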
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
Since the fine-tuned models are often available while their training data is not, this creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
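The entry above (Dataless Knowledge Fusion by Merging Weights of Language Models) merges fine-tuned models directly in parameter space, with no training data required. As a minimal baseline illustration of parameter-space merging, the sketch below simply takes a weighted average of same-architecture checkpoints; the paper's actual merging rule is more involved than a plain average.

```python
import copy

def merge_in_parameter_space(models, weights=None):
    """Weighted average of same-architecture PyTorch model parameters.

    Plain averaging is only a baseline illustration of parameter-space
    merging without data; it is not the paper's specific method.
    """
    n = len(models)
    weights = [1.0 / n] * n if weights is None else weights
    states = [m.state_dict() for m in models]
    merged_state = {
        key: sum(w * s[key] for w, s in zip(weights, states))
        for key in states[0]
    }
    merged = copy.deepcopy(models[0])
    merged.load_state_dict(merged_state)
    return merged
```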
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
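The SESoM entry above relies on weighting each source model differently for every target sample when ensembling their outputs. The sketch below shows one generic way to realize that: a small scorer maps a sample representation to softmax weights over the source-model logits. The scorer and its inputs are hypothetical stand-ins, not the architecture from the paper.

```python
import torch.nn.functional as F

def sample_specific_ensemble(source_logits, sample_embedding, scorer):
    """Combine source-model outputs with weights computed per target sample.

    source_logits:    tensor of shape (num_sources, num_classes)
    sample_embedding: tensor of shape (hidden_dim,) for the current sample
    scorer:           any module mapping the embedding to num_sources scores
                      (a hypothetical stand-in for how sources are scored)
    """
    weights = F.softmax(scorer(sample_embedding), dim=-1)       # (num_sources,)
    return (weights.unsqueeze(-1) * source_logits).sum(dim=0)   # (num_classes,)
```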
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models [35.644196343835674]
We propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities from word, phrase to sentence in both texts and images to pre-train the model in stages.
We also design several different pre-training tasks suited to the information granularity of each stage, in order to efficiently capture diverse knowledge from a limited corpus.
Experimental results show that our method achieves performance comparable to the original LXMERT model on all downstream tasks, and even outperforms the original model on the Image-Text Retrieval task.
arXiv Detail & Related papers (2021-07-22T03:35:27Z)
- Early Stage LM Integration Using Local and Global Log-Linear Combination [46.91755970827846]
Sequence-to-sequence models with an implicit alignment mechanism (e.g., attention) are closing the performance gap to traditional hybrid hidden Markov model (HMM) systems.
One important factor to improve word error rate in both cases is the use of an external language model (LM) trained on large text-only corpora.
We present a novel method for language model integration into implicit-alignment based sequence-to-sequence models.
arXiv Detail & Related papers (2020-05-20T13:49:55Z)
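The last entry integrates an external language model into a sequence-to-sequence model through a log-linear combination. As a rough illustration, the snippet below applies a shallow-fusion-style global combination, adding a scaled external-LM log-probability to the sequence-to-sequence score; the weight `lam` and the function name are assumptions, and the paper's local and global variants differ in where the combination is applied.

```python
import math

def log_linear_score(seq2seq_logprob, lm_logprob, lam=0.3):
    """Combine scores log-linearly: score = log p_s2s + lam * log p_LM.

    Generic shallow-fusion-style scoring used purely as an illustration;
    `lam` is a hypothetical interpolation weight tuned on held-out data.
    """
    return seq2seq_logprob + lam * lm_logprob

# Toy usage: rescore a hypothesis whose seq2seq prob is 0.42 and LM prob is 0.10.
combined = log_linear_score(math.log(0.42), math.log(0.10))
```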