No Need to Talk: Asynchronous Mixture of Language Models
- URL: http://arxiv.org/abs/2410.03529v1
- Date: Fri, 4 Oct 2024 15:50:10 GMT
- Title: No Need to Talk: Asynchronous Mixture of Language Models
- Authors: Anastasiia Filippova, Angelos Katharopoulos, David Grangier, Ronan Collobert
- Abstract summary: SmallTalk LM is an innovative method for training a mixture of language models in an almost asynchronous manner.
We show that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost.
- Score: 25.3581396758015
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce SmallTalk LM, an innovative method for training a mixture of language models in an almost asynchronous manner. Each model of the mixture specializes in distinct parts of the data distribution, without the need for high-bandwidth communication between the nodes training each model. At inference, a lightweight router directs a given sequence to a single expert, according to a short prefix. This inference scheme naturally uses a fraction of the parameters from the overall mixture model. Our experiments on language modeling demonstrate that SmallTalk LM achieves significantly lower perplexity than dense model baselines for the same total training FLOPs and an almost identical inference cost. Finally, in our downstream evaluations we outperform the dense baseline on $75\%$ of the tasks.
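The abstract describes the inference path in words: a lightweight router reads only a short prefix and dispatches the full sequence to exactly one expert, so the active parameter count stays close to that of a single expert. The sketch below is a minimal illustration of that routing scheme under assumed Hugging Face-style tokenizer/generate interfaces; the `router`, `experts`, and `PREFIX_LEN` names are hypothetical stand-ins, not the paper's implementation.

```python
import torch

# Minimal sketch of prefix-based routing to a single expert LM.
# Assumptions: `router` is a trained classifier over expert indices,
# `experts` is a list of independently trained language models, and the
# tokenizer/generate interfaces follow Hugging Face conventions.
PREFIX_LEN = 16  # assumed length of the routing prefix

@torch.no_grad()
def generate_with_routing(router, experts, tokenizer, prompt, max_new_tokens=64):
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    prefix = ids[:, :PREFIX_LEN]                     # the router sees only a short prefix
    expert_id = int(router(prefix).argmax(dim=-1))   # choose exactly one expert
    expert = experts[expert_id]                      # only this expert runs at inference
    out = expert.generate(ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(out[0], skip_special_tokens=True)
```

Since only one expert executes per sequence, inference cost tracks a single expert rather than the whole mixture, which is the property the abstract emphasizes.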
Related papers
- ModelMix: A New Model-Mixup Strategy to Minimize Vicinal Risk across Tasks for Few-scribble based Cardiac Segmentation [32.19827368497988]
We introduce a new approach to few-scribble supervised segmentation based on mixing model parameters, termed ModelMix.
ModelMix constructs virtual models using convex combinations of convolutional parameters from separate encoders.
We then regularize the model set to minimize vicinal risk across tasks in both an unsupervised and a scribble-supervised way.
arXiv Detail & Related papers (2024-06-19T05:58:11Z)
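The ModelMix entry above centers on convex combinations of parameters from separate encoders. The snippet below is only a generic sketch of that idea, interpolating two same-architecture parameter sets with a uniformly drawn coefficient; the function name and the way the coefficient is sampled are assumptions, and the paper's vicinal-risk regularization is not shown.

```python
import copy
import random

def convex_combine(model_a, model_b, lam=None):
    """Build a 'virtual model' whose weights are lam*A + (1-lam)*B.

    Generic illustration of convex parameter combination between two
    same-architecture PyTorch models; not the paper's exact procedure.
    """
    lam = random.uniform(0.0, 1.0) if lam is None else lam
    virtual = copy.deepcopy(model_a)
    state_a, state_b = model_a.state_dict(), model_b.state_dict()
    mixed = {k: lam * state_a[k] + (1.0 - lam) * state_b[k] for k in state_a}
    virtual.load_state_dict(mixed)
    return virtual
```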
- AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling [115.89786751297348]
We introduce AnyGPT, an any-to-any multimodal language model that utilizes discrete representations for the unified processing of various modalities.
We build a multimodal text-centric dataset for multimodal alignment pre-training.
We show that AnyGPT is capable of facilitating any-to-any multimodal conversation while achieving performance comparable to specialized models across all modalities.
arXiv Detail & Related papers (2024-02-19T15:33:10Z)
- Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models [125.91897197446379]
We find that MoE models benefit more from instruction tuning than dense models.
Our most powerful model, FLAN-MOE-32B, surpasses the performance of FLAN-PALM-62B on four benchmark tasks.
arXiv Detail & Related papers (2023-05-24T04:22:26Z)
- Scaling Laws for Generative Mixed-Modal Language Models [103.25737824352949]
We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them.
Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws.
We also report four empirical phenomena observed during training, such as emergent coordinate-ascent style training that naturally alternates between modalities.
arXiv Detail & Related papers (2023-01-10T00:20:06Z)
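The entry above describes the modality interaction as an additive term on top of uni-modal scaling laws. The LaTeX below is only an illustrative template of that structure, using a generic Chinchilla-style uni-modal law; the symbols and the exact functional form are assumptions, not the paper's parameterization.

```latex
% Illustrative template only (not the paper's exact functional form).
% Generic Chinchilla-style uni-modal law for modality i:
\mathcal{L}_i(N, D_i) = E_i + \frac{A_i}{N^{\alpha_i}} + \frac{B_i}{D_i^{\beta_i}}
% Mixed-modal loss sketched as uni-modal contributions plus an additive
% interaction term C_{i,j} capturing synergy (< 0) or competition (> 0):
\mathcal{L}_{i,j}(N, D_i, D_j) \approx
  \tfrac{1}{2}\big(\mathcal{L}_i(N, D_i) + \mathcal{L}_j(N, D_j)\big)
  + C_{i,j}(N, D_i + D_j)
```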
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
Since the fine-tuned models are often available while their training data is not, this creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
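The entry above (Dataless Knowledge Fusion by Merging Weights of Language Models) merges fine-tuned models directly in parameter space, with no training data required. As a minimal baseline illustration of parameter-space merging, the sketch below simply takes a weighted average of same-architecture checkpoints; the paper's actual merging rule is more involved than a plain average.

```python
import copy

def merge_in_parameter_space(models, weights=None):
    """Weighted average of same-architecture PyTorch model parameters.

    Plain averaging is only a baseline illustration of parameter-space
    merging without data; it is not the paper's specific method.
    """
    n = len(models)
    weights = [1.0 / n] * n if weights is None else weights
    states = [m.state_dict() for m in models]
    merged_state = {
        key: sum(w * s[key] for w, s in zip(weights, states))
        for key in states[0]
    }
    merged = copy.deepcopy(models[0])
    merged.load_state_dict(merged_state)
    return merged
```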
- Model ensemble instead of prompt fusion: a sample-specific knowledge transfer method for few-shot prompt tuning [85.55727213502402]
We focus on improving the few-shot performance of prompt tuning by transferring knowledge from soft prompts of source tasks.
We propose Sample-specific Ensemble of Source Models (SESoM).
SESoM learns to adjust the contribution of each source model for each target sample separately when ensembling source model outputs.
arXiv Detail & Related papers (2022-10-23T01:33:16Z)
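The SESoM entry above relies on weighting each source model differently for every target sample when ensembling their outputs. The sketch below shows one generic way to realize that: a small scorer maps a sample representation to softmax weights over the source-model logits. The scorer and its inputs are hypothetical stand-ins, not the architecture from the paper.

```python
import torch.nn.functional as F

def sample_specific_ensemble(source_logits, sample_embedding, scorer):
    """Combine source-model outputs with weights computed per target sample.

    source_logits:    tensor of shape (num_sources, num_classes)
    sample_embedding: tensor of shape (hidden_dim,) for the current sample
    scorer:           any module mapping the embedding to num_sources scores
                      (a hypothetical stand-in for how sources are scored)
    """
    weights = F.softmax(scorer(sample_embedding), dim=-1)       # (num_sources,)
    return (weights.unsqueeze(-1) * source_logits).sum(dim=0)   # (num_classes,)
```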
- Multi-stage Pre-training over Simplified Multimodal Pre-training Models [35.644196343835674]
We propose a new Multi-stage Pre-training (MSP) method, which uses information at different granularities from word, phrase to sentence in both texts and images to pre-train the model in stages.
We also design several different pre-training tasks suited to the information granularity of each stage, in order to efficiently capture diverse knowledge from a limited corpus.
Experimental results show that our method achieves performance comparable to the original LXMERT model on all downstream tasks, and even outperforms the original model on the Image-Text Retrieval task.
arXiv Detail & Related papers (2021-07-22T03:35:27Z)
- Early Stage LM Integration Using Local and Global Log-Linear Combination [46.91755970827846]
Sequence-to-sequence models with an implicit alignment mechanism (e.g., attention) are closing the performance gap to traditional hybrid hidden Markov model (HMM) systems.
One important factor to improve word error rate in both cases is the use of an external language model (LM) trained on large text-only corpora.
We present a novel method for language model integration into implicit-alignment based sequence-to-sequence models.
arXiv Detail & Related papers (2020-05-20T13:49:55Z)
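The last entry integrates an external language model into a sequence-to-sequence model through a log-linear combination. As a rough illustration, the snippet below applies a shallow-fusion-style global combination, adding a scaled external-LM log-probability to the sequence-to-sequence score; the weight `lam` and the function name are assumptions, and the paper's local and global variants differ in where the combination is applied.

```python
import math

def log_linear_score(seq2seq_logprob, lm_logprob, lam=0.3):
    """Combine scores log-linearly: score = log p_s2s + lam * log p_LM.

    Generic shallow-fusion-style scoring used purely as an illustration;
    `lam` is a hypothetical interpolation weight tuned on held-out data.
    """
    return seq2seq_logprob + lam * lm_logprob

# Toy usage: rescore a hypothesis whose seq2seq prob is 0.42 and LM prob is 0.10.
combined = log_linear_score(math.log(0.42), math.log(0.10))
```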