Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
- URL: http://arxiv.org/abs/2502.01804v1
- Date: Mon, 03 Feb 2025 20:33:20 GMT
- Title: Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging
- Authors: Pierre Ablin, Angelos Katharopoulos, Skyler Seto, David Grangier
- Abstract summary: Soup-of-Experts can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model.
We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly.
- Score: 23.44999968321367
- Abstract: Machine learning models are routinely trained on a mixture of different data domains. Different domain weights yield very different downstream performances. We propose the Soup-of-Experts, a novel architecture that can instantiate a model at test time for any domain weights with minimal computational cost and without re-training the model. Our architecture consists of a bank of expert parameters, which are linearly combined to instantiate one model. We learn the linear combination coefficients as a function of the input domain weights. To train this architecture, we sample random domain weights, instantiate the corresponding model, and backprop through one batch of data sampled with these domain weights. We demonstrate how our approach obtains small specialized models on several language modeling tasks quickly. Soup-of-Experts are particularly appealing when one needs to ship many different specialist models quickly under a model size constraint.
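The description above maps directly to a small training loop: keep a bank of expert parameter vectors, learn a map from domain weights to combination coefficients, and at each step sample random domain weights, instantiate the combined model, and backpropagate through one batch drawn with those weights. The sketch below is a minimal toy version with a linear specialist model; the sizes, the Dirichlet sampling of domain weights, the softmax over coefficients, and the `sample_batch` placeholder are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

n_domains, n_experts = 3, 4
d_in, d_out = 32, 8
n_params = d_in * d_out                      # parameters of one linear "specialist"

# Bank of expert parameters and a map from domain weights to combination coefficients.
expert_bank = nn.Parameter(0.01 * torch.randn(n_experts, n_params))
coeff_net = nn.Linear(n_domains, n_experts)

opt = torch.optim.Adam([expert_bank, *coeff_net.parameters()], lr=1e-3)

def sample_batch(domain_weights, batch_size=64):
    # Placeholder for "sample a batch according to the domain weights".
    x = torch.randn(batch_size, d_in)
    y = 0.1 * x @ torch.randn(d_in, d_out)   # stand-in targets
    return x, y

for step in range(200):
    w = torch.distributions.Dirichlet(torch.ones(n_domains)).sample()  # random domain weights
    coeffs = torch.softmax(coeff_net(w), dim=-1)                       # combination coefficients
    theta = coeffs @ expert_bank                                       # instantiate one model
    x, y = sample_batch(w)
    loss = ((x @ theta.view(d_in, d_out) - y) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Test time: instantiate a specialist for any domain mixture without retraining.
with torch.no_grad():
    w_test = torch.tensor([0.8, 0.1, 0.1])
    specialist = (torch.softmax(coeff_net(w_test), -1) @ expert_bank).view(d_in, d_out)
```

At test time, any target domain mixture yields a specialist through a single pass of the coefficient map and one matrix product, which is what makes shipping many small specialists cheap.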
Related papers
- Learning on Model Weights using Tree Experts [39.90685550999956]
We show how to train neural networks that use other networks as input.
ProbeX is the first probing method specifically designed to learn from the weights of a single model layer.
We demonstrate the effectiveness of ProbeX by predicting the categories in a model's training dataset based only on its weights.
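To make the weight-probing idea concrete, here is a toy illustration (not the ProbeX architecture): flatten the weights of one layer from many models and fit an ordinary classifier that predicts a property of each model's training data. The synthetic weight generator and the logistic-regression probe are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy setup: models trained on category 0 vs 1 end up with slightly different
# weight statistics in one layer; we probe those flattened weights directly.
def fake_layer_weights(category, shape=(16, 16)):
    return rng.normal(loc=0.05 * category, scale=1.0, size=shape).ravel()

labels = rng.integers(0, 2, size=200)
weights = np.stack([fake_layer_weights(c) for c in labels])

probe = LogisticRegression(max_iter=1000).fit(weights[:150], labels[:150])
print("held-out accuracy:", probe.score(weights[150:], labels[150:]))
```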
arXiv Detail & Related papers (2024-10-17T17:17:09Z)
- Cross-Domain Content Generation with Domain-Specific Small Language Models [3.2772349789781616]
This study explores methods to enable a small language model to produce coherent and relevant outputs for two different domains.
We find that utilizing custom tokenizers tailored to each dataset significantly enhances generation quality.
Our findings demonstrate that knowledge expansion with frozen layers is an effective method for small language models to generate domain-specific content.
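The "knowledge expansion with frozen layers" recipe amounts to freezing most of a small pretrained model and training only the newly added or top layers on the new domain. A generic PyTorch sketch follows; the tiny architecture and the choice of which blocks stay trainable are placeholders, not the paper's setup.

```python
import torch.nn as nn

# Illustrative small model; sizes and layer choices are placeholders.
model = nn.Sequential(
    nn.Embedding(32_000, 256),
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    nn.Linear(256, 32_000),
)

# Freeze everything except the last block and the output head,
# then fine-tune only those parameters on domain-specific text.
for p in model.parameters():
    p.requires_grad = False
for p in list(model[2].parameters()) + list(model[3].parameters()):
    p.requires_grad = True

trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable), "trainable parameters")
```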
arXiv Detail & Related papers (2024-09-19T21:45:13Z)
- Revisiting SMoE Language Models by Evaluating Inefficiencies with Task Specific Expert Pruning [78.72226641279863]
Sparse Mixture of Expert (SMoE) models have emerged as a scalable alternative to dense models in language modeling.
Our research explores task-specific model pruning to inform decisions about designing SMoE architectures.
We introduce an adaptive task-aware pruning technique UNCURL to reduce the number of experts per MoE layer in an offline manner post-training.
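UNCURL itself is not reproduced here, but the underlying operation, offline task-aware reduction of the number of experts per MoE layer, can be illustrated with a simple heuristic: collect router probabilities on the target task and keep only the most-used experts. The usage-based ranking below is an assumption standing in for the paper's criterion.

```python
import numpy as np

def prune_experts(router_probs, keep: int):
    """Keep the `keep` most-used experts for a task, based on router
    probabilities collected offline on that task's data.

    router_probs: array of shape (n_tokens, n_experts).
    Returns indices of retained experts.
    """
    usage = router_probs.mean(axis=0)            # average routing mass per expert
    keep_idx = np.argsort(usage)[::-1][:keep]    # most-used experts first
    return np.sort(keep_idx)

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(8), size=10_000)   # fake router outputs, 8 experts
print("experts kept for this task:", prune_experts(probs, keep=2))
```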
arXiv Detail & Related papers (2024-09-02T22:35:03Z)
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
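As a rough, unofficial sketch of what "elect, mask and rescale" can look like for task vectors (differences between fine-tuned checkpoints and a shared pretrained model): elect a per-parameter sign, build a unified vector from agreeing entries, keep a per-task mask of agreeing parameters, and compute a per-task rescaling factor. The exact formulas below are a simplification for illustration, not the paper's recipe.

```python
import numpy as np

def emr_merge_sketch(task_vectors):
    """Rough elect / mask / rescale sketch over task vectors
    (fine-tuned weights minus a shared pretrained model)."""
    tv = np.asarray(task_vectors, dtype=float)       # shape (n_tasks, n_params)
    elected_sign = np.sign(tv.sum(axis=0))           # elect: dominant sign per parameter
    agree = np.sign(tv) == elected_sign              # entries agreeing with that sign
    unified = elected_sign * np.where(agree, np.abs(tv), 0.0).max(axis=0)
    masks = agree                                    # per-task masks of agreeing parameters
    scales = []
    for i in range(tv.shape[0]):                     # per-task rescaling factor
        m = masks[i]
        if m.any():
            scales.append(np.abs(tv[i, m]).mean() / (np.abs(unified[m]).mean() + 1e-12))
        else:
            scales.append(1.0)
    return unified, masks, np.array(scales)

# For task i, a merged specialist would use: pretrained + scales[i] * masks[i] * unified
```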
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Heterogeneous Federated Learning Using Knowledge Codistillation [23.895665011884102]
We propose a method that involves training a small model on the entire pool and a larger model on a subset of clients with higher capacity.
The models exchange information bidirectionally via knowledge distillation, utilizing an unlabeled dataset on a server without sharing parameters.
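One bidirectional codistillation step on an unlabeled server batch might look like the following: each model treats the other's softened logits as targets, and no parameters are exchanged. The temperature, optimizers, and update order are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def codistill_step(small, large, unlabeled_x, opt_small, opt_large, T=2.0):
    """One bidirectional distillation step on an unlabeled batch:
    each model learns from the other's soft predictions (no parameter sharing)."""
    logits_s, logits_l = small(unlabeled_x), large(unlabeled_x)
    loss_s = F.kl_div(F.log_softmax(logits_s / T, -1),
                      F.softmax(logits_l.detach() / T, -1),
                      reduction="batchmean") * T * T
    opt_small.zero_grad(); loss_s.backward(); opt_small.step()

    logits_s, logits_l = small(unlabeled_x), large(unlabeled_x)
    loss_l = F.kl_div(F.log_softmax(logits_l / T, -1),
                      F.softmax(logits_s.detach() / T, -1),
                      reduction="batchmean") * T * T
    opt_large.zero_grad(); loss_l.backward(); opt_large.step()
```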
arXiv Detail & Related papers (2023-10-04T03:17:26Z)
- The Languini Kitchen: Enabling Language Modelling Research at Different Scales of Compute [66.84421705029624]
We introduce an experimental protocol that enables model comparisons based on equivalent compute, measured in accelerator hours.
We pre-process an existing large, diverse, and high-quality dataset of books that surpasses existing academic benchmarks in quality, diversity, and document length.
This work also provides two baseline models: a feed-forward model derived from the GPT-2 architecture and a recurrent model in the form of a novel LSTM with ten-fold throughput.
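Comparing models at equivalent compute measured in accelerator hours reduces to simple bookkeeping: convert the hour budget and each model's measured throughput into a token budget. The model names and throughput numbers below are made up purely for illustration.

```python
# Compare two models at an equal compute budget measured in accelerator hours.
budget_hours = 24.0
throughput_tokens_per_sec = {"model_a": 90_000, "model_b": 900_000}  # hypothetical

token_budget = {name: tps * budget_hours * 3600
                for name, tps in throughput_tokens_per_sec.items()}
for name, tokens in token_budget.items():
    print(f"{name}: train on {tokens / 1e9:.1f}B tokens within the budget")
```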
arXiv Detail & Related papers (2023-09-20T10:31:17Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
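The simplest instance of merging models in parameter space is a weighted average of their checkpoints, shown below to illustrate the setting; the paper's actual merger is more refined than this plain average.

```python
import torch

def average_state_dicts(state_dicts, weights=None):
    """Merge fine-tuned checkpoints in parameter space with a weighted average.
    Only the simplest form of parameter-space fusion, for illustration."""
    n = len(state_dicts)
    weights = weights or [1.0 / n] * n
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage: merged = average_state_dicts([model_a.state_dict(), model_b.state_dict()])
```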
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Synthetic Model Combination: An Instance-wise Approach to Unsupervised Ensemble Learning [92.89846887298852]
Consider making a prediction over new test data without any opportunity to learn from a training set of labelled data.
Instead, we are given access to a set of expert models and their predictions, alongside some limited information about the dataset used to train them.
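A toy version of instance-wise combination weights each expert's prediction by how close the test point is to a crude summary of that expert's training data (here just a mean vector). This stand-in is an assumption for illustration, not the paper's method.

```python
import numpy as np

def instancewise_ensemble(x, expert_preds, expert_train_means, temp=1.0):
    """Combine expert predictions per test instance, weighting each expert by
    the proximity of x to a summary of its training data."""
    d = np.array([np.linalg.norm(x - mu) for mu in expert_train_means])
    w = np.exp(-d / temp)
    w /= w.sum()
    return sum(wi * pi for wi, pi in zip(w, expert_preds))

x = np.array([0.9, 0.1])
preds = [np.array([0.2]), np.array([0.8])]             # each expert's prediction for x
means = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]   # limited info about training data
print(instancewise_ensemble(x, preds, means))
```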
arXiv Detail & Related papers (2022-10-11T10:20:31Z)
- Revealing Secrets From Pre-trained Models [2.0249686991196123]
Transfer-learning has been widely adopted in many emerging deep learning algorithms.
We show that pre-trained models and their fine-tuned counterparts have highly similar weight values.
We propose a new model extraction attack that reveals the model architecture and the pre-trained model used by the black-box victim model.
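The observation that fine-tuned weights stay close to their pretrained initialization can be checked directly; the helper below (an illustrative utility, not code from the paper) computes per-tensor cosine similarities between two state dicts, which is the kind of signal such fingerprinting relies on.

```python
import torch

def layerwise_cosine(sd_pretrained, sd_finetuned):
    """Cosine similarity between corresponding weight tensors of a pre-trained
    model and a model fine-tuned from it; consistently high values suggest
    which public checkpoint a black-box model started from."""
    sims = {}
    for k, w0 in sd_pretrained.items():
        if k in sd_finetuned and w0.dtype.is_floating_point:
            w1 = sd_finetuned[k]
            sims[k] = torch.nn.functional.cosine_similarity(
                w0.flatten(), w1.flatten(), dim=0).item()
    return sims
```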
arXiv Detail & Related papers (2022-07-19T20:19:03Z)
- Ensemble Distillation for Robust Model Fusion in Federated Learning [72.61259487233214]
Federated Learning (FL) is a machine learning setting where many devices collaboratively train a machine learning model.
In most current training schemes, the central model is refined by averaging the parameters of the server model and the updated parameters from the client side.
We propose ensemble distillation for model fusion, i.e. training the central classifier through unlabeled data on the outputs of the models from the clients.
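A sketch of that fusion step: after aggregation, refine the server model on unlabeled data using the averaged client logits as soft targets. The optimizer, temperature, and loop structure below are illustrative choices.

```python
import torch
import torch.nn.functional as F

def distill_from_clients(server_model, client_models, unlabeled_loader, epochs=1, T=2.0):
    """Refine the aggregated server model by distilling the averaged client
    predictions on unlabeled data (a sketch of ensemble-distillation fusion)."""
    opt = torch.optim.Adam(server_model.parameters(), lr=1e-4)
    for _ in range(epochs):
        for x in unlabeled_loader:
            with torch.no_grad():
                teacher = torch.stack([m(x) for m in client_models]).mean(0)
            loss = F.kl_div(F.log_softmax(server_model(x) / T, dim=-1),
                            F.softmax(teacher / T, dim=-1),
                            reduction="batchmean") * T * T
            opt.zero_grad(); loss.backward(); opt.step()
    return server_model
```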
arXiv Detail & Related papers (2020-06-12T14:49:47Z)