CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing
- URL: http://arxiv.org/abs/2204.06625v1
- Date: Wed, 13 Apr 2022 19:54:51 GMT
- Title: CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing
- Authors: Chen Liang, Pengcheng He, Yelong Shen, Weizhu Chen, Tuo Zhao
- Abstract summary: We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models, which can effectively promote the model diversity.
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
- Score: 83.63107444454938
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Model ensemble is a popular approach to produce a low-variance and
well-generalized model. However, it induces large memory and inference costs,
which are often not affordable for real-world deployment. Existing work has
resorted to sharing weights among models. However, when increasing the
proportion of the shared weights, the resulting models tend to be similar, and
the benefits of using model ensemble diminish. To retain ensemble benefits
while maintaining a low memory cost, we propose a consistency-regularized
ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of bottom layers across all models and apply
different perturbations to the hidden representations for different models,
which can effectively promote the model diversity. Meanwhile, we apply a
prediction consistency regularizer across the perturbed models to control the
variance due to the model diversity. Our experiments using large language
models demonstrate that CAMERO significantly improves the generalization
performance of the ensemble model. Specifically, CAMERO outperforms the
standard ensemble of 8 BERT-base models on the GLUE benchmark by 0.7 points with a
significantly smaller model size (114.2M vs. 880.6M).
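
The training objective described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a dropout-style mask as the perturbation, a single shared linear layer standing in for the shared bottom layers, and the KL divergence from each model's prediction to the ensemble mean as the consistency measure. All function names and the `alpha` weight are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for strictly positive distributions (softmax outputs)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def shared_bottom(x, shared_w):
    """Stand-in for the shared bottom layers: one linear map plus tanh."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in shared_w]


def perturb(hidden, rng, drop_prob):
    """Assumed perturbation: an independent dropout mask per model."""
    keep = 1.0 - drop_prob
    return [h / keep if rng.random() < keep else 0.0 for h in hidden]


def apply_head(hidden, head_w):
    """Model-specific top layer producing class logits."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in head_w]


def camero_style_loss(x, label, shared_w, heads, rng, drop_prob=0.1, alpha=1.0):
    """Average task loss plus alpha times a prediction consistency penalty."""
    hidden = shared_bottom(x, shared_w)  # computed once, shared by all models
    probs = [softmax(apply_head(perturb(hidden, rng, drop_prob), hw))
             for hw in heads]
    n = len(probs)
    # Cross-entropy of each perturbed model, averaged over the ensemble.
    task = -sum(math.log(p[label]) for p in probs) / n
    # Ensemble prediction: mean of the per-model distributions.
    mean = [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]
    # Consistency: how far each model strays from the ensemble mean.
    consistency = sum(kl_divergence(p, mean) for p in probs) / n
    return task + alpha * consistency, task, consistency
```

With identical heads and `drop_prob=0.0` every model makes the same prediction, so the consistency term vanishes; diversity from perturbations or distinct heads makes it positive, and `alpha` controls how strongly that diversity is reined in.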
Related papers
- EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free: it requires neither data nor additional training while still delivering strong performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z)
- Continuous Language Model Interpolation for Dynamic and Controllable Text Generation [7.535219325248997]
We focus on the challenging case where the model must dynamically adapt to diverse -- and often changing -- user preferences.
We leverage adaptation methods based on linear weight interpolation, casting them as continuous multi-domain interpolators.
We show that varying the weights yields predictable and consistent change in the model outputs.
arXiv Detail & Related papers (2024-04-10T15:55:07Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
This creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space.
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time [69.7693300927423]
We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations improves accuracy and robustness.
We show that the model soup approach extends to multiple image classification and natural language processing tasks.
arXiv Detail & Related papers (2022-03-10T17:03:49Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This presents an interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.