When Ensembling Smaller Models is More Efficient than Single Large
Models
- URL: http://arxiv.org/abs/2005.00570v1
- Date: Fri, 1 May 2020 18:56:18 GMT
- Title: When Ensembling Smaller Models is More Efficient than Single Large
Models
- Authors: Dan Kondratyuk, Mingxing Tan, Matthew Brown, and Boqing Gong
- Abstract summary: We show that ensembles can outperform single models, achieving both higher accuracy and fewer total FLOPs to compute.
This suggests that the output diversity gained by ensembling can often be more efficient than training larger models.
- Score: 52.38997176317532
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensembling is a simple and popular technique for boosting evaluation
performance by training multiple models (e.g., with different initializations)
and aggregating their predictions. This approach is commonly reserved for the
largest models, as it is widely held that increasing the model size provides a
more substantial reduction in error than ensembling smaller models. However, we
show through experiments on CIFAR-10 and ImageNet that ensembles can outperform
single models, achieving higher accuracy while requiring fewer total FLOPs to
compute, even when the individual models' weights and hyperparameters are highly
optimized. Furthermore, this advantage widens as models become larger. This
suggests that the output diversity gained by ensembling can often be a more
efficient use of compute than training a larger model, especially when the
models approach the size that their dataset can support. Instead of following
the common practice of tuning a single large model, one can use ensembles as a
more flexible trade-off between inference speed and accuracy. This can also ease
hardware design, e.g., by making it easier to parallelize a model across
multiple workers for real-time or distributed inference.
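To make the aggregation step concrete, below is a minimal sketch (not the authors' code) of output-level ensembling: several independently initialized small classifiers are run on the same batch and their softmax outputs are averaged. The tiny `SmallNet` module and the random input are placeholder assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder small classifier; in the paper the members are image classifiers
# (e.g., on CIFAR-10/ImageNet), but any architecture illustrates the aggregation.
class SmallNet(nn.Module):
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, num_classes)

    def forward(self, x):
        return self.head(self.features(x).flatten(1))

def ensemble_predict(models, x):
    """Average the softmax outputs of independently trained ensemble members."""
    probs = torch.stack([F.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)

# A few small members trained from different random initializations (training
# omitted here) stand in for a single large model at inference time.
members = [SmallNet() for _ in range(3)]
x = torch.randn(8, 3, 32, 32)            # e.g. a CIFAR-10-sized batch
avg_probs = ensemble_predict(members, x)
predictions = avg_probs.argmax(dim=-1)   # final ensemble predictions, shape (8,)
```

Because the members run independently, the ensemble's total FLOPs are essentially the sum of the members' FLOPs, which is what makes the accuracy-per-FLOP comparison against a single large model straightforward, and the members can trivially be placed on separate workers for parallel inference.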
Related papers
- A Collaborative Ensemble Framework for CTR Prediction [73.59868761656317]
We propose a novel framework, Collaborative Ensemble Training Network (CETNet), to leverage multiple distinct models.
Unlike naive model scaling, our approach emphasizes diversity among the constituent models, which are trained collaboratively.
We validate our framework on three public datasets and a large-scale industrial dataset from Meta.
arXiv Detail & Related papers (2024-11-20T20:38:56Z)
- Dataless Knowledge Fusion by Merging Weights of Language Models [51.8162883997512]
Fine-tuning pre-trained language models has become the prevalent paradigm for building downstream NLP models.
Often the fine-tuned models are available but their training data is not, which creates a barrier to fusing knowledge across individual models to yield a better single model.
We propose a dataless knowledge fusion method that merges models in their parameter space (a minimal weight-averaging sketch appears after this list).
arXiv Detail & Related papers (2022-12-19T20:46:43Z)
- CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing [83.63107444454938]
We propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO.
Specifically, we share the weights of the bottom layers across all models and apply different perturbations to the hidden representations of different models, which effectively promotes model diversity (a loose sketch of this idea appears after this list).
Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model.
arXiv Detail & Related papers (2022-04-13T19:54:51Z)
- Predicting on the Edge: Identifying Where a Larger Model Does Better [61.793778186198864]
We show that large models yield the largest improvements on examples where the small model is most uncertain.
We also show that a switcher model, which defers examples to a larger model when the small model is uncertain, can achieve striking improvements in performance and resource usage (a cascade sketch appears after this list).
arXiv Detail & Related papers (2022-02-15T18:53:14Z)
- Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers [94.43313684188819]
We study the impact of model size, focusing on Transformer models for NLP tasks that are limited by compute.
We first show that even though smaller Transformer models execute faster per iteration, wider and deeper models converge in significantly fewer steps.
This leads to an apparent trade-off between the training efficiency of large Transformer models and the inference efficiency of small Transformer models.
arXiv Detail & Related papers (2020-02-26T21:17:13Z)
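For the Dataless Knowledge Fusion entry above, here is a minimal sketch of the simplest form of parameter-space merging: uniform weight averaging of models that share an architecture. The paper proposes a more refined merging rule than the plain average shown here; this sketch, with hypothetical toy models, only illustrates what merging in parameter space means and why it needs no training data.

```python
import torch.nn as nn

def merge_in_parameter_space(models, weights=None):
    """Merge same-architecture models by averaging their parameters.

    The simplest parameter-space fusion: it needs no training data,
    only the checkpoints themselves.
    """
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    state_dicts = [m.state_dict() for m in models]
    return {name: sum(w * sd[name] for w, sd in zip(weights, state_dicts))
            for name in state_dicts[0]}

# Hypothetical stand-ins for two fine-tuned checkpoints of the same base model.
model_a, model_b = nn.Linear(4, 2), nn.Linear(4, 2)
fused = nn.Linear(4, 2)
fused.load_state_dict(merge_in_parameter_space([model_a, model_b]))
```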
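For the CAMERO entry above, the following is a loose, hypothetical rendition of the weight-sharing-plus-perturbation idea: a shared bottom encoder, a per-member random perturbation of the hidden representation, and a consistency term that pulls the members' predictions toward each other. The specific architecture, noise model, and MSE-based consistency loss are assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedBottomEnsemble(nn.Module):
    """Ensemble whose members share the bottom encoder but keep separate heads."""

    def __init__(self, num_members: int = 3, in_dim: int = 16,
                 hidden: int = 32, num_classes: int = 3, noise_std: float = 0.1):
        super().__init__()
        self.encoder = nn.Linear(in_dim, hidden)   # shared bottom layer(s)
        self.heads = nn.ModuleList(nn.Linear(hidden, num_classes)
                                   for _ in range(num_members))
        self.noise_std = noise_std                 # per-member perturbation scale

    def forward(self, x):
        h = torch.relu(self.encoder(x))
        # Each member sees a differently perturbed copy of the shared hidden
        # representation, which is what promotes diversity among members.
        return [head(h + self.noise_std * torch.randn_like(h)) for head in self.heads]

def consistency_loss(outputs):
    """Penalize disagreement by pulling each member toward the mean prediction."""
    mean_logits = torch.stack(outputs).mean(dim=0)
    return sum(F.mse_loss(o, mean_logits) for o in outputs) / len(outputs)

model = SharedBottomEnsemble()
x, y = torch.randn(4, 16), torch.randint(0, 3, (4,))
outputs = model(x)
task_loss = sum(F.cross_entropy(o, y) for o in outputs) / len(outputs)
loss = task_loss + consistency_loss(outputs)
loss.backward()
```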
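For the Predicting on the Edge entry above, a minimal cascade sketch: the small model handles every example, and only examples where its confidence (here, a hypothetical max-softmax-probability signal with a hand-picked threshold) falls short are deferred to the large model. This is an illustrative sketch, not the paper's switcher model.

```python
import torch
import torch.nn.functional as F

def cascade_predict(small_model, large_model, x, confidence_threshold: float = 0.9):
    """Run the small model on everything; defer low-confidence examples to the large model."""
    small_probs = F.softmax(small_model(x), dim=-1)
    confidence, preds = small_probs.max(dim=-1)
    defer = confidence < confidence_threshold       # the "uncertain" examples
    if defer.any():
        # The large model only ever sees the deferred subset of the batch.
        preds[defer] = large_model(x[defer]).argmax(dim=-1)
    return preds, defer

# Hypothetical stand-ins for a small and a large 10-class classifier.
small = torch.nn.Linear(8, 10)
large = torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
preds, deferred = cascade_predict(small, large, torch.randn(32, 8))
print(f"deferred {int(deferred.sum())} of {len(deferred)} examples to the large model")
```

The resource saving comes from the large model seeing only the deferred subset of the batch.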
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.