Predicting on the Edge: Identifying Where a Larger Model Does Better
- URL: http://arxiv.org/abs/2202.07652v1
- Date: Tue, 15 Feb 2022 18:53:14 GMT
- Title: Predicting on the Edge: Identifying Where a Larger Model Does Better
- Authors: Taman Narayan, Heinrich Jiang, Sen Zhao, Sanjiv Kumar
- Abstract summary: We show that large models have the largest improvement on examples where the small model is most uncertain.
We show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage.
- Score: 61.793778186198864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Much effort has been devoted to making larger and more accurate models, but
relatively little effort has been put into understanding which examples benefit
from the added complexity. In this paper, we demonstrate and analyze the
surprisingly tight link between a model's predictive uncertainty on individual
examples and the likelihood that larger models will improve prediction on them.
Through extensive numerical studies on the T5 encoder-decoder architecture, we
show that large models have the largest improvement on examples where the small
model is most uncertain. On more certain examples, even those where the small
model is not particularly accurate, large models are often unable to improve at
all, and can even perform worse than the smaller model. Based on these
findings, we show that a switcher model which defers examples to a larger model
when a small model is uncertain can achieve striking improvements in
performance and resource usage. We also explore committee-based uncertainty
metrics that can be more effective but less practical.
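To make the switching idea concrete, below is a minimal sketch of an uncertainty-based switcher in a classification setting, assuming both models expose logits and using the small model's maximum softmax probability as the confidence signal; the threshold value, function names, and the committee-disagreement helper are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def switcher_predict(x, small_model, large_model, confidence_threshold=0.7):
    """Return the small model's prediction unless it is too uncertain.

    small_model / large_model are assumed to map an input to a logit vector.
    """
    small_probs = softmax(small_model(x))
    confidence = float(small_probs.max())          # max softmax probability
    if confidence >= confidence_threshold:
        return int(small_probs.argmax()), "small"  # keep the cheap prediction
    large_probs = softmax(large_model(x))          # defer the uncertain example
    return int(large_probs.argmax()), "large"

def committee_disagreement(logit_list) -> float:
    """Fraction of committee members that disagree with the majority vote
    (one possible committee-based uncertainty signal)."""
    votes = np.array([int(softmax(l).argmax()) for l in logit_list])
    majority = int(np.bincount(votes).argmax())
    return float((votes != majority).mean())
```

In practice the confidence threshold would be tuned on held-out data to trade accuracy against the fraction of examples routed to the large model.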
Related papers
- A Hitchhiker's Guide to Scaling Law Estimation [56.06982415792523]
Scaling laws predict the loss of a target machine learning model by extrapolating from easier-to-train models with fewer parameters or smaller training sets.
We estimate more than 1000 scaling laws, then derive a set of best practices for estimating scaling laws in new model families.
arXiv Detail & Related papers (2024-10-15T17:59:10Z)
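As a concrete illustration of the extrapolation the entry above describes, the sketch below fits a simple parametric scaling law L(N) = a·N^(-b) + c to losses measured on smaller models and extrapolates to a larger target size; the functional form, data points, and parameter counts are assumptions for illustration, not the paper's estimates.

```python
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(n_params, a, b, c):
    """Power-law-plus-constant form; n_params is rescaled for numerical stability."""
    return a * (n_params / 1e8) ** (-b) + c

# Hypothetical (parameter count, validation loss) pairs from smaller models.
n = np.array([1e7, 3e7, 1e8, 3e8, 1e9])
loss = np.array([3.9, 3.5, 3.1, 2.8, 2.6])

(a, b, c), _ = curve_fit(scaling_law, n, loss, p0=[1.0, 0.3, 2.0])
predicted = scaling_law(1e10, a, b, c)  # extrapolate to a 10B-parameter model
print(f"predicted loss at 10B parameters: {predicted:.2f}")
```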
- Effects of Scale on Language Model Robustness [7.725206196110384]
We show that adversarially trained larger models generalize faster and better than smaller models to modified attacks not seen during training.
We also analyze the offense/defense balance of increasing compute, finding parity in some settings and an advantage for offense in others.
arXiv Detail & Related papers (2024-07-25T17:26:41Z)
- Investigating Ensemble Methods for Model Robustness Improvement of Text Classifiers [66.36045164286854]
We analyze a set of existing bias features and demonstrate that no single model works best in all cases.
By choosing an appropriate bias model, we can obtain better robustness than baselines with more sophisticated model designs.
arXiv Detail & Related papers (2022-10-28T17:52:10Z)
- Efficient Large Scale Language Modeling with Mixtures of Experts [61.45159383372181]
Mixture of Experts layers (MoEs) enable efficient scaling of language models through conditional computation.
This paper presents a detailed empirical study of how autoregressive MoE language models scale in comparison with dense models in a wide range of settings.
arXiv Detail & Related papers (2021-12-20T17:05:11Z)
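To illustrate the conditional computation behind MoE layers mentioned in the entry above, here is a minimal NumPy sketch of a top-k routed feed-forward layer; the dimensions, top-k value, and random weights are illustrative assumptions, not the cited paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_hidden, num_experts, top_k = 16, 32, 4, 2

# One small feed-forward "expert" per slot, plus a linear router.
experts = [(rng.normal(size=(d_model, d_hidden)) * 0.02,
            rng.normal(size=(d_hidden, d_model)) * 0.02)
           for _ in range(num_experts)]
router_w = rng.normal(size=(d_model, num_experts)) * 0.02

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top_k experts; only those experts run."""
    logits = x @ router_w                                  # (tokens, num_experts)
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = np.argsort(probs[t])[-top_k:]             # indices of top_k experts
        gates = probs[t, chosen] / probs[t, chosen].sum()  # renormalized gate weights
        for e, w in zip(chosen, gates):
            w_in, w_out = experts[e]
            hidden = np.maximum(x[t] @ w_in, 0.0)          # ReLU feed-forward expert
            out[t] += w * (hidden @ w_out)
    return out

tokens = rng.normal(size=(3, d_model))
print(moe_layer(tokens).shape)  # (3, 16): same shape, but only top_k experts ran per token
```

At scale the per-token loop is replaced by batched dispatch across devices, but the routing idea is the same.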
- When in Doubt, Summon the Titans: Efficient Inference with Large Models [80.2673230098021]
We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference.
arXiv Detail & Related papers (2021-10-19T22:56:49Z)
- When Ensembling Smaller Models is More Efficient than Single Large Models [52.38997176317532]
We show that ensembles can outperform single models, achieving both higher accuracy and fewer total FLOPs to compute.
This presents the interesting observation that output diversity in ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.