When in Doubt, Summon the Titans: Efficient Inference with Large Models
- URL: http://arxiv.org/abs/2110.10305v1
- Date: Tue, 19 Oct 2021 22:56:49 GMT
- Title: When in Doubt, Summon the Titans: Efficient Inference with Large Models
- Authors: Ankit Singh Rawat, Manzil Zaheer, Aditya Krishna Menon, Amr Ahmed,
Sanjiv Kumar
- Abstract summary: We propose a two-stage framework based on distillation that realizes the modelling benefits of large models.
We use the large teacher models to guide the lightweight student models to only make correct predictions on a subset of "easy" examples.
Our proposed use of distillation to only handle easy instances allows for a more aggressive trade-off in the student size, thereby reducing the amortized cost of inference.
- Score: 80.2673230098021
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scaling neural networks to "large" sizes, with billions of parameters, has
been shown to yield impressive results on many challenging problems. However,
the inference cost incurred by such large models often prevents their
application in most real-world settings. In this paper, we propose a two-stage
framework based on distillation that realizes the modelling benefits of the
large models, while largely preserving the computational benefits of inference
with more lightweight models. In a nutshell, we use the large teacher models to
guide the lightweight student models to only make correct predictions on a
subset of "easy" examples; for the "hard" examples, we fall-back to the
teacher. Such an approach allows us to efficiently employ large models in
practical scenarios where easy examples are much more frequent than rare hard
examples. Our proposed use of distillation to only handle easy instances allows
for a more aggressive trade-off in the student size, thereby reducing the
amortized cost of inference and achieving better accuracy than standard
distillation. Empirically, we demonstrate the benefits of our approach on both
image classification and natural language processing benchmarks.
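To make the deferral rule concrete, the sketch below (in PyTorch) shows one way the two-stage inference could look. It is an illustration under assumed interfaces, not the authors' implementation: `student` and `teacher` are classifiers returning logits, and `tau` is a hypothetical confidence threshold for deciding which examples count as "easy".

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def two_stage_predict(student, teacher, x, tau=0.9):
    """Keep the cheap student prediction when it is confident ("easy"
    examples); fall back to the large teacher otherwise ("hard" examples)."""
    probs = F.softmax(student(x), dim=-1)        # (batch, num_classes)
    confidence, student_pred = probs.max(dim=-1)
    easy = confidence >= tau                     # mask of "easy" examples

    preds = student_pred.clone()
    if (~easy).any():
        # Only the hard examples pay the cost of running the teacher.
        preds[~easy] = teacher(x[~easy]).argmax(dim=-1)
    return preds, easy
```

Because easy examples dominate typical workloads, the amortized cost stays close to that of the student alone, while the teacher recovers accuracy on the hard tail.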
Related papers
- Debias the Black-box: A Fair Ranking Framework via Knowledge
Distillation [26.60241524303918]
We propose a fair information retrieval framework based on knowledge distillation.
This framework can improve the exposure-based fairness of models while considerably decreasing model size.
It also improves fairness performance by 15%-46% while keeping a high level of recommendation effectiveness.
arXiv Detail & Related papers (2022-08-24T15:59:58Z) - Easy Batch Normalization [73.89838982331453]
Easy examples are samples that the machine learning model classifies correctly with high confidence.
We propose using an auxiliary batch normalization for easy examples to improve both standard and robust accuracy.
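As a rough illustration of the auxiliary-branch idea, the sketch below keeps two batch-normalization layers and routes examples by an externally supplied easy/hard mask. The two-branch layout and the mask-based routing are assumptions made for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class DualBatchNorm2d(nn.Module):
    """Batch norm with an auxiliary branch for "easy" examples (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.bn_main = nn.BatchNorm2d(channels)  # ordinary examples
        self.bn_easy = nn.BatchNorm2d(channels)  # auxiliary branch for easy examples

    def forward(self, x, easy_mask):
        # easy_mask: boolean tensor over the batch dimension.
        out = torch.empty_like(x)
        if easy_mask.any():
            out[easy_mask] = self.bn_easy(x[easy_mask])
        if (~easy_mask).any():
            out[~easy_mask] = self.bn_main(x[~easy_mask])
        return out
```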
arXiv Detail & Related papers (2022-07-18T21:01:09Z) - Dropout Inference with Non-Uniform Weight Scaling [6.726255259929496]
Dropout has been used extensively as a regularizer to prevent overfitting when training neural networks.
In this work, we demonstrate scenarios in which some submodels behave more like high-bias models, so that non-uniform weight scaling is a better approximation for inference.
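For intuition, standard dropout inference rescales every weight by the same keep probability, while the non-uniform variant lets the scale differ per unit. The snippet below is a schematic comparison under assumed conventions (a per-input-unit scale vector `s`, dropout acting on the layer's inputs), not the paper's procedure.

```python
import torch

def uniform_scaled_linear(x, weight, bias, keep_prob=0.8):
    """Standard dropout inference: all weights scaled by the keep probability."""
    return x @ (keep_prob * weight).T + bias

def nonuniform_scaled_linear(x, weight, bias, s):
    """Non-uniform variant: input unit i is scaled by s[i] (e.g., tuned on
    validation data); recovers the uniform rule when s is constant."""
    return x @ (weight * s).T + bias   # s broadcasts over weight's input dim
```

Here `weight` has shape `(out_features, in_features)`, so multiplying by `s` rescales the contribution of each input unit individually.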
arXiv Detail & Related papers (2022-04-27T16:41:12Z) - Predicting on the Edge: Identifying Where a Larger Model Does Better [61.793778186198864]
We show that large models have the largest improvement on examples where the small model is most uncertain.
We show that a switcher model which defers examples to a larger model when a small model is uncertain can achieve striking improvements in performance and resource usage.
arXiv Detail & Related papers (2022-02-15T18:53:14Z) - MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
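The "Mix" in MixKD refers to mixup-style interpolation of training inputs. The sketch below illustrates that general recipe under my own assumptions (linear interpolation of input embeddings and matching of the teacher's soft predictions on the mixed input); it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def mixup_distillation_loss(student, teacher, emb_a, emb_b, alpha=0.4, T=2.0):
    """Distill on interpolated inputs: mix two embedding batches and train the
    student to match the teacher on the mixture (illustrative hyper-parameters)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    mixed = lam * emb_a + (1.0 - lam) * emb_b              # mixup in input space
    with torch.no_grad():
        teacher_probs = F.softmax(teacher(mixed) / T, dim=-1)
    student_log_probs = F.log_softmax(student(mixed) / T, dim=-1)
    # Temperature-scaled KL divergence, as in standard distillation.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (T * T)
```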
arXiv Detail & Related papers (2020-11-01T18:47:51Z) - When Ensembling Smaller Models is More Efficient than Single Large
Models [52.38997176317532]
We show that ensembles can outperform single models, achieving higher accuracy while requiring fewer total FLOPs to compute.
This is an interesting observation: exploiting output diversity through ensembling can often be more efficient than training larger models.
arXiv Detail & Related papers (2020-05-01T18:56:18Z) - Robust and On-the-fly Dataset Denoising for Image Classification [72.10311040730815]
On-the-fly Data Denoising (ODD) is robust to mislabeled examples, while introducing almost zero computational overhead compared to standard training.
ODD is able to achieve state-of-the-art results on a wide range of datasets including real-world ones such as WebVision and Clothing1M.
arXiv Detail & Related papers (2020-03-24T03:59:26Z)