Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast - Choose Three
- URL: http://arxiv.org/abs/2010.06721v2
- Date: Thu, 25 Mar 2021 17:32:03 GMT
- Title: Ensemble Distillation for Structured Prediction: Calibrated, Accurate, Fast - Choose Three
- Authors: Steven Reich, David Mueller, Nicholas Andrews
- Abstract summary: We study ensemble distillation as a framework for producing well-calibrated structured prediction models.
We validate this framework on two tasks: named-entity recognition and machine translation.
We find that, across both tasks, ensemble distillation produces models which retain much of, and occasionally improve upon, the performance and calibration benefits of ensembles.
- Score: 7.169968368139168
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Modern neural networks do not always produce well-calibrated predictions,
even when trained with a proper scoring function such as cross-entropy. In
classification settings, simple methods such as isotonic regression or
temperature scaling may be used in conjunction with a held-out dataset to
calibrate model outputs. However, extending these methods to structured
prediction is not always straightforward or effective; furthermore, a held-out
calibration set may not always be available. In this paper, we study ensemble
distillation as a general framework for producing well-calibrated structured
prediction models while avoiding the prohibitive inference-time cost of
ensembles. We validate this framework on two tasks: named-entity recognition
and machine translation. We find that, across both tasks, ensemble distillation
produces models which retain much of, and occasionally improve upon, the
performance and calibration benefits of ensembles, while requiring only a
single model at test time.
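
The central recipe is standard knowledge distillation with the ensemble's averaged predictive distribution as the soft target. A minimal numpy sketch of that loss (the shapes, sizes, and toy data below are illustrative, not taken from the paper; think of per-token label distributions in NER):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Pretend we have M=5 ensemble members, each producing logits over
# C=4 labels for N=3 inputs.
M, N, C = 5, 3, 4
ensemble_logits = rng.normal(size=(M, N, C))

# The distillation target is the ensemble's average predictive
# distribution (average of probabilities, not of logits).
teacher_probs = softmax(ensemble_logits).mean(axis=0)  # (N, C)

# The student minimizes cross-entropy against this soft target;
# this is the quantity one would backpropagate through the student.
student_logits = rng.normal(size=(N, C))
student_log_probs = np.log(softmax(student_logits))
distill_loss = -(teacher_probs * student_log_probs).sum(axis=-1).mean()
print(f"distillation loss: {distill_loss:.4f}")
```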
Related papers
- Towards Calibrated Robust Fine-Tuning of Vision-Language Models [97.19901765814431]
This work proposes a robust fine-tuning method that simultaneously improves both OOD accuracy and confidence calibration in vision-language models.
We show that both the OOD classification error and the OOD calibration error share an upper bound consisting of two terms computed on in-distribution (ID) data.
Based on this insight, we design a novel framework that fine-tunes with a constrained multimodal contrastive loss enforcing a larger smallest singular value.
arXiv Detail & Related papers (2023-11-03T05:41:25Z)
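
As a rough illustration of the singular-value idea in the last point, the sketch below adds a penalty that grows as the smallest singular value of a batch's feature matrix shrinks; the contrastive term, the constraint handling, and the multimodal details of the actual method are omitted, and every name and number here is hypothetical:

```python
import numpy as np

# Hypothetical sketch: penalize a batch whose (centered) feature
# matrix has a small smallest singular value. This only conveys the
# regularizer's shape, not the paper's full constrained objective.

rng = np.random.default_rng(1)
features = rng.normal(size=(64, 16))       # a batch of embeddings
features -= features.mean(axis=0)          # center per dimension

singular_values = np.linalg.svd(features, compute_uv=False)
sigma_min = singular_values.min()

lam = 0.1                                  # illustrative weight
base_loss = 2.34                           # stand-in for a contrastive loss
total_loss = base_loss - lam * sigma_min   # larger sigma_min => lower loss
print(f"sigma_min={sigma_min:.3f}, total loss={total_loss:.3f}")
```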
- Set Learning for Accurate and Calibrated Models [17.187117466317265]
Odd-$k$-out (OKO) learning minimizes the cross-entropy error for sets of examples rather than for single examples.
OKO often yields better calibration even when trained with hard labels and without any additional tuning of calibration parameters.
arXiv Detail & Related papers (2023-07-05T12:39:58Z)
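
One plausible reading of "cross-entropy for sets" is sketched below: pool the logits of a set's members into a single set-level prediction and score it against the set's majority label. This is our illustrative interpretation, not necessarily the paper's exact construction:

```python
import numpy as np

# Illustrative set-level cross-entropy: sum the logits of all set
# members into one set prediction, then apply cross-entropy against
# the label shared by the majority of the set. How OKO samples sets
# and uses the odd-one-out may differ; this only conveys the
# sets-not-examples idea.

rng = np.random.default_rng(2)
C = 5                                   # number of classes
set_logits = rng.normal(size=(4, C))    # 4 members in one set
set_labels = np.array([2, 2, 2, 0])     # three share class 2, one is odd

pooled = set_logits.sum(axis=0)         # set-level logits
m = pooled.max()
log_probs = pooled - (m + np.log(np.exp(pooled - m).sum()))  # log-softmax
majority = np.bincount(set_labels, minlength=C).argmax()
set_loss = -log_probs[majority]
print(f"majority class {majority}, set cross-entropy {set_loss:.4f}")
```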
- Improving Adaptive Conformal Prediction Using Self-Supervised Learning [72.2614468437919]
We train an auxiliary model with a self-supervised pretext task on top of an existing predictive model and use the self-supervised error as an additional feature to estimate nonconformity scores.
We empirically demonstrate the benefit of this additional information on both synthetic and real data, measuring the efficiency (width), deficit, and excess of conformal prediction intervals.
arXiv Detail & Related papers (2023-02-23T18:57:14Z)
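
For context, the split conformal procedure that this line of work builds on turns hold-out residuals into prediction intervals with finite-sample coverage. A minimal numpy sketch of the standard recipe (the self-supervised nonconformity feature from the paper above is not reproduced):

```python
import numpy as np

# Standard split conformal prediction for regression: use hold-out
# residuals to size prediction intervals with finite-sample coverage.

rng = np.random.default_rng(3)

def model(x):
    return 2.0 * x          # stand-in for any fitted predictor

# Calibration (hold-out) data.
x_cal = rng.uniform(0, 1, size=200)
y_cal = 2.0 * x_cal + rng.normal(scale=0.3, size=200)

alpha = 0.1                                   # target 90% coverage
scores = np.abs(y_cal - model(x_cal))         # nonconformity scores
n = len(scores)
q_level = np.ceil((n + 1) * (1 - alpha)) / n  # finite-sample correction
q_hat = np.quantile(scores, min(q_level, 1.0))

x_test = np.array([0.25, 0.5, 0.75])
pred = model(x_test)
lower, upper = pred - q_hat, pred + q_hat
print(np.c_[lower, upper])                    # 90% prediction intervals
```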
- Conformal inference is (almost) free for neural networks trained with early stopping [1.2891210250935146]
Early stopping based on hold-out data is a popular regularization technique designed to mitigate overfitting and increase the predictive accuracy of neural networks.
This paper introduces conformalized early stopping: a novel method that combines early stopping with conformal calibration while efficiently recycling the same hold-out data.
arXiv Detail & Related papers (2023-01-27T06:43:07Z)
- HyperImpute: Generalized Iterative Imputation with Automatic Model Selection [77.86861638371926]
We propose a generalized iterative imputation framework for adaptively and automatically configuring column-wise models.
We provide a concrete implementation with out-of-the-box learners, simulators, and interfaces.
arXiv Detail & Related papers (2022-06-15T19:10:35Z)
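
A toy MICE-style loop conveys the iterative, column-wise idea; HyperImpute's automatic per-column model selection is replaced here with plain least squares, so this is only a skeleton of the approach:

```python
import numpy as np

# Toy iterative (MICE-style) imputation: repeatedly regress each
# column with missing entries on the remaining columns and fill in
# the predictions. HyperImpute's per-column model search is replaced
# here by ordinary least squares.

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
X[:, 2] += X[:, 0] - X[:, 1]              # make columns correlated
mask = rng.uniform(size=X.shape) < 0.1    # 10% missing at random
X_missing = np.where(mask, np.nan, X)

X_imp = np.where(mask, np.nanmean(X_missing, axis=0), X_missing)  # init
for _ in range(5):                        # a few refinement sweeps
    for j in range(X.shape[1]):
        miss = mask[:, j]
        if not miss.any():
            continue
        others = np.delete(X_imp, j, axis=1)
        A = np.c_[others, np.ones(len(X_imp))]      # add intercept
        coef, *_ = np.linalg.lstsq(A[~miss], X_imp[~miss, j], rcond=None)
        X_imp[miss, j] = A[miss] @ coef             # refill predictions

print("imputation RMSE:", np.sqrt(np.mean((X_imp[mask] - X[mask]) ** 2)))
```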
- Functional Ensemble Distillation [18.34081591772928]
We investigate how best to distill an ensemble's predictions using an efficient model.
We find that learning the distilled model with a simple mixup augmentation scheme significantly boosts performance.
arXiv Detail & Related papers (2022-06-05T14:07:17Z)
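
The mixup scheme mentioned above is straightforward to sketch: train the distilled model on convex combinations of inputs, with the ensemble queried on the mixed inputs to provide soft targets. An illustrative numpy fragment; the paper's distilled-model architecture is not modeled, and pairing the mixed inputs with ensemble targets is our assumption:

```python
import numpy as np

# Mixup-style data for distilling an ensemble: mix pairs of inputs
# and use the ensemble's averaged prediction on the mixed input as
# the distillation target.

rng = np.random.default_rng(5)

def ensemble_predict(x):
    # Average the members' predictive distributions.
    logits = np.stack([x @ w for w in member_weights])  # (M, B, C)
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs.mean(axis=0)                           # (B, C)

member_weights = [rng.normal(size=(8, 3)) for _ in range(4)]
x1, x2 = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))

lam = rng.beta(0.4, 0.4, size=(32, 1))   # mixup coefficients
x_mix = lam * x1 + (1 - lam) * x2        # mixed inputs
targets = ensemble_predict(x_mix)        # soft distillation targets
print(x_mix.shape, targets.shape)        # (32, 8) (32, 3)
```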
- TACTiS: Transformer-Attentional Copulas for Time Series [76.71406465526454]
The estimation of time-varying quantities is a fundamental component of decision making in fields such as healthcare and finance.
We propose a versatile method that estimates joint distributions using an attention-based decoder.
We show that our model produces state-of-the-art predictions on several real-world datasets.
arXiv Detail & Related papers (2022-02-07T21:37:29Z)
- Cluster-and-Conquer: A Framework For Time-Series Forecasting [94.63501563413725]
We propose a three-stage framework for forecasting high-dimensional time-series data.
Our framework is highly general, allowing for any time-series forecasting and clustering method to be used in each step.
When instantiated with simple linear autoregressive models, we are able to achieve state-of-the-art results on several benchmark datasets.
arXiv Detail & Related papers (2021-10-26T20:41:19Z)
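
Read as a recipe, the staging above can be made concrete with a toy sketch: cluster the series, fit one shared linear autoregressive model per cluster, and forecast each series with its cluster's model. The exact stages below are our simplification, using the simple-linear-autoregressive instantiation the summary mentions:

```python
import numpy as np

# Toy three-stage forecast: (1) cluster series, (2) fit one linear
# AR(2) model per cluster on pooled data, (3) forecast each series
# with its cluster's model.

rng = np.random.default_rng(6)
T, n = 60, 10
series = np.cumsum(rng.normal(size=(n, T)), axis=1)
series[n // 2:] += 25.0                       # two obvious groups

# Stage 1: crude clustering by mean level (stand-in for any method).
labels = (series.mean(axis=1) > series.mean()).astype(int)

p = 2                                          # AR order
forecasts = np.empty(n)
for c in (0, 1):
    members = series[labels == c]
    # Stage 2: pooled least-squares AR(p) fit across cluster members.
    X = np.concatenate([np.stack([s[t - p:t] for t in range(p, T)])
                        for s in members])
    y = np.concatenate([s[p:] for s in members])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Stage 3: one-step-ahead forecast per member series.
    forecasts[labels == c] = series[labels == c][:, -p:] @ coef

print(forecasts.round(2))
```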
- Evaluating Prediction-Time Batch Normalization for Robustness under Covariate Shift [81.74795324629712]
We evaluate a method, which we call prediction-time batch normalization, that significantly improves model accuracy and calibration under covariate shift.
We show that prediction-time batch normalization provides complementary benefits to existing state-of-the-art approaches for improving robustness.
The method has mixed results when used alongside pre-training, and does not seem to perform as well under more natural types of dataset shift.
arXiv Detail & Related papers (2020-06-19T05:08:43Z)
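
The change itself is small: normalize with the statistics of the incoming test batch rather than the running statistics accumulated during training. A numpy sketch of a single batch-norm layer evaluated both ways (the layer parameters and shift here are illustrative):

```python
import numpy as np

# One batch-norm layer, evaluated two ways: with the running
# statistics stored at training time (standard inference) and with
# the statistics of the incoming test batch (prediction-time BN).

rng = np.random.default_rng(7)
gamma, beta, eps = 1.0, 0.0, 1e-5
run_mean, run_var = 0.0, 1.0               # stats saved during training

# A covariate-shifted test batch: shifted mean, inflated variance.
x = rng.normal(loc=2.0, scale=3.0, size=(256, 1))

standard = gamma * (x - run_mean) / np.sqrt(run_var + eps) + beta
pt_mean, pt_var = x.mean(axis=0), x.var(axis=0)
prediction_time = gamma * (x - pt_mean) / np.sqrt(pt_var + eps) + beta

print("standard inference  mean/std:", standard.mean(), standard.std())
print("prediction-time BN  mean/std:", prediction_time.mean(), prediction_time.std())
```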
- A general framework for ensemble distribution distillation [14.996944635904402]
Ensembles of neural networks have been shown to give better performance than single networks in terms of predictions and uncertainty estimation.
We present a framework for distilling both regression and classification ensembles in a way that preserves the decomposition of predictive uncertainty into data and knowledge uncertainty.
arXiv Detail & Related papers (2020-02-26T14:34:43Z)
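
The decomposition usually meant in this setting is the entropy-based one: total uncertainty, the entropy of the averaged distribution, splits into expected data uncertainty, the average member entropy, plus knowledge uncertainty, their difference (the mutual information). A short numpy check with toy ensemble outputs:

```python
import numpy as np

# Entropy-based uncertainty decomposition for an ensemble:
#   total     = entropy of the averaged distribution, H(E[p])
#   data      = average of the members' entropies, E[H(p)]
#   knowledge = total - data (mutual information, >= 0 by Jensen)
# A distillation that preserves this split must keep all three.

rng = np.random.default_rng(8)

def entropy(p, axis=-1):
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

logits = rng.normal(size=(5, 3))                  # 5 members, 3 classes
e = np.exp(logits - logits.max(axis=-1, keepdims=True))
member_probs = e / e.sum(axis=-1, keepdims=True)

total = entropy(member_probs.mean(axis=0))        # H(E[p])
data = entropy(member_probs).mean()               # E[H(p)]
knowledge = total - data                          # mutual information
print(f"total={total:.4f}, data={data:.4f}, knowledge={knowledge:.4f}")
```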