Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets
- URL: http://arxiv.org/abs/2105.06987v1
- Date: Fri, 14 May 2021 17:50:14 GMT
- Title: Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets
- Authors: Max Ryabinin, Andrey Malinin, Mark Gales
- Abstract summary: Ensemble Distribution Distillation is an approach that allows a single model to efficiently capture both the predictive performance and uncertainty estimates of an ensemble.
For classification, this is achieved by training a Dirichlet distribution over the ensemble members' output distributions via the maximum likelihood criterion.
Although theoretically principled, this criterion exhibits poor convergence when applied to large-scale tasks where the number of classes is very high.
- Score: 12.461503242570643
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Ensembles of machine learning models yield improved system performance as
well as robust and interpretable uncertainty estimates; however, their
inference costs may often be prohibitively high. \emph{Ensemble Distribution
Distillation} is an approach that allows a single model to efficiently capture
both the predictive performance and uncertainty estimates of an ensemble. For
classification, this is achieved by training a Dirichlet distribution over the
ensemble members' output distributions via the maximum likelihood criterion.
Although theoretically principled, this criterion exhibits poor convergence
when applied to large-scale tasks where the number of classes is very high. In
our work, we analyze this effect and show that, for the Dirichlet log-likelihood
criterion, classes with low probability induce larger gradients than
high-probability classes. This forces the model to focus on the distribution of
the ensemble tail-class probabilities. We propose a new training objective that
minimizes the reverse KL-divergence to a \emph{Proxy-Dirichlet} target derived
from the ensemble. This loss resolves the gradient issues of Ensemble
Distribution Distillation, as we demonstrate both theoretically and empirically
on the ImageNet and WMT17 En-De datasets containing 1000 and 40,000 classes,
respectively.
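To make the two training criteria described above concrete, here is a minimal PyTorch sketch (not the authors' implementation) of the Dirichlet negative log-likelihood used in Ensemble Distribution Distillation and of a reverse-KL loss to a Proxy-Dirichlet target. The proxy construction (mean ensemble prediction scaled by a fixed precision, plus one) and the exact KL direction are illustrative assumptions, not the paper's formulation.

```python
# Hedged sketch of the two losses discussed in the abstract; shapes, the
# proxy-target construction, and the fixed precision are assumptions.
import torch


def dirichlet_nll(alpha, member_probs, eps=1e-8):
    """Original EnD^2 criterion: negative log-likelihood of the ensemble
    members' output distributions under the model's Dirichlet(alpha).

    alpha:        (batch, K) predicted concentration parameters, alpha > 0
    member_probs: (batch, M, K) categorical outputs of the M ensemble members
    """
    alpha0 = alpha.sum(dim=-1)                                     # (batch,)
    log_norm = torch.lgamma(alpha0) - torch.lgamma(alpha).sum(-1)
    # average Dirichlet log-density over the M ensemble members
    log_dens = ((alpha - 1).unsqueeze(1)
                * torch.log(member_probs + eps)).sum(-1).mean(1)   # (batch,)
    return -(log_norm + log_dens).mean()


def kl_dirichlet(alpha, beta):
    """Closed-form KL( Dir(alpha) || Dir(beta) ); both inputs are (batch, K)."""
    alpha0, beta0 = alpha.sum(-1), beta.sum(-1)
    t1 = torch.lgamma(alpha0) - torch.lgamma(alpha).sum(-1)
    t2 = torch.lgamma(beta).sum(-1) - torch.lgamma(beta0)
    t3 = ((alpha - beta)
          * (torch.digamma(alpha) - torch.digamma(alpha0).unsqueeze(-1))).sum(-1)
    return t1 + t2 + t3


def proxy_dirichlet_reverse_kl(alpha, member_probs, proxy_precision=100.0):
    """Reverse-KL loss to a Proxy-Dirichlet target built from the ensemble.

    The target below (mean ensemble prediction scaled by a fixed precision,
    plus 1) is a placeholder; the paper derives its target precision from
    ensemble statistics.
    """
    mean_probs = member_probs.mean(dim=1)                          # (batch, K)
    beta = mean_probs * proxy_precision + 1.0                      # proxy target
    return kl_dirichlet(alpha, beta).mean()
```

Per the abstract, the point of the proxy-target reverse-KL objective is to avoid the gradient imbalance of the maximum-likelihood criterion, under which low-probability (tail) classes dominate the update and training converges poorly when the number of classes is large.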
Related papers
- Theory on Score-Mismatched Diffusion Models and Zero-Shot Conditional Samplers [49.97755400231656]
We present the first performance guarantees with explicit dimensional dependence for general score-mismatched diffusion samplers.
We show that score mismatches result in a distributional bias between the target and sampling distributions, proportional to the accumulated mismatch between the target and training distributions.
This result can be directly applied to zero-shot conditional samplers for any conditional model, irrespective of measurement noise.
arXiv Detail & Related papers (2024-10-17T16:42:12Z)
- Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios.
We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)
- Contrastive Learning for Fair Representations [50.95604482330149]
Trained classification models can unintentionally lead to biased representations and predictions.
Existing debiasing methods for classification models, such as adversarial training, are often expensive to train and difficult to optimise.
We propose a method for mitigating bias by incorporating contrastive learning, in which instances sharing the same class label are encouraged to have similar representations.
arXiv Detail & Related papers (2021-09-22T10:47:51Z)
- Distribution of Classification Margins: Are All Data Equal? [61.16681488656473]
We motivate theoretically and show empirically that the area under the curve of the margin distribution on the training set is in fact a good measure of generalization.
The resulting subset of "high capacity" features is not consistent across different training runs.
arXiv Detail & Related papers (2021-07-21T16:41:57Z)
- Two-stage Training for Learning from Label Proportions [18.78148397471913]
Learning from label proportions (LLP) aims at learning an instance-level classifier with label proportions in grouped training data.
We introduce the mixup strategy and symmetric crossentropy to further reduce the label noise.
Our framework is model-agnostic, and demonstrates compelling performance improvement in extensive experiments.
arXiv Detail & Related papers (2021-05-22T03:55:35Z)
- Beyond cross-entropy: learning highly separable feature distributions for robust and accurate classification [22.806324361016863]
We propose a novel approach for training deep robust multiclass classifiers that provides adversarial robustness.
We show that the regularization of the latent space based on our approach yields excellent classification accuracy.
arXiv Detail & Related papers (2020-10-29T11:15:17Z)
- Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity [26.518803984578867]
Training neural network models with discrete (categorical or structured) latent variables can be computationally challenging.
One typically resorts to sampling-based approximations of the true marginal.
We propose a new training strategy which replaces these estimators by an exact yet efficient marginalization.
arXiv Detail & Related papers (2020-07-03T19:36:35Z)
- Good Classifiers are Abundant in the Interpolating Regime [64.72044662855612]
We develop a methodology to compute precisely the full distribution of test errors among interpolating classifiers.
We find that test errors tend to concentrate around a small typical value $\varepsilon^*$, which deviates substantially from the test error of the worst-case interpolating model.
Our results show that the usual style of analysis in statistical learning theory may not be fine-grained enough to capture the good generalization performance observed in practice.
arXiv Detail & Related papers (2020-06-22T21:12:31Z)
- Learning Diverse Representations for Fast Adaptation to Distribution Shift [78.83747601814669]
We present a method for learning multiple models, incorporating an objective that pressures each to learn a distinct way to solve the task.
We demonstrate our framework's ability to facilitate rapid adaptation to distribution shift.
arXiv Detail & Related papers (2020-06-12T12:23:50Z)
- Adversarial Classification via Distributional Robustness with Wasserstein Ambiguity [12.576828231302134]
Under Wasserstein ambiguity, the model aims to minimize the value-at-risk of misclassification.
We show that, despite the nonconvexity of this formulation, standard descent methods appear to converge for this problem.
arXiv Detail & Related papers (2020-05-28T07:28:47Z)