Logit-Based Ensemble Distribution Distillation for Robust Autoregressive Sequence Uncertainties
- URL: http://arxiv.org/abs/2305.10384v1
- Date: Wed, 17 May 2023 17:21:10 GMT
- Title: Logit-Based Ensemble Distribution Distillation for Robust Autoregressive Sequence Uncertainties
- Authors: Yassir Fathullah, Guoxuan Xia, Mark Gales
- Abstract summary: We investigate Ensemble Distribution Distillation (EDD) applied to large-scale natural language sequence-to-sequence data.
EDD aims to compress the superior uncertainty performance of an expensive (teacher) ensemble into a cheaper (student) single model.
We show, for modern transformer architectures on large-scale translation tasks, that modelling the ensemble logits, instead of softmax probabilities, leads to significantly better students.
- Score: 4.8986598953553555
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Efficiently and reliably estimating uncertainty is an important objective in
deep learning. It is especially pertinent to autoregressive sequence tasks,
where training and inference costs are typically very high. However, existing
research has predominantly focused on tasks with static data such as image
classification. In this work, we investigate Ensemble Distribution Distillation
(EDD) applied to large-scale natural language sequence-to-sequence data. EDD
aims to compress the superior uncertainty performance of an expensive (teacher)
ensemble into a cheaper (student) single model. Importantly, the ability to
separate knowledge (epistemic) and data (aleatoric) uncertainty is retained.
Existing probability-space approaches to EDD, however, are difficult to scale
to large vocabularies. We show, for modern transformer architectures on
large-scale translation tasks, that modelling the ensemble logits, instead of
softmax probabilities, leads to significantly better students. Moreover, the
students surprisingly even outperform Deep Ensembles by up to ~10% AUROC on
out-of-distribution detection, whilst matching them at in-distribution
translation.
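To make the logit-space idea concrete, below is a minimal PyTorch sketch of one way such a student could be trained and queried. The diagonal-Gaussian parameterisation, the plain Gaussian NLL objective, the Monte Carlo uncertainty decomposition, and all tensor names and shapes are illustrative assumptions, not the paper's exact training objective.

```python
# Minimal sketch of logit-space Ensemble Distribution Distillation (EDD).
# Assumed (not taken from the paper): a diagonal Gaussian over the ensemble
# members' logits, a Gaussian NLL loss, and Monte Carlo sampling for the
# uncertainty decomposition. Tensor names and shapes are illustrative only.
import torch
import torch.nn.functional as F


def logit_edd_loss(student_mean, student_logvar, ensemble_logits):
    """Gaussian NLL of the teacher ensemble's logits under the student.

    student_mean, student_logvar: [batch, seq, vocab]
    ensemble_logits:              [batch, seq, members, vocab]
    """
    mean = student_mean.unsqueeze(2)        # broadcast over ensemble members
    logvar = student_logvar.unsqueeze(2)
    nll = 0.5 * (logvar + (ensemble_logits - mean) ** 2 / logvar.exp())
    return nll.mean()


def decompose_uncertainty(student_mean, student_logvar, n_samples=16):
    """Split total uncertainty into aleatoric (data) and epistemic (knowledge).

    Draws categorical distributions from the student's Gaussian over logits:
      total     = entropy of the averaged predictive distribution
      aleatoric = average entropy of the sampled distributions
      epistemic = total - aleatoric  (mutual information)
    """
    std = (0.5 * student_logvar).exp()
    eps = torch.randn(n_samples, *student_mean.shape)
    probs = F.softmax(student_mean + eps * std, dim=-1)   # [S, batch, seq, vocab]

    mean_probs = probs.mean(dim=0)
    total = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(-1)
    aleatoric = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)
    return total, aleatoric, total - aleatoric


if __name__ == "__main__":
    batch, seq, members, vocab = 2, 5, 4, 1000
    mean = torch.randn(batch, seq, vocab)
    logvar = torch.zeros(batch, seq, vocab)
    teacher_logits = torch.randn(batch, seq, members, vocab)

    print("loss:", logit_edd_loss(mean, logvar, teacher_logits).item())
    total, aleatoric, epistemic = decompose_uncertainty(mean, logvar)
    print("epistemic shape:", epistemic.shape)   # [batch, seq]
```

The epistemic term here, the mutual information between the prediction and the sampled logit vector, is the kind of score an out-of-distribution detector would threshold, while the aleatoric term captures the intrinsic ambiguity of the data.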
Related papers
- FedUV: Uniformity and Variance for Heterogeneous Federated Learning [5.9330433627374815]
Federated learning is a promising framework to train neural networks with widely distributed data.
Recent work has shown that the performance degradation under heterogeneous data is largely due to the final layer of the network being most prone to local bias.
We investigate the training dynamics of the classifier by applying SVD to its weights, motivated by the observation that freezing the weights results in constant singular values.
arXiv Detail & Related papers (2024-02-27T15:53:15Z)
- Noisy Correspondence Learning with Self-Reinforcing Errors Mitigation [63.180725016463974]
Cross-modal retrieval relies on well-matched large-scale datasets that are laborious to collect in practice.
We introduce a novel noisy correspondence learning framework, namely Self-Reinforcing Errors Mitigation (SREM).
arXiv Detail & Related papers (2023-12-27T09:03:43Z)
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in the EER metric on the CN-Celeb evaluation set.
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Uncertainty-Aware Bootstrap Learning for Joint Extraction on Distantly-Supervised Data [36.54640096189285]
Bootstrap learning is motivated by the intuition that the higher the uncertainty of an instance, the more likely the model's confidence is inconsistent with the ground truth.
We first explore instance-level data uncertainty to create an initial set of high-confidence examples.
During bootstrap learning, we propose self-ensembling as a regularizer to alleviate inter-model uncertainty produced by noisy labels.
arXiv Detail & Related papers (2023-05-05T20:06:11Z)
- Implicit Counterfactual Data Augmentation for Robust Learning [24.795542869249154]
This study proposes an Implicit Counterfactual Data Augmentation method to remove spurious correlations and make stable predictions.
Experiments have been conducted across various biased learning scenarios covering both image and text datasets.
arXiv Detail & Related papers (2023-04-26T10:36:40Z)
- DUDES: Deep Uncertainty Distillation using Ensembles for Semantic Segmentation [11.099838952805325]
Quantifying predictive uncertainty is a promising way to open up the use of deep neural networks for safety-critical applications.
We present a novel approach for efficient and reliable uncertainty estimation which we call Deep Uncertainty Distillation using Ensembles (DUDES).
DUDES applies student-teacher distillation with a Deep Ensemble to accurately approximate predictive uncertainties with a single forward pass (a rough sketch of this recipe appears after this list).
arXiv Detail & Related papers (2023-03-17T08:56:27Z)
- BLISS: Robust Sequence-to-Sequence Learning via Self-Supervised Input Representation [92.75908003533736]
We propose a framework-level robust sequence-to-sequence learning approach, named BLISS, via self-supervised input representation.
We conduct comprehensive experiments to validate the effectiveness of BLISS on various tasks, including machine translation, grammatical error correction, and text summarization.
arXiv Detail & Related papers (2022-04-16T16:19:47Z)
- Scale-Equivalent Distillation for Semi-Supervised Object Detection [57.59525453301374]
Recent Semi-Supervised Object Detection (SS-OD) methods are mainly based on self-training, generating hard pseudo-labels by a teacher model on unlabeled data as supervisory signals.
We analyze the challenges these methods face using empirical experimental results.
We introduce a novel approach, Scale-Equivalent Distillation (SED), which is a simple yet effective end-to-end knowledge distillation framework robust to large object size variance and class imbalance.
arXiv Detail & Related papers (2022-03-23T07:33:37Z)
- Uncertainty Estimation for Language Reward Models [5.33024001730262]
Language models can learn a range of capabilities from unsupervised training on text corpora.
It is often easier for humans to choose between options than to provide labeled data, and prior work has achieved state-of-the-art performance by training a reward model from such preference comparisons.
We seek to address these problems via uncertainty estimation, which can improve sample efficiency and robustness using active learning and risk-averse reinforcement learning.
arXiv Detail & Related papers (2022-03-14T20:13:21Z)
- Learning while Respecting Privacy and Robustness to Distributional Uncertainties and Adversarial Data [66.78671826743884]
The distributionally robust optimization framework is considered for training a parametric model.
The objective is to endow the trained model with robustness against adversarially manipulated input data.
Proposed algorithms offer robustness with little overhead.
arXiv Detail & Related papers (2020-07-07T18:25:25Z)
- Diversity inducing Information Bottleneck in Model Ensembles [73.80615604822435]
In this paper, we target the problem of generating effective ensembles of neural networks by encouraging diversity in prediction.
We explicitly optimize a diversity inducing adversarial loss for learning latent variables and thereby obtain diversity in the output predictions necessary for modeling multi-modal data.
Compared to the most competitive baselines, we show significant improvements in classification accuracy under a shift in the data distribution.
arXiv Detail & Related papers (2020-03-10T03:10:41Z)
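Because DUDES (listed above) shares this paper's goal of replacing a Deep Ensemble's uncertainty estimates with a single forward pass, here is a rough sketch of that style of distillation. The separate per-pixel uncertainty head, the ensemble predictive-entropy target, and the L1 regression loss are assumptions chosen for illustration, not details taken from that paper.

```python
# Rough sketch of single-pass uncertainty distillation in the spirit of DUDES.
# Assumed (not taken from that paper): an extra per-pixel uncertainty head,
# the ensemble's predictive entropy as the regression target, and an L1 loss.
import torch
import torch.nn.functional as F


def uncertainty_distillation_loss(student_logits, student_uncertainty,
                                  ensemble_probs, labels, alpha=1.0):
    """student_logits:      [batch, classes, H, W]
    student_uncertainty:    [batch, H, W]   (predicted uncertainty map)
    ensemble_probs:         [batch, members, classes, H, W]
    labels:                 [batch, H, W]   (ground-truth class ids)
    """
    # Target: per-pixel entropy of the Deep Ensemble's mean prediction.
    mean_probs = ensemble_probs.mean(dim=1)
    target = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(dim=1)

    task = F.cross_entropy(student_logits, labels)      # usual supervised loss
    distill = F.l1_loss(student_uncertainty, target)    # uncertainty regression
    return task + alpha * distill
```

At test time the student's uncertainty head alone replaces the ensemble, which is the same efficiency argument the main paper makes for sequence models.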