Confidence Preservation Property in Knowledge Distillation Abstractions
- URL: http://arxiv.org/abs/2401.11365v1
- Date: Sun, 21 Jan 2024 01:37:25 GMT
- Title: Confidence Preservation Property in Knowledge Distillation Abstractions
- Authors: Dmitry Vengertsev, Elena Sherman
- Abstract summary: Social media platforms prevent malicious activities by detecting harmful content of posts and comments.
They employ large-scale deep neural network language models for sentiment analysis and content understanding.
Some models, like BERT, are complex, and have numerous parameters, which makes them expensive to operate and maintain.
Industry experts employ a knowledge distillation compression technique, where a distilled model is trained to reproduce the classification behavior of the original model.
- Score: 2.9370710299422598
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Social media platforms prevent malicious activities by detecting harmful
content of posts and comments. To that end, they employ large-scale deep neural
network language models for sentiment analysis and content understanding. Some
models, like BERT, are complex, and have numerous parameters, which makes them
expensive to operate and maintain. To overcome these deficiencies, industry
experts employ a knowledge distillation compression technique, where a
distilled model is trained to reproduce the classification behavior of the
original model. The distillation process terminates when the distillation
loss function reaches its stopping criterion. This function is mainly designed
to ensure that the original and the distilled models exhibit similar
classification behaviors. However, besides classification accuracy, there are
additional properties of the original model that the distilled model should
preserve to be considered an appropriate abstraction. In this work, we
explore whether distilled TinyBERT models preserve confidence values of the
original BERT models, and investigate how this confidence preservation property
could guide tuning hyperparameters of the distillation process.
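As a concrete illustration of the setup described in the abstract, the sketch below shows a standard temperature-scaled distillation loss together with a simple confidence-gap measure between a teacher (e.g., BERT) and a student (e.g., TinyBERT). This is a minimal PyTorch sketch under common assumptions, not the paper's implementation; the temperature T, the mixing weight alpha, and all tensor names are illustrative.

```python
# Minimal PyTorch sketch (not the paper's code): a standard temperature-scaled
# distillation loss plus a simple confidence-gap check between teacher and student.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft-label KL term (scaled by T^2) mixed with hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def confidence_gap(student_logits, teacher_logits):
    """Mean absolute difference in confidence (maximum softmax probability)."""
    s_conf = F.softmax(student_logits, dim=-1).max(dim=-1).values
    t_conf = F.softmax(teacher_logits, dim=-1).max(dim=-1).values
    return (s_conf - t_conf).abs().mean()

# Random logits stand in for BERT (teacher) and TinyBERT (student) outputs
# on a batch of 8 examples with 2 sentiment classes.
teacher_logits = torch.randn(8, 2)
student_logits = torch.randn(8, 2)
labels = torch.randint(0, 2, (8,))
print(distillation_loss(student_logits, teacher_logits, labels).item())
print(confidence_gap(student_logits, teacher_logits).item())
```

In this framing, the confidence preservation property the paper studies would correspond to the confidence gap remaining small once the distillation loss reaches its stopping criterion.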
Related papers
- Training on the Test Model: Contamination in Ranking Distillation [14.753216172912968]
We investigate the effect of a contaminated teacher model in a distillation setting.
We find that contamination occurs even when the test data represents a small fraction of the teacher's training samples.
arXiv Detail & Related papers (2024-11-04T17:11:14Z) - Generic-to-Specific Distillation of Masked Autoencoders [119.21281960831651]
We propose generic-to-specific distillation (G2SD) to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders.
With G2SD, the vanilla ViT-Small model achieves 98.7%, 98.1%, and 99.3% of the performance of its teacher for image classification, object detection, and semantic segmentation, respectively.
arXiv Detail & Related papers (2023-02-28T17:13:14Z) - Watermarking for Out-of-distribution Detection [76.20630986010114]
Out-of-distribution (OOD) detection aims to identify OOD data based on representations extracted from well-trained deep models.
We propose a general methodology named watermarking in this paper.
We learn a unified pattern that is superimposed onto features of original data, and the model's detection capability is largely boosted after watermarking.
arXiv Detail & Related papers (2022-10-27T06:12:32Z) - Normalized Feature Distillation for Semantic Segmentation [6.882655287146012]
We propose a simple yet effective feature distillation method called normalized feature distillation (NFD).
Our method achieves state-of-the-art distillation results for semantic segmentation on Cityscapes, VOC 2012, and ADE20K datasets.
arXiv Detail & Related papers (2022-07-12T01:54:25Z) - Explain, Edit, and Understand: Rethinking User Study Design for
Evaluating Model Explanations [97.91630330328815]
We conduct a crowdsourcing study, where participants interact with deception detection models that have been trained to distinguish between genuine and fake hotel reviews.
We observe that for a linear bag-of-words model, participants with access to the feature coefficients during training are able to cause a larger reduction in model confidence in the testing phase when compared to the no-explanation control.
arXiv Detail & Related papers (2021-12-17T18:29:56Z) - Why Can You Lay Off Heads? Investigating How BERT Heads Transfer [37.9520341259181]
The main goal of distillation is to create a task-agnostic pre-trained model that can be fine-tuned on downstream tasks without fine-tuning its full-sized version.
Despite the progress of distillation, to what degree and for what reason a task-agnostic model can be created from distillation has not been well studied.
This work focuses on analyzing how much performance degradation is acceptable during distillation, in order to guide future distillation procedures.
arXiv Detail & Related papers (2021-06-14T02:27:47Z) - Churn Reduction via Distillation [54.5952282395487]
We show an equivalence between training with distillation using the base model as the teacher and training with an explicit constraint on the predictive churn.
We then show that distillation performs strongly for low-churn training compared with a number of recent baselines.
arXiv Detail & Related papers (2021-06-04T18:03:31Z) - Contrastive Model Inversion for Data-Free Knowledge Distillation [60.08025054715192]
We propose Contrastive Model Inversion, where the data diversity is explicitly modeled as an optimizable objective.
Our main observation is that, under the constraint of the same amount of data, higher data diversity usually indicates stronger instance discrimination.
Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that CMI achieves significantly superior performance when the generated data are used for knowledge distillation.
arXiv Detail & Related papers (2021-05-18T15:13:00Z) - Self-Feature Regularization: Self-Feature Distillation Without Teacher
Models [0.0]
Self-Feature Regularization (SFR) is proposed, which uses features in the deep layers to supervise feature learning in the shallow layers.
We first use a generalization-L2 loss to match local features and a many-to-one approach to distill more intensively in the channel dimension.
arXiv Detail & Related papers (2021-03-12T15:29:00Z) - Pre-trained Summarization Distillation [121.14806854092672]
Recent work on distilling BERT for classification and regression tasks shows strong performance using direct knowledge distillation.
Alternatively, machine translation practitioners distill using pseudo-labeling, where a small model is trained on the translations of a larger model.
A third, simpler approach is to 'shrink and fine-tune' (SFT), which avoids any explicit distillation by copying parameters to a smaller student model and then fine-tuning (a minimal sketch of this idea appears after this list).
arXiv Detail & Related papers (2020-10-24T23:15:43Z)
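The 'shrink and fine-tune' entry above mentions initializing a smaller student by copying the teacher's parameters. The sketch below is a minimal, assumed illustration of that idea using Hugging Face transformers and BERT (the cited paper itself targets summarization models); the model name, layer count, and every-other-layer selection rule are illustrative choices, not the paper's recipe.

```python
# Minimal sketch (illustrative, not the paper's code) of 'shrink and fine-tune':
# build a 6-layer student and initialize it from a 12-layer BERT teacher by
# copying the embeddings and every other encoder layer, then fine-tune normally.
from transformers import BertConfig, BertModel

teacher = BertModel.from_pretrained("bert-base-uncased")  # 12 encoder layers

student_config = BertConfig.from_pretrained("bert-base-uncased", num_hidden_layers=6)
student = BertModel(student_config)

# Copy embeddings, pooler, and encoder layers 0, 2, 4, 6, 8, 10 from the teacher.
student.embeddings.load_state_dict(teacher.embeddings.state_dict())
student.pooler.load_state_dict(teacher.pooler.state_dict())
for student_idx, teacher_idx in enumerate(range(0, 12, 2)):
    student.encoder.layer[student_idx].load_state_dict(
        teacher.encoder.layer[teacher_idx].state_dict()
    )

# The student is then fine-tuned on the downstream task with the ordinary task
# loss; no explicit distillation objective is required in this approach.
```

By contrast, approaches like TinyBERT, the student architecture studied in the main paper, add explicit distillation losses on logits and intermediate representations rather than relying on parameter copying alone.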
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.