Efficient One Pass Self-distillation with Zipf's Label Smoothing
- URL: http://arxiv.org/abs/2207.12980v1
- Date: Tue, 26 Jul 2022 15:40:16 GMT
- Title: Efficient One Pass Self-distillation with Zipf's Label Smoothing
- Authors: Jiajun Liang, Linze Li, Zhaodong Bing, Borui Zhao, Yao Tang, Bo Lin and Haoqiang Fan
- Abstract summary: Self-distillation exploits non-uniform soft supervision from itself during training and improves performance without any runtime cost.
This paper proposes Zipf's Label Smoothing (Zipf's LS), which uses the on-the-fly prediction of a network to generate soft supervision that conforms to a Zipf distribution.
Our technique achieves a +3.61% accuracy gain over the vanilla baseline and a 0.88% larger gain than previous label smoothing or self-distillation strategies.
- Score: 12.626049767353386
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-distillation exploits non-uniform soft supervision from itself during
training and improves performance without any runtime cost. However, the
overhead during training is often overlooked, and yet reducing time and memory
overhead during training is increasingly important in the giant models' era.
This paper proposes an efficient self-distillation method named Zipf's Label
Smoothing (Zipf's LS), which uses the on-the-fly prediction of a network to
generate soft supervision that conforms to a Zipf distribution without using any
contrastive samples or auxiliary parameters. Our idea comes from an empirical
observation that when the network is duly trained, the output values of the
network's final softmax layer, after sorting by magnitude and averaging
across samples, should follow a distribution reminiscent of Zipf's Law in the
word frequency statistics of natural languages. By enforcing this property on
the sample level and throughout the whole training period, we find that the
prediction accuracy can be greatly improved. Using ResNet50 on the INAT21
fine-grained classification dataset, our technique achieves a +3.61% accuracy
gain over the vanilla baseline and a 0.88% larger gain than previous
label smoothing or self-distillation strategies. The implementation is publicly
available at https://github.com/megvii-research/zipfls.
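The abstract describes the mechanism only at a high level, so the following is a minimal sketch of how such a Zipf-shaped soft target could be built from the network's own on-the-fly ranking, assuming a PyTorch classifier; the function names, the Zipf exponent `power`, and the loss weight `alpha` are illustrative assumptions, not the authors' reference implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

def zipf_soft_targets(logits, targets, power=1.0):
    """Sketch (not the official implementation): build per-sample soft labels
    whose non-target probabilities follow a Zipf power law 1/k^power,
    ordered by the network's own on-the-fly ranking of the classes."""
    n, c = logits.shape
    ranks = logits.argsort(dim=1, descending=True)      # class indices, most to least confident
    zipf = 1.0 / torch.arange(1, c, device=logits.device, dtype=logits.dtype) ** power
    zipf = zipf / zipf.sum()                            # (C-1,) weights summing to 1
    soft = torch.zeros_like(logits)
    for i in range(n):
        non_target = ranks[i][ranks[i] != targets[i]]   # drop the ground-truth class, keep the order
        soft[i, non_target] = zipf
    return soft                                         # zero at the target position

def zipf_ls_loss(logits, targets, alpha=1.0):
    """Hard-label cross-entropy plus a term pulling the non-target part of
    the prediction toward the Zipf-shaped soft target."""
    ce = F.cross_entropy(logits, targets)
    soft = zipf_soft_targets(logits.detach(), targets)  # stop-gradient: the ranking only supervises
    log_p = F.log_softmax(logits, dim=1)
    zipf_term = -(soft * log_p).sum(dim=1).mean()
    return ce + alpha * zipf_term
```

Because the soft target is derived from the sample's own sorted prediction rather than from a teacher network, contrastive samples, or auxiliary heads, the extra cost is essentially a single sort per sample; consult the repository above for the exact loss used in the paper.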
Related papers
- Enhancing Zero-Shot Vision Models by Label-Free Prompt Distribution Learning and Bias Correcting [55.361337202198925]
Vision-language models, such as CLIP, have shown impressive generalization capacities when using appropriate text descriptions.
We propose a label-free prompt distribution learning and bias correction framework, dubbed **Frolic**, which boosts zero-shot performance without the need for labeled data.
arXiv Detail & Related papers (2024-10-25T04:00:45Z)
- Co-training for Low Resource Scientific Natural Language Inference [65.37685198688538]
We propose a novel co-training method that assigns weights to the distantly supervised labels based on the training dynamics of the classifiers.
By assigning importance weights instead of filtering out examples based on an arbitrary threshold on the predicted confidence, we maximize the usage of automatically labeled data.
The proposed method obtains an improvement of 1.5% in Macro F1 over the distant supervision baseline, and substantial improvements over several other strong SSL baselines.
arXiv Detail & Related papers (2024-06-20T18:35:47Z)
- KAKURENBO: Adaptively Hiding Samples in Deep Neural Network Training [2.8804804517897935]
We propose a method for hiding the least-important samples during the training of deep neural networks.
We adaptively find samples to exclude in a given epoch based on their contribution to the overall learning process.
Our method can reduce total training time by up to 22% while impacting accuracy by only 0.4% compared to the baseline.
arXiv Detail & Related papers (2023-10-16T06:19:29Z)
- Federated Dynamic Sparse Training: Computing Less, Communicating Less, Yet Learning Better [88.28293442298015]
Federated learning (FL) enables distribution of machine learning workloads from the cloud to resource-limited edge devices.
We develop, implement, and experimentally validate a novel FL framework termed Federated Dynamic Sparse Training (FedDST).
FedDST is a dynamic process that extracts and trains sparse sub-networks from the target full network.
arXiv Detail & Related papers (2021-12-18T02:26:38Z)
- A new weakly supervised approach for ALS point cloud semantic segmentation [1.4620086904601473]
We propose a deep-learning based weakly supervised framework for semantic segmentation of ALS point clouds.
We exploit potential information from unlabeled data subject to incomplete and sparse labels.
Our method achieves an overall accuracy of 83.0% and an average F1 score of 70.0%, improvements of 6.9% and 12.8%, respectively.
arXiv Detail & Related papers (2021-10-04T14:00:23Z)
- Deep Semi-supervised Knowledge Distillation for Overlapping Cervical Cell Instance Segmentation [54.49894381464853]
We propose to leverage both labeled and unlabeled data for instance segmentation with improved accuracy by knowledge distillation.
We propose a novel Mask-guided Mean Teacher framework with Perturbation-sensitive Sample Mining.
Experiments show that the proposed method improves the performance significantly compared with the supervised method learned from labeled data only.
arXiv Detail & Related papers (2020-07-21T13:27:09Z)
- Uncertainty-aware Self-training for Text Classification with Few Labels [54.13279574908808]
We study self-training as one of the earliest semi-supervised learning approaches to reduce the annotation bottleneck.
We propose an approach to improve self-training by incorporating uncertainty estimates of the underlying neural network.
We show that our methods, leveraging only 20-30 labeled samples per class per task for training and validation, can perform within 3% of fully supervised pre-trained language models.
arXiv Detail & Related papers (2020-06-27T08:13:58Z)
- Stochastic Batch Augmentation with An Effective Distilled Dynamic Soft Label Regularizer [11.153892464618545]
We propose a framework called Stochastic Batch Augmentation (SBA) to address these problems.
SBA decides whether to augment at each iteration under the control of a batch scheduler, and introduces a "distilled" dynamic soft label regularization.
Our experiments on CIFAR-10, CIFAR-100, and ImageNet show that SBA can improve the generalization of the neural networks and speed up the convergence of network training.
arXiv Detail & Related papers (2020-06-27T04:46:39Z)
- Omni-supervised Facial Expression Recognition via Distilled Data [120.11782405714234]
We propose omni-supervised learning to exploit reliable samples in a large amount of unlabeled data for network training.
We experimentally verify that the resulting dataset can significantly improve the ability of the learned FER model.
However, training on the enlarged dataset is computationally costly; to tackle this, we propose to apply a dataset distillation strategy to compress the created dataset into several informative class-wise images.
arXiv Detail & Related papers (2020-05-18T09:36:51Z)
- Regularization via Structural Label Smoothing [22.74769739125912]
Regularization is an effective way to promote the generalization performance of machine learning models.
In this paper, we focus on label smoothing, a form of output distribution regularization that prevents overfitting of a neural network.
We show that such label smoothing imposes a quantifiable bias in the Bayes error rate of the training data.
arXiv Detail & Related papers (2020-01-07T05:45:18Z)
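Several of the entries above, like the main paper, build on plain label smoothing; for reference, a minimal sketch of the standard uniform formulation (the textbook baseline, not the structural variant from the paper above) looks like this:

```python
import torch
import torch.nn.functional as F

def uniform_label_smoothing_loss(logits, targets, epsilon=0.1):
    """Textbook uniform label smoothing: the training target is
    (1 - eps) * one_hot + eps * uniform over all classes."""
    n_classes = logits.size(1)
    log_p = F.log_softmax(logits, dim=1)
    smooth = torch.full_like(log_p, epsilon / n_classes)
    smooth.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon + epsilon / n_classes)
    return -(smooth * log_p).sum(dim=1).mean()
```

Zipf's LS and the structural variant above instead shape the non-target mass non-uniformly rather than spreading it evenly.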
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.