Self-Training with Differentiable Teacher
- URL: http://arxiv.org/abs/2109.07049v1
- Date: Wed, 15 Sep 2021 02:06:13 GMT
- Title: Self-Training with Differentiable Teacher
- Authors: Simiao Zuo, Yue Yu, Chen Liang, Haoming Jiang, Siawpeng Er, Chao
Zhang, Tuo Zhao, Hongyuan Zha
- Abstract summary: Self-training achieves enormous success in various semi-supervised and weakly-supervised learning tasks.
The method can be interpreted as a teacher-student framework, where the teacher generates pseudo-labels, and the student makes predictions.
We propose differentiable self-training, which treats the teacher-student interaction as a Stackelberg game.
- Score: 80.62757989797095
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Self-training achieves enormous success in various semi-supervised and
weakly-supervised learning tasks. The method can be interpreted as a
teacher-student framework, where the teacher generates pseudo-labels, and the
student makes predictions. The two models are updated alternately. However,
such a straightforward alternating update rule leads to training instability.
This is because a small change in the teacher may result in a significant
change in the student. To address this issue, we propose differentiable
self-training, which treats the teacher-student interaction as a Stackelberg
game. In this game, a leader is always in a more advantageous position than a
follower. In self-training, the student contributes to the prediction
performance, and the teacher controls the training process by generating
pseudo-labels. Therefore, we treat the student as the leader and the teacher as
the follower. The leader procures its advantage by acknowledging the follower's
strategy, which involves differentiable pseudo-labels and differentiable sample
weights. Consequently, the leader-follower interaction can be effectively
captured via Stackelberg gradient, obtained by differentiating the follower's
strategy. Experimental results on semi- and weakly-supervised classification
and named entity recognition tasks show that our model outperforms existing
approaches by large margins.
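To make the leader-follower structure concrete, the sketch below (PyTorch, plain tensors) shows one way a Stackelberg gradient can be computed in a self-training loop: the teacher (follower) takes a single unrolled gradient step that depends on the student, emits soft pseudo-labels and confidence-based sample weights, and the student (leader) back-propagates through that response. The teacher objective, confidence weighting, synthetic data, and hyperparameters here are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal, self-contained sketch of a Stackelberg (leader-follower) gradient for
# self-training. Plain tensors keep the unrolled teacher step on the autograd
# graph. The teacher objective, weighting scheme, and data are hypothetical.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, k = 16, 3                                          # feature dim, num classes
x_lab, y_lab = torch.randn(32, d), torch.randint(0, k, (32,))
x_unl = torch.randn(64, d)                            # unlabeled batch

w_student = torch.zeros(d, k, requires_grad=True)     # leader parameters
w_teacher = torch.zeros(d, k, requires_grad=True)     # follower parameters
lr_student, lr_teacher = 0.1, 0.1

for _ in range(100):
    # Follower's strategy: one unrolled teacher update that depends on the
    # student (here an assumed consistency term with the student's unlabeled
    # predictions). create_graph=True keeps this step differentiable.
    t_sup = F.cross_entropy(x_lab @ w_teacher, y_lab)
    t_cons = F.kl_div(F.log_softmax(x_unl @ w_teacher, dim=-1),
                      F.softmax(x_unl @ w_student, dim=-1), reduction="batchmean")
    g_teacher = torch.autograd.grad(t_sup + t_cons, w_teacher, create_graph=True)[0]
    w_teacher_resp = w_teacher - lr_teacher * g_teacher   # teacher's response

    # Differentiable pseudo-labels and sample weights from the responding teacher.
    pseudo = F.softmax(x_unl @ w_teacher_resp, dim=-1)
    weights = pseudo.max(dim=-1).values                   # confidence as weight

    # Leader's loss: supervised term + weighted pseudo-labeled term.
    s_sup = F.cross_entropy(x_lab @ w_student, y_lab)
    s_unl = -(weights * (pseudo * F.log_softmax(x_unl @ w_student, dim=-1)).sum(-1)).mean()
    leader_loss = s_sup + s_unl

    # Stackelberg gradient: differentiates through the teacher's response above.
    g_student = torch.autograd.grad(leader_loss, w_student)[0]

    with torch.no_grad():
        w_student -= lr_student * g_student
        w_teacher.copy_(w_teacher_resp.detach())          # commit follower's move
```

A single unrolled follower step is a common approximation of the follower's best response; it captures the leader-follower coupling at the cost of one extra backward pass per iteration.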
Related papers
- Switching Temporary Teachers for Semi-Supervised Semantic Segmentation [45.20519672287495]
The teacher-student framework, prevalent in semi-supervised semantic segmentation, mainly employs the exponential moving average (EMA) to update a single teacher's weights based on the student's (a minimal sketch of this EMA update appears after the related-papers list below).
This paper introduces Dual Teacher, a simple yet effective approach that employs dual temporary teachers aiming to alleviate the coupling problem for the student.
arXiv Detail & Related papers (2023-10-28T08:49:16Z)
- Distantly-Supervised Named Entity Recognition with Adaptive Teacher Learning and Fine-grained Student Ensemble [56.705249154629264]
Self-training teacher-student frameworks have been proposed to improve the robustness of NER models.
In this paper, we propose an adaptive teacher learning approach composed of two teacher-student networks.
Fine-grained student ensemble updates each fragment of the teacher model with a temporal moving average of the corresponding fragment of the student, which enhances consistent predictions on each model fragment against noise.
arXiv Detail & Related papers (2022-12-13T12:14:09Z)
- Learning from Future: A Novel Self-Training Framework for Semantic Segmentation [33.66516999361252]
Self-training has shown great potential in semi-supervised learning.
We propose a novel self-training strategy, which allows the model to learn from the future.
We experimentally demonstrate the effectiveness and superiority of our approach under a wide range of settings.
arXiv Detail & Related papers (2022-09-15T01:39:46Z)
- Toward Student-Oriented Teacher Network Training For Knowledge Distillation [40.55715466657349]
We propose a teacher training method, SoTeacher, which incorporates Lipschitz regularization and consistency regularization into ERM (empirical risk minimization).
Experiments on benchmark datasets using various knowledge distillation algorithms and teacher-student pairs confirm that SoTeacher can improve student accuracy consistently.
arXiv Detail & Related papers (2022-06-14T07:51:25Z)
- Label Matching Semi-Supervised Object Detection [85.99282969977541]
Semi-supervised object detection has made significant progress with the development of mean-teacher-driven self-training.
The label mismatch problem has not been fully explored in previous works, leading to severe confirmation bias during self-training.
We propose a simple yet effective LabelMatch framework from two different yet complementary perspectives.
arXiv Detail & Related papers (2022-06-14T05:59:41Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Adversarial Training as Stackelberg Game: An Unrolled Optimization Approach [91.74682538906691]
Adversarial training has been shown to improve the generalization performance of deep learning models.
We propose Stackelberg Adversarial Training (SALT), which formulates adversarial training as a Stackelberg game.
arXiv Detail & Related papers (2021-04-11T00:44:57Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Distilling Double Descent [65.85258126760502]
Distillation is the technique of training a "student" model based on examples that are labeled by a separate "teacher" model.
We show that, even when the teacher model is highly over-parameterized and provides hard labels, using a very large held-out unlabeled dataset can result in a model that outperforms more "traditional" approaches.
arXiv Detail & Related papers (2021-02-13T02:26:48Z)
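Several of the entries above (Switching Temporary Teachers, the fine-grained student ensemble, and mean-teacher-driven Label Matching) build on EMA teacher updates, in which the teacher is an exponential moving average of the student. The sketch below is a minimal version of that update in PyTorch; the momentum value and the toy model are illustrative assumptions rather than any single paper's configuration.

```python
# Minimal EMA (mean-teacher) weight update, as used by several papers above.
# The momentum value and model definition are illustrative assumptions.
import copy
import torch
import torch.nn as nn

student = nn.Linear(16, 3)
teacher = copy.deepcopy(student)            # teacher starts as a copy of the student
for p in teacher.parameters():
    p.requires_grad_(False)                 # the teacher is never trained directly

@torch.no_grad()
def ema_update(teacher: nn.Module, student: nn.Module, momentum: float = 0.999) -> None:
    """teacher <- momentum * teacher + (1 - momentum) * student, parameter-wise."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)
    for t_buf, s_buf in zip(teacher.buffers(), student.buffers()):
        t_buf.copy_(s_buf)                  # keep BatchNorm-style buffers in sync

# Called once after each student optimizer step:
ema_update(teacher, student)
```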