Multi-View Feature Representation for Dialogue Generation with
Bidirectional Distillation
- URL: http://arxiv.org/abs/2102.10780v1
- Date: Mon, 22 Feb 2021 05:23:34 GMT
- Title: Multi-View Feature Representation for Dialogue Generation with
Bidirectional Distillation
- Authors: Shaoxiong Feng, Xuancheng Ren, Kan Li, Xu Sun
- Abstract summary: We propose a novel training framework, where the learning of general knowledge is more in line with the idea of reaching consensus.
Our framework effectively improves the model generalization without sacrificing training efficiency.
- Score: 22.14228918338769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural dialogue models suffer from low-quality responses when
interacting with users in practice, demonstrating difficulty in generalizing beyond the training data.
Recently, knowledge distillation has been used to successfully regularize the
student by transferring knowledge from the teacher. However, the teacher and
the student are trained on the same dataset and tend to learn similar feature
representations, whereas the most general knowledge should be found through
differences. The finding of general knowledge is further hindered by the
unidirectional distillation, as the student should obey the teacher and may
discard some knowledge that is truly general but refuted by the teacher. To
this end, we propose a novel training framework, where the learning of general
knowledge is more in line with the idea of reaching consensus, i.e., finding
common knowledge that is beneficial across all of the different datasets through
diversified learning partners. Concretely, the training task is divided into a
group of subtasks, with one student assigned to each. Each student is not only
optimized on its allocated subtask but also imitates the multi-view feature
representation aggregated from the other students (i.e., student peers), which
induces the students to capture common knowledge shared among the different
subtasks and alleviates over-fitting to the allocated subtasks.
To further enhance generalization, we extend unidirectional distillation to
bidirectional distillation, which encourages the student and its student
peers to co-evolve by exchanging complementary knowledge with each other.
Empirical results and analysis demonstrate that our training framework
effectively improves the model generalization without sacrificing training
efficiency.
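Below is a minimal, hedged sketch of how one training step of the framework described in the abstract could look. It is not the authors' released code: the `student(x) -> (features, logits)` interface, the mean aggregation of peer features, the MSE imitation loss, and the `lambda_distill` weight are illustrative assumptions made for this sketch.

```python
# Minimal sketch of one training step of multi-view, bidirectional distillation
# among a group of student models, as described in the abstract above.
# Assumptions (not from the paper): each student returns (features, logits),
# peer features are aggregated by a simple mean, imitation uses an MSE loss,
# and lambda_distill balances the task loss against the imitation loss.
import torch
import torch.nn.functional as F

def multi_view_distillation_step(students, optimizers, subtask_batches,
                                 lambda_distill=1.0):
    """students[i] is trained on subtask_batches[i] while imitating the
    multi-view feature representation aggregated from its peers."""
    # 1) Forward each student on its allocated subtask.
    features, task_losses = [], []
    for student, (x, y) in zip(students, subtask_batches):
        feat, logits = student(x)                      # assumed interface
        features.append(feat)
        task_losses.append(F.cross_entropy(logits, y))

    # 2) Each student also imitates the aggregated peer view. Detaching the
    #    peers' features makes each imitation term update only student i;
    #    bidirectionality comes from every student having a symmetric term.
    total_loss = 0.0
    for i, feat_i in enumerate(features):
        peer_view = torch.stack(
            [f.detach() for j, f in enumerate(features) if j != i]
        ).mean(dim=0)
        imitation_loss = F.mse_loss(feat_i, peer_view)
        total_loss = total_loss + task_losses[i] + lambda_distill * imitation_loss

    # 3) Jointly update all students.
    for opt in optimizers:
        opt.zero_grad()
    total_loss.backward()
    for opt in optimizers:
        opt.step()
    return total_loss.item()
```

Detaching the aggregated peer view keeps each imitation term a one-way pull on a single student, while the symmetric terms across all students realize the bidirectional exchange of complementary knowledge described above.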
Related papers
- Tailoring Instructions to Student's Learning Levels Boosts Knowledge Distillation [52.53446712834569]
Learning Good Teacher Matters (LGTM) is an efficient training technique for incorporating distillation influence into the teacher's learning process.
Our LGTM outperforms 10 common knowledge distillation baselines on 6 text classification tasks in the GLUE benchmark.
arXiv Detail & Related papers (2023-05-16T17:50:09Z)
- Automated Graph Self-supervised Learning via Multi-teacher Knowledge Distillation [43.903582264697974]
This paper studies the problem of how to automatically, adaptively, and dynamically learn instance-level self-supervised learning strategies for each node.
We propose a novel multi-teacher knowledge distillation framework for Automated Graph Self-Supervised Learning (AGSSL)
Experiments on eight datasets show that AGSSL can benefit from multiple pretext tasks, outperforming the corresponding individual tasks.
arXiv Detail & Related papers (2022-10-05T08:39:13Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of forcing the teacher to work on the same task as the student, we borrow the knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Does Knowledge Distillation Really Work? [106.38447017262183]
We show that while knowledge distillation can improve student generalization, it does not typically work as it is commonly understood.
We identify difficulties in optimization as a key reason for why the student is unable to match the teacher.
arXiv Detail & Related papers (2021-06-10T17:44:02Z)
- Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is very flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z)
- Distilling Knowledge via Intermediate Classifier Heads [0.5584060970507505]
Knowledge distillation is a transfer-learning approach to train a resource-limited student model under the guidance of a larger pre-trained teacher model.
We introduce knowledge distillation via intermediate heads to mitigate the impact of the capacity gap.
Our experiments on various teacher-student pairs and datasets have demonstrated that the proposed approach outperforms the canonical knowledge distillation approach.
arXiv Detail & Related papers (2021-02-28T12:52:52Z)
- Collaborative Group Learning [42.31194030839819]
Collaborative learning has successfully applied knowledge transfer to guide a pool of small student networks towards robust local minima.
Previous approaches typically struggle with drastically aggravated student homogenization when the number of students rises.
We propose Collaborative Group Learning, an efficient framework that aims to diversify the feature representation and conduct an effective regularization.
arXiv Detail & Related papers (2020-09-16T14:34:39Z)
- Dual Policy Distillation [58.43610940026261]
Policy distillation, which transfers a teacher policy to a student policy, has achieved great success in challenging tasks of deep reinforcement learning.
In this work, we introduce dual policy distillation (DPD), a student-student framework in which two learners operate on the same environment to explore different perspectives of the environment.
The key challenge in developing this dual learning framework is to identify the beneficial knowledge from the peer learner for contemporary learning-based reinforcement learning algorithms.
arXiv Detail & Related papers (2020-06-07T06:49:47Z)
- Role-Wise Data Augmentation for Knowledge Distillation [48.115719640111394]
Knowledge Distillation (KD) is a common method for transferring the "knowledge" learned by one machine learning model into another.
We design data augmentation agents with distinct roles to facilitate knowledge distillation.
We find empirically that specially tailored data points enable the teacher's knowledge to be demonstrated more effectively to the student.
arXiv Detail & Related papers (2020-04-19T14:22:17Z)