Structural Knowledge Distillation: Tractably Distilling Information for
Structured Predictor
- URL: http://arxiv.org/abs/2010.05010v4
- Date: Wed, 2 Jun 2021 02:31:19 GMT
- Title: Structural Knowledge Distillation: Tractably Distilling Information for
Structured Predictor
- Authors: Xinyu Wang, Yong Jiang, Zhaohui Yan, Zixia Jia, Nguyen Bach, Tao Wang,
Zhongqiang Huang, Fei Huang, Kewei Tu
- Abstract summary: The objective function of knowledge distillation is typically the cross-entropy between the teacher's and the student's output distributions.
For structured prediction problems, the output space is exponential in size.
We show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models.
- Score: 70.71045044998043
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Knowledge distillation is a critical technique to transfer knowledge between
models, typically from a large model (the teacher) to a more fine-grained one
(the student). The objective function of knowledge distillation is typically
the cross-entropy between the teacher's and the student's output distributions.
However, for structured prediction problems, the output space is exponential in
size; therefore, the cross-entropy objective becomes intractable to compute and
optimize directly. In this paper, we derive a factorized form of the knowledge
distillation objective for structured prediction, which is tractable for many
typical choices of the teacher and student models. In particular, we show the
tractability and empirical effectiveness of structural knowledge distillation
between sequence labeling and dependency parsing models under four different
scenarios: 1) the teacher and student share the same factorization form of the
output structure scoring function; 2) the student factorization produces more
fine-grained substructures than the teacher factorization; 3) the teacher
factorization produces more fine-grained substructures than the student
factorization; 4) the factorization forms from the teacher and the student are
incompatible.
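
As a rough illustration of the factorized objective described above (a sketch based on the abstract, not the paper's exact notation): writing Y(x) for the output space and U(x) for the set of substructures, the cross-entropy objective can be rewritten in terms of teacher substructure marginals whenever the student's score decomposes additively over substructures.

```latex
% Sequence-level KD objective: cross-entropy over the (exponential) output space
L_KD(x) = - \sum_{y \in Y(x)} P_T(y \mid x) \, \log P_S(y \mid x)

% Assume the student's score decomposes additively over substructures u of y:
%   \log P_S(y \mid x) = \sum_{u \in y} s_S(u, x) - \log Z_S(x)
% Substituting and exchanging the order of summation gives
L_KD(x) = - \sum_{u \in U(x)} P_T(u \mid x) \, s_S(u, x) + \log Z_S(x)

% where P_T(u \mid x) = \sum_{y \,:\, u \in y} P_T(y \mid x) is the teacher's
% marginal probability of substructure u.
```

Both terms are tractable when the teacher's substructure marginals and the student's partition function can be computed with dynamic programming (e.g., forward-backward for linear-chain models or the inside algorithm for projective dependency trees); the four scenarios above differ in how the teacher's and student's substructures relate to each other.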
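Below is a minimal, hypothetical sketch (not the authors' released code) of the simplest case, scenario 1 with locally normalized sequence labelers: here the sequence-level cross-entropy reduces exactly to a sum of token-level cross-entropies between the teacher's per-token marginals and the student's per-token distributions. The function and tensor names are illustrative assumptions.

```python
# Hypothetical sketch of structural KD for locally normalized sequence labelers.
import torch
import torch.nn.functional as F

def structural_kd_loss(student_logits: torch.Tensor,
                       teacher_marginals: torch.Tensor,
                       mask: torch.Tensor) -> torch.Tensor:
    """
    student_logits:    (batch, seq_len, num_labels) unnormalized student scores
    teacher_marginals: (batch, seq_len, num_labels) teacher per-token posteriors
                       (for a CRF teacher these would come from forward-backward)
    mask:              (batch, seq_len) 1 for real tokens, 0 for padding
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    # Token-level cross-entropy H(p_teacher, p_student), summed over labels.
    token_ce = -(teacher_marginals * student_log_probs).sum(dim=-1)
    # Average over non-padding tokens.
    return (token_ce * mask).sum() / mask.sum()

if __name__ == "__main__":
    torch.manual_seed(0)
    B, T, L = 2, 5, 4
    student_logits = torch.randn(B, T, L, requires_grad=True)
    teacher_marginals = F.softmax(torch.randn(B, T, L), dim=-1)
    mask = torch.ones(B, T)
    loss = structural_kd_loss(student_logits, teacher_marginals, mask)
    loss.backward()
    print(float(loss))
```

The more general scenarios (mismatched or incompatible factorizations) would additionally require marginalizing teacher probabilities onto the student's substructures, which this sketch does not cover.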
Related papers
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- Generalized Knowledge Distillation via Relationship Matching [53.69235109551099]
Knowledge of a well-trained deep neural network (a.k.a. the "teacher") is valuable for learning similar tasks.
Knowledge distillation extracts knowledge from the teacher and integrates it with the target model.
Instead of requiring the teacher to work on the same task as the student, we borrow knowledge from a teacher trained on a general label space.
arXiv Detail & Related papers (2022-05-04T06:49:47Z)
- Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning [76.00395335702572]
A central goal for AI and causality is the joint discovery of abstract representations and causal structure.
Existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs.
In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them.
arXiv Detail & Related papers (2021-07-02T05:44:56Z)
- Towards Understanding Knowledge Distillation [37.71779364624616]
Knowledge distillation is an empirically very successful technique for knowledge transfer between classifiers.
There is no satisfactory theoretical explanation of this phenomenon.
We provide the first insights into the working mechanisms of distillation by studying the special case of linear and deep linear classifiers.
arXiv Detail & Related papers (2021-05-27T12:45:08Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes a lower bound on the mutual information between the teacher and student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Causal Structure Learning: a Bayesian approach based on random graphs [0.0]
We take advantage of the expressiveness of graphs to model the uncertainty about the existence of causal relationships.
We adopt a Bayesian point of view to capture a causal structure through interaction with, and learning from, a causal environment.
arXiv Detail & Related papers (2020-10-13T04:13:06Z)
- Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z)
- Understanding and Improving Knowledge Distillation [13.872105118381938]
Knowledge Distillation (KD) is a model-agnostic technique for improving model quality under a fixed capacity budget.
This paper categorizes the teacher's knowledge into three hierarchical levels and studies its effects on knowledge distillation.
arXiv Detail & Related papers (2020-02-10T04:21:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.