Causal Distillation for Language Models
- URL: http://arxiv.org/abs/2112.02505v1
- Date: Sun, 5 Dec 2021 08:13:09 GMT
- Title: Causal Distillation for Language Models
- Authors: Zhengxuan Wu, Atticus Geiger, Josh Rozner, Elisa Kreiss, Hanson Lu,
Thomas Icard, Christopher Potts, Noah D. Goodman
- Abstract summary: We show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher.
Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia.
- Score: 23.68246698789134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distillation efforts have led to language models that are more compact and
efficient without serious drops in performance. The standard approach to
distillation trains a student model against two objectives: a task-specific
objective (e.g., language modeling) and an imitation objective that encourages
the hidden states of the student model to be similar to those of the larger
teacher model. In this paper, we show that it is beneficial to augment
distillation with a third objective that encourages the student to imitate the
causal computation process of the teacher through interchange intervention
training (IIT). IIT pushes the student model to become a causal abstraction of
the teacher model - a simpler model with the same causal structure. IIT is
fully differentiable, easily implemented, and combines flexibly with other
objectives. Compared with standard distillation of BERT, distillation via IIT
results in lower perplexity on Wikipedia (masked language modeling) and marked
improvements on the GLUE benchmark (natural language understanding), SQuAD
(question answering), and CoNLL-2003 (named entity recognition).
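As a concrete illustration of how the three objectives fit together, below is a minimal PyTorch-style sketch of the combined training loss. This is not the authors' released code: the specific loss choices (MSE for hidden-state imitation, KL for the counterfactual match) and the weights are illustrative assumptions, and the counterfactual logits are assumed to come from running both models under the same interchange intervention (the intervention itself is sketched under the "Inducing Causal Structure for Interpretable Neural Networks" entry below).

import torch.nn.functional as F

def distillation_loss(student_logits, labels,
                      student_hidden, teacher_hidden,
                      student_cf_logits, teacher_cf_logits,
                      w_task=1.0, w_imitate=1.0, w_iit=1.0):
    """Combine the three distillation objectives on one batch.

    student_cf_logits / teacher_cf_logits are the two models' predictions
    after the same interchange intervention (activations from a second,
    "source" input swapped into the forward pass on the base input) has
    been applied to both of them.
    """
    # (1) Task objective: masked-language-modeling cross-entropy.
    task = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

    # (2) Imitation objective: pull the student's hidden states toward the
    # teacher's (plain MSE here; cosine losses over aligned layers are a
    # common alternative).
    imitate = F.mse_loss(student_hidden, teacher_hidden.detach())

    # (3) IIT objective: the student's counterfactual predictions should
    # match the teacher's counterfactual predictions.
    iit = F.kl_div(F.log_softmax(student_cf_logits, dim=-1),
                   F.softmax(teacher_cf_logits.detach(), dim=-1),
                   reduction="batchmean")

    return w_task * task + w_imitate * imitate + w_iit * iit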
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Improving Neural Topic Models with Wasserstein Knowledge Distillation [0.8962460460173959]
We propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality.
Experiments show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model.
arXiv Detail & Related papers (2023-03-27T16:07:44Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
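One simple way to exploit such geometry (an illustrative assumption, not necessarily the paper's exact objective) is to map student embeddings into the teacher's space with a small learned projection and match both the embeddings and the query-document similarity structure:

import torch.nn.functional as F

def geometric_distillation_loss(s_query, s_doc, t_query, t_doc, proj):
    """`proj` maps student embeddings into the teacher's embedding space
    (e.g. a learned torch.nn.Linear); all names here are illustrative."""
    sq, sd = proj(s_query), proj(s_doc)
    # Preserve the teacher's query-document similarity structure ...
    score_loss = F.mse_loss(sq @ sd.t(), (t_query @ t_doc.t()).detach())
    # ... and place student embeddings near the teacher's.
    embed_loss = F.mse_loss(sq, t_query.detach()) + F.mse_loss(sd, t_doc.detach())
    return score_loss + embed_loss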
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Inducing Causal Structure for Interpretable Neural Networks [23.68246698789134]
We present the new method of interchange intervention training (IIT).
In IIT, we (1) align variables in the causal model with representations in the neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input.
IIT is fully differentiable, flexibly combines with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model.
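The interchange operation itself can be made concrete with PyTorch forward hooks. The sketch below is an illustration, not the paper's released code; it assumes the causal variable has been aligned with a slice (dimensions dims) of one module's output tensor, and that the base and source batches have matching shapes.

import torch

def interchange(model: torch.nn.Module, layer: torch.nn.Module,
                base_inputs, source_inputs, dims):
    """Run `model` on `base_inputs`, but with the hidden dimensions in
    `dims` at `layer` overwritten by the activations that `source_inputs`
    produces at the same layer."""
    cache = {}

    def save_source(module, inputs, output):
        cache["source"] = output

    def swap_into_base(module, inputs, output):
        patched = output.clone()
        patched[..., dims] = cache["source"][..., dims]
        return patched  # returned value replaces the layer's output

    # Pass 1: record the source activations at the aligned layer.
    handle = layer.register_forward_hook(save_source)
    model(source_inputs)
    handle.remove()

    # Pass 2: rerun on the base input with the aligned slice swapped in.
    handle = layer.register_forward_hook(swap_into_base)
    counterfactual_output = model(base_inputs)
    handle.remove()
    return counterfactual_output

For distillation, the same base/source pair and the same aligned slice are interchanged in both teacher and student, and the student is trained so that its counterfactual predictions match the teacher's.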
arXiv Detail & Related papers (2021-12-01T21:07:01Z)
- Localization Distillation for Object Detection [79.78619050578997]
We propose localization distillation (LD) for object detection.
Our LD can be formulated as standard KD by adopting the general localization representation of bounding boxes.
We suggest a teacher assistant (TA) strategy to fill the possible gap between the teacher and student models.
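As a rough sketch of that formulation (an illustration, assuming the localization representation is a discrete distribution over candidate positions for each box edge), standard temperature-scaled KD can be applied directly to those distributions:

import torch.nn.functional as F

def localization_distillation_loss(student_edge_logits, teacher_edge_logits, tau=10.0):
    """KL divergence between teacher and student edge distributions.

    Both inputs are assumed to have shape (num_boxes, 4, num_bins): one
    discrete distribution per box edge (left, top, right, bottom).
    """
    s = F.log_softmax(student_edge_logits / tau, dim=-1)
    t = F.softmax(teacher_edge_logits.detach() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau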
arXiv Detail & Related papers (2021-02-24T12:26:21Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout distillation, and most existing methods allocate an equal weight to every teacher.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from the teacher models can lead to better performance of the distilled students.
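A minimal sketch of that setting (illustrative only; the paper's reinforcement-learned selection is abstracted into the weights argument, which could just as well be uniform to recover equal weighting):

import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, tau=2.0):
    """Weighted sum of per-teacher, per-example KD terms.

    weights: iterable of tensors of shape (batch,), one per teacher, giving
    that teacher's importance on each example.
    """
    s_log_probs = F.log_softmax(student_logits / tau, dim=-1)
    loss = 0.0
    for t_logits, w in zip(teacher_logits_list, weights):
        t_probs = F.softmax(t_logits.detach() / tau, dim=-1)
        per_example = F.kl_div(s_log_probs, t_probs, reduction="none").sum(-1)
        loss = loss + (w * per_example).mean()
    return loss * tau * tau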
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
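A minimal sketch of such a contrastive objective on intermediate representations (an illustration in this spirit, not CoDIR's exact formulation, which constructs its negative set more carefully): the student's pooled hidden state for an example is pulled toward the teacher's state for the same example and pushed away from the teacher's states for the other examples in the batch.

import torch
import torch.nn.functional as F

def contrastive_intermediate_loss(student_hidden, teacher_hidden, temperature=0.1):
    """InfoNCE-style loss over a batch.

    Both inputs are assumed to have shape (batch, dim), e.g. mean-pooled
    intermediate-layer states projected to a shared dimensionality.
    """
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden.detach(), dim=-1)
    logits = s @ t.t() / temperature                      # (batch, batch)
    targets = torch.arange(s.size(0), device=s.device)    # positives on diagonal
    return F.cross_entropy(logits, targets)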
arXiv Detail & Related papers (2020-09-29T17:31:43Z)