Causal Distillation for Language Models
- URL: http://arxiv.org/abs/2112.02505v1
- Date: Sun, 5 Dec 2021 08:13:09 GMT
- Title: Causal Distillation for Language Models
- Authors: Zhengxuan Wu, Atticus Geiger, Josh Rozner, Elisa Kreiss, Hanson Lu,
Thomas Icard, Christopher Potts, Noah D. Goodman
- Abstract summary: We show that it is beneficial to augment distillation with a third objective that encourages the student to imitate the causal computation process of the teacher.
Compared with standard distillation of BERT, distillation via IIT results in lower perplexity on Wikipedia.
- Score: 23.68246698789134
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Distillation efforts have led to language models that are more compact and
efficient without serious drops in performance. The standard approach to
distillation trains a student model against two objectives: a task-specific
objective (e.g., language modeling) and an imitation objective that encourages
the hidden states of the student model to be similar to those of the larger
teacher model. In this paper, we show that it is beneficial to augment
distillation with a third objective that encourages the student to imitate the
causal computation process of the teacher through interchange intervention
training (IIT). IIT pushes the student model to become a causal abstraction of
the teacher model - a simpler model with the same causal structure. IIT is
fully differentiable, easily implemented, and combines flexibly with other
objectives. Compared with standard distillation of BERT, distillation via IIT
results in lower perplexity on Wikipedia (masked language modeling) and marked
improvements on the GLUE benchmark (natural language understanding), SQuAD
(question answering), and CoNLL-2003 (named entity recognition).
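As a concrete illustration of how the three objectives fit together, below is a minimal PyTorch-style sketch of the combined training loss. This is not the authors' released code: the specific loss choices (MSE for hidden-state imitation, KL for the counterfactual match) and the weights are illustrative assumptions, and the counterfactual logits are assumed to come from running both models under the same interchange intervention (the intervention itself is sketched under the "Inducing Causal Structure for Interpretable Neural Networks" entry below).

import torch.nn.functional as F

def distillation_loss(student_logits, labels,
                      student_hidden, teacher_hidden,
                      student_cf_logits, teacher_cf_logits,
                      w_task=1.0, w_imitate=1.0, w_iit=1.0):
    """Combine the three distillation objectives on one batch.

    student_cf_logits / teacher_cf_logits are the two models' predictions
    after the same interchange intervention (activations from a second,
    "source" input swapped into the forward pass on the base input) has
    been applied to both of them.
    """
    # (1) Task objective: masked-language-modeling cross-entropy.
    task = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

    # (2) Imitation objective: pull the student's hidden states toward the
    # teacher's (plain MSE here; cosine losses over aligned layers are a
    # common alternative).
    imitate = F.mse_loss(student_hidden, teacher_hidden.detach())

    # (3) IIT objective: the student's counterfactual predictions should
    # match the teacher's counterfactual predictions.
    iit = F.kl_div(F.log_softmax(student_cf_logits, dim=-1),
                   F.softmax(teacher_cf_logits.detach(), dim=-1),
                   reduction="batchmean")

    return w_task * task + w_imitate * imitate + w_iit * iit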
Related papers
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Improving Neural Topic Models with Wasserstein Knowledge Distillation [0.8962460460173959]
We propose a knowledge distillation framework to compress a contextualized topic model without loss in topic quality.
Experiments show that the student trained with knowledge distillation achieves topic coherence much higher than that of the original student model.
arXiv Detail & Related papers (2023-03-27T16:07:44Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
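One simple way to exploit such geometry (an illustrative assumption, not necessarily the paper's exact objective) is to map student embeddings into the teacher's space with a small learned projection and match both the embeddings and the query-document similarity structure:

import torch.nn.functional as F

def geometric_distillation_loss(s_query, s_doc, t_query, t_doc, proj):
    """`proj` maps student embeddings into the teacher's embedding space
    (e.g. a learned torch.nn.Linear); all names here are illustrative."""
    sq, sd = proj(s_query), proj(s_doc)
    # Preserve the teacher's query-document similarity structure ...
    score_loss = F.mse_loss(sq @ sd.t(), (t_query @ t_doc.t()).detach())
    # ... and place student embeddings near the teacher's.
    embed_loss = F.mse_loss(sq, t_query.detach()) + F.mse_loss(sd, t_doc.detach())
    return score_loss + embed_loss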
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Inducing Causal Structure for Interpretable Neural Networks [23.68246698789134]
We present the new method of interchange intervention training (IIT).
In IIT, we (1) align variables in the causal model with representations in the neural model and (2) train the neural model to match the counterfactual behavior of the causal model on a base input.
IIT is fully differentiable, flexibly combines with other objectives, and guarantees that the target causal model is a causal abstraction of the neural model.
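The interchange operation itself can be made concrete with PyTorch forward hooks. The sketch below is an illustration, not the paper's released code; it assumes the causal variable has been aligned with a slice (dimensions dims) of one module's output tensor, and that the base and source batches have matching shapes.

import torch

def interchange(model: torch.nn.Module, layer: torch.nn.Module,
                base_inputs, source_inputs, dims):
    """Run `model` on `base_inputs`, but with the hidden dimensions in
    `dims` at `layer` overwritten by the activations that `source_inputs`
    produces at the same layer."""
    cache = {}

    def save_source(module, inputs, output):
        cache["source"] = output

    def swap_into_base(module, inputs, output):
        patched = output.clone()
        patched[..., dims] = cache["source"][..., dims]
        return patched  # returned value replaces the layer's output

    # Pass 1: record the source activations at the aligned layer.
    handle = layer.register_forward_hook(save_source)
    model(source_inputs)
    handle.remove()

    # Pass 2: rerun on the base input with the aligned slice swapped in.
    handle = layer.register_forward_hook(swap_into_base)
    counterfactual_output = model(base_inputs)
    handle.remove()
    return counterfactual_output

For distillation, the same base/source pair and the same aligned slice are interchanged in both teacher and student, and the student is trained so that its counterfactual predictions match the teacher's.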
arXiv Detail & Related papers (2021-12-01T21:07:01Z)
- Localization Distillation for Object Detection [79.78619050578997]
We propose localization distillation (LD) for object detection.
Our LD can be formulated as standard KD by adopting the general localization representation of bounding boxes.
We suggest a teacher assistant (TA) strategy to fill the possible gap between the teacher and student models.
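As a rough sketch of that formulation (an illustration, assuming the localization representation is a discrete distribution over candidate positions for each box edge), standard temperature-scaled KD can be applied directly to those distributions:

import torch.nn.functional as F

def localization_distillation_loss(student_edge_logits, teacher_edge_logits, tau=10.0):
    """KL divergence between teacher and student edge distributions.

    Both inputs are assumed to have shape (num_boxes, 4, num_bins): one
    discrete distribution per box edge (left, top, right, bottom).
    """
    s = F.log_softmax(student_edge_logits / tau, dim=-1)
    t = F.softmax(teacher_edge_logits.detach() / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau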
arXiv Detail & Related papers (2021-02-24T12:26:21Z)
- Reinforced Multi-Teacher Selection for Knowledge Distillation [54.72886763796232]
Knowledge distillation is a popular method for model compression.
Current methods assign a fixed weight to each teacher model throughout distillation, and most existing methods allocate an equal weight to every teacher.
In this paper, we observe that, due to the complexity of training examples and differences in student model capability, learning differentially from the teacher models can lead to better performance of the distilled students.
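A minimal sketch of that setting (illustrative only; the paper's reinforcement-learned selection is abstracted into the weights argument, which could just as well be uniform to recover equal weighting):

import torch.nn.functional as F

def multi_teacher_kd_loss(student_logits, teacher_logits_list, weights, tau=2.0):
    """Weighted sum of per-teacher, per-example KD terms.

    weights: iterable of tensors of shape (batch,), one per teacher, giving
    that teacher's importance on each example.
    """
    s_log_probs = F.log_softmax(student_logits / tau, dim=-1)
    loss = 0.0
    for t_logits, w in zip(teacher_logits_list, weights):
        t_probs = F.softmax(t_logits.detach() / tau, dim=-1)
        per_example = F.kl_div(s_log_probs, t_probs, reduction="none").sum(-1)
        loss = loss + (w * per_example).mean()
    return loss * tau * tau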
arXiv Detail & Related papers (2020-12-11T08:56:39Z)
- Online Knowledge Distillation via Multi-branch Diversity Enhancement [15.523646047674717]
We propose a new distillation method to enhance the diversity among multiple student models.
We use a Feature Fusion Module (FFM), which improves the performance of the attention mechanism in the network.
We also use a Diversification (CD) loss function to strengthen the differences between the student models.
arXiv Detail & Related papers (2020-10-02T05:52:12Z)
- Contrastive Distillation on Intermediate Representations for Language Model Compression [89.31786191358802]
We propose Contrastive Distillation on Intermediate Representations (CoDIR) as a principled knowledge distillation framework.
By learning to distinguish a positive sample from a large set of negative samples, CoDIR facilitates the student's exploitation of the rich information in the teacher's hidden layers.
CoDIR can be readily applied to compress large-scale language models in both pre-training and finetuning stages, and achieves superb performance on the GLUE benchmark.
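A minimal sketch of such a contrastive objective on intermediate representations (an illustration in this spirit, not CoDIR's exact formulation, which constructs its negative set more carefully): the student's pooled hidden state for an example is pulled toward the teacher's state for the same example and pushed away from the teacher's states for the other examples in the batch.

import torch
import torch.nn.functional as F

def contrastive_intermediate_loss(student_hidden, teacher_hidden, temperature=0.1):
    """InfoNCE-style loss over a batch.

    Both inputs are assumed to have shape (batch, dim), e.g. mean-pooled
    intermediate-layer states projected to a shared dimensionality.
    """
    s = F.normalize(student_hidden, dim=-1)
    t = F.normalize(teacher_hidden.detach(), dim=-1)
    logits = s @ t.t() / temperature                      # (batch, batch)
    targets = torch.arange(s.size(0), device=s.device)    # positives on diagonal
    return F.cross_entropy(logits, targets)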
arXiv Detail & Related papers (2020-09-29T17:31:43Z)