Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation
- URL: http://arxiv.org/abs/2505.10822v1
- Date: Fri, 16 May 2025 03:37:40 GMT
- Title: Distilled Circuits: A Mechanistic Study of Internal Restructuring in Knowledge Distillation
- Authors: Reilly Haskins, Benjamin Adams
- Abstract summary: We analyze how internal circuits, representations, and activation patterns differ between teacher and student. We find that student models reorganize, compress, and discard teacher components, often resulting in stronger reliance on fewer individual components.
- Score: 0.3683202928838613
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation compresses a larger neural model (teacher) into smaller, faster student models by training them to match teacher outputs. However, the internal computational transformations that occur during this process remain poorly understood. We apply techniques from mechanistic interpretability to analyze how internal circuits, representations, and activation patterns differ between teacher and student. Focusing on GPT2-small and its distilled counterpart DistilGPT2, we find that student models reorganize, compress, and discard teacher components, often resulting in stronger reliance on fewer individual components. To quantify functional alignment beyond output similarity, we introduce an alignment metric based on influence-weighted component similarity, validated across multiple tasks. Our findings reveal that while knowledge distillation preserves broad functional behaviors, it also causes significant shifts in internal computation, with important implications for the robustness and generalization capacity of distilled models.
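The abstract describes the alignment metric only at a high level (influence-weighted component similarity), so the sketch below is an illustrative approximation rather than the paper's method. It assumes the TransformerLens library with support for the gpt2 and distilgpt2 checkpoints, uses each attention head's output norm as a stand-in for influence, and greedily matches every teacher head to its most similar student head by cosine similarity; the paper's actual influence measure, component set, and matching procedure may differ.

```python
# Illustrative sketch only: an influence-weighted similarity between the attention
# heads of GPT2-small (teacher) and DistilGPT2 (student). The influence proxy
# (output norm) and the matching rule (greedy cosine) are assumptions for this
# example, not the paper's definitions.
import torch
from transformer_lens import HookedTransformer

torch.set_grad_enabled(False)

teacher = HookedTransformer.from_pretrained("gpt2")        # 12 layers x 12 heads
student = HookedTransformer.from_pretrained("distilgpt2")  # 6 layers x 12 heads
for m in (teacher, student):
    m.set_use_attn_result(True)  # expose per-head contributions to the residual stream

prompt = "When Mary and John went to the store, John gave a drink to"
tokens = teacher.to_tokens(prompt)  # GPT2-small and DistilGPT2 share a tokenizer

def head_outputs(model, tokens):
    """Stack every head's residual-stream contribution: [n_layers * n_heads, pos, d_model]."""
    _, cache = model.run_with_cache(tokens)
    outs = []
    for layer in range(model.cfg.n_layers):
        result = cache["result", layer][0]        # [pos, n_heads, d_model]
        outs.append(result.permute(1, 0, 2))      # [n_heads, pos, d_model]
    return torch.cat(outs, dim=0)

t_out = head_outputs(teacher, tokens)   # [144, pos, 768]
s_out = head_outputs(student, tokens)   # [72, pos, 768]

# Influence proxy: relative L2 norm of each teacher head's output on this prompt.
t_infl = t_out.flatten(1).norm(dim=-1)
t_infl = t_infl / t_infl.sum()

# Component similarity: cosine between flattened head outputs; each teacher head
# is matched to its most similar student head, then matches are influence-weighted.
t_dir = torch.nn.functional.normalize(t_out.flatten(1), dim=-1)
s_dir = torch.nn.functional.normalize(s_out.flatten(1), dim=-1)
best_match = (t_dir @ s_dir.T).max(dim=-1).values   # [144]
alignment = (t_infl * best_match).sum()
print(f"Influence-weighted head alignment (proxy): {alignment.item():.3f}")
```

Matching 144 teacher heads onto 72 student heads necessarily forces many-to-one correspondences, which is one concrete way the compression and reorganization described in the abstract can be probed.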
Related papers
- Empirical Evaluation of Knowledge Distillation from Transformers to Subquadratic Language Models [3.287942619833188]
We systematically evaluate the transferability of knowledge distillation from a Transformer teacher to nine subquadratic student architectures. Our study aims to determine which subquadratic model best aligns with the teacher's learned representations and how different architectural constraints influence the distillation process.
arXiv Detail & Related papers (2025-04-19T17:49:52Z)
- Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that only task-relevant representations are distilled. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z)
- Rethinking Associative Memory Mechanism in Induction Head [37.93644115914534]
This paper investigates how a two-layer transformer thoroughly captures in-context information and balances it with pretrained bigram knowledge in next-token prediction. We theoretically analyze the representation of weight matrices in attention layers and the resulting logits when a transformer is given prompts generated by a bigram model.
arXiv Detail & Related papers (2024-12-16T05:33:05Z)
- Quantifying Knowledge Distillation Using Partial Information Decomposition [14.82261635235695]
We use Partial Information Decomposition to quantify and explain the transferred knowledge and the knowledge left to distill. We propose a novel multi-level optimization to incorporate redundant information as a regularizer, leading to our framework of Redundant Information Distillation (RID).
arXiv Detail & Related papers (2024-11-12T02:12:41Z)
- Learning Lightweight Object Detectors via Multi-Teacher Progressive Distillation [56.053397775016755]
We propose a sequential approach to knowledge distillation that progressively transfers the knowledge of a set of teacher detectors to a given lightweight student.
To the best of our knowledge, we are the first to successfully distill knowledge from Transformer-based teacher detectors to convolution-based students.
arXiv Detail & Related papers (2023-08-17T17:17:08Z)
- Supervision Complexity and its Role in Knowledge Distillation [65.07910515406209]
We study the generalization behavior of a distilled student.
The framework highlights a delicate interplay among the teacher's accuracy, the student's margin with respect to the teacher predictions, and the complexity of the teacher predictions.
We demonstrate the efficacy of online distillation and validate the theoretical findings on a range of image classification benchmarks and model architectures.
arXiv Detail & Related papers (2023-01-28T16:34:47Z)
- DETRDistill: A Universal Knowledge Distillation Framework for DETR-families [11.9748352746424]
Transformer-based detectors (DETRs) have attracted great attention due to their sparse training paradigm and the removal of post-processing operations.
Knowledge distillation (KD) can be employed to compress these large models by constructing a universal teacher-student learning framework.
arXiv Detail & Related papers (2022-11-17T13:35:11Z)
- On the benefits of knowledge distillation for adversarial robustness [53.41196727255314]
We show that knowledge distillation can be used directly to boost the performance of state-of-the-art models in adversarial robustness.
We present Adversarial Knowledge Distillation (AKD), a new framework to improve a model's robust performance.
arXiv Detail & Related papers (2022-03-14T15:02:13Z)
- Teacher's pet: understanding and mitigating biases in distillation [61.44867470297283]
Several works have shown that distillation significantly boosts the student's overall performance.
However, are these gains uniform across all data subgroups?
We show that distillation can harm performance on certain subgroups.
We present techniques which soften the teacher influence for subgroups where it is less reliable.
arXiv Detail & Related papers (2021-06-19T13:06:25Z)
- Structural Knowledge Distillation: Tractably Distilling Information for Structured Predictor [70.71045044998043]
The objective function of knowledge distillation is typically the cross-entropy between the teacher's and the student's output distributions; a minimal sketch of this standard token-level objective appears after this entry.
For structured prediction problems, the output space is exponential in size.
We show the tractability and empirical effectiveness of structural knowledge distillation between sequence labeling and dependency parsing models.
arXiv Detail & Related papers (2020-10-10T14:19:25Z)
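For reference, the Structural Knowledge Distillation entry above mentions the standard objective of knowledge distillation: cross-entropy between the teacher's and the student's output distributions. The sketch below shows the common token-level (unstructured) form with temperature softening; the temperature value and the t^2 gradient scaling follow standard practice (Hinton et al., 2015) and are not taken from that paper, and the structured factorization it proposes is not shown.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Cross-entropy between the teacher's and the student's temperature-softened
    output distributions (equals KL divergence up to a constant in the teacher)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # The t**2 factor keeps gradient magnitudes comparable across temperatures.
    return -(teacher_probs * student_log_probs).sum(dim=-1).mean() * (t ** 2)

# Toy usage: random logits standing in for teacher and student outputs.
teacher_logits = torch.randn(8, 50257)   # batch of 8 over a GPT-2-sized vocabulary
student_logits = torch.randn(8, 50257)
print(kd_loss(student_logits, teacher_logits).item())
```

In practice this soft-target term is usually combined with the ordinary hard-label cross-entropy through a mixing weight.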