Propagate & Distill: Towards Effective Graph Learners Using
Propagation-Embracing MLPs
- URL: http://arxiv.org/abs/2311.17781v1
- Date: Wed, 29 Nov 2023 16:26:24 GMT
- Title: Propagate & Distill: Towards Effective Graph Learners Using
Propagation-Embracing MLPs
- Authors: Yong-Min Shin, Won-Yong Shin
- Abstract summary: We train a student MLP by knowledge distillation from a teacher graph neural network (GNN).
Inspired by GNNs that separate feature transformation $T$ and propagation $\Pi$, we re-frame the distillation process as making the student learn both $T$ and $\Pi$.
We propose Propagate & Distill (P&D), which propagates the output of the teacher before distillation and can be interpreted as an approximate process of the inverse propagation $\Pi^{-1}$.
- Score: 9.731314045194495
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent studies have attempted to utilize multilayer perceptrons (MLPs) to solve
semi-supervised node classification on graphs by training a student MLP via
knowledge distillation from a teacher graph neural network (GNN). While
previous studies have focused mostly on training the student MLP by matching
the output probability distributions between the teacher and student models
during distillation, it has not been systematically studied how to inject the
structural information in an explicit and interpretable manner. Inspired by
GNNs that separate feature transformation $T$ and propagation $\Pi$, we
re-frame the distillation process as making the student MLP learn both $T$ and
$\Pi$. Although this can be achieved by applying the inverse propagation
$\Pi^{-1}$ before distillation from the teacher, it still comes with a high
computational cost from large matrix multiplications during training. To solve
this problem, we propose Propagate & Distill (P&D), which propagates the output
of the teacher before distillation and can be interpreted as an approximate
process of the inverse propagation. We demonstrate that P&D can readily improve
the performance of the student MLP.
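Since the key step of P&D is a single extra propagation applied to the teacher's outputs before distillation, it can be sketched compactly. The following PyTorch snippet is a minimal illustration of that idea rather than the authors' reference implementation: it assumes an APPNP-style propagation operator (a normalized adjacency with self-loops plus a teleport term) and a standard cross-entropy-plus-KL distillation objective; names such as `propagate_teacher_outputs` and `pnd_loss` and the hyperparameter values are illustrative.

```python
import torch
import torch.nn.functional as F

def propagate_teacher_outputs(teacher_probs, adj_norm, num_steps=10, alpha=0.1):
    """Propagate the teacher GNN's soft predictions over the graph.

    teacher_probs: (N, C) class probabilities produced by the teacher GNN.
    adj_norm:      (N, N) sparse normalized adjacency with self-loops
                   (the exact propagation operator is an assumption here).
    alpha:         teleport weight that retains part of the original prediction.
    """
    z = teacher_probs
    for _ in range(num_steps):
        z = (1.0 - alpha) * torch.sparse.mm(adj_norm, z) + alpha * teacher_probs
    # Renormalize rows so the propagated targets remain probability distributions.
    return z / z.sum(dim=-1, keepdim=True).clamp_min(1e-12)

def pnd_loss(student_logits, propagated_targets, labels, train_mask, lam=1.0):
    """Cross-entropy on labeled nodes plus KL distillation to the propagated targets."""
    ce = F.cross_entropy(student_logits[train_mask], labels[train_mask])
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  propagated_targets, reduction="batchmean")
    return ce + lam * kl
```

Because the propagation can be performed once as preprocessing, the student MLP never touches the adjacency matrix during its training loop or at inference, which keeps its inference cost at MLP level while the structural information carried by $\Pi$ still reaches the student.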
Related papers
- Heuristic Methods are Good Teachers to Distill MLPs for Graph Link Prediction [61.70012924088756]
Distilling Graph Neural Network (GNN) teachers into Multi-Layer Perceptron (MLP) students has emerged as an effective approach to achieving strong performance.
However, existing distillation methods only use standard GNNs and overlook alternative teachers such as specialized models for link prediction (GNN4LP) and heuristic methods (e.g., common neighbors).
This paper first explores the impact of different teachers in GNN-to-MLP distillation and finds that stronger teachers do not always produce stronger students, while weaker heuristic methods can teach MLPs to near-GNN performance with drastically reduced training costs.
arXiv Detail & Related papers (2025-04-08T16:35:11Z)
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? [58.80794196076336]
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT).
We propose a novel distillation pipeline that transfers both responses and rewards.
Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses.
arXiv Detail & Related papers (2025-02-26T20:50:11Z)
- On Teacher Hacking in Language Model Distillation [61.19867259475047]
We investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation.
This could arise because the teacher LM is itself an imperfect approximation of the true distribution.
Online data generation techniques effectively mitigate teacher hacking.
arXiv Detail & Related papers (2025-02-04T19:26:28Z)
- Teach Harder, Learn Poorer: Rethinking Hard Sample Distillation for GNN-to-MLP Knowledge Distillation [56.912354708167534]
To bridge the gap between Graph Neural Networks (GNNs) and lightweight Multi-Layer Perceptrons (MLPs), GNN-to-MLP Knowledge Distillation (KD) proposes to distill knowledge from a well-trained teacher GNN into a student MLP.
This paper proposes a simple yet effective Hardness-aware GNN-to-MLP Distillation (HGMD) framework.
arXiv Detail & Related papers (2024-07-20T06:13:00Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
- Understanding the Gains from Repeated Self-Distillation [65.53673000292079]
Self-Distillation is a type of knowledge distillation where the student model has the same architecture as the teacher model.
We show that multi-step self-distillation can achieve a significantly lower excess risk than a single step of self-distillation.
Empirical results on regression tasks from the UCI repository show a reduction in the learnt model's risk (MSE) by up to 47%.
arXiv Detail & Related papers (2024-07-05T15:48:34Z)
- Unveiling the Unseen Potential of Graph Learning through MLPs: Effective Graph Learners Using Propagation-Embracing MLPs [9.731314045194495]
We train a student MLP by knowledge distillation from a teacher graph neural network (GNN).
Inspired by GNNs that separate feature transformation $T$ and propagation $\Pi$, we re-frame the KD process as enabling the student to explicitly learn both $T$ and $\Pi$.
We propose Propagate & Distill (P&D), which propagates the output of the teacher GNN before KD and can be interpreted as an approximate process of the inverse propagation $\Pi^{-1}$.
arXiv Detail & Related papers (2023-11-20T13:39:19Z)
- Extracting Low-/High- Frequency Knowledge from Graph Neural Networks and Injecting it into MLPs: An Effective GNN-to-MLP Distillation Framework [36.160251860788314]
We propose an efficient Full-Frequency GNN-to-MLP (FFG2M) distillation framework.
We factorize the knowledge learned by GNNs into low- and high-frequency components in the spectral domain (one plausible form of this split is sketched after this list).
We identify a potential information drowning problem for existing GNN-to-MLP distillation.
arXiv Detail & Related papers (2023-05-18T06:57:06Z)
- Robust Knowledge Distillation from RNN-T Models With Noisy Training Labels Using Full-Sum Loss [32.816725317261934]
This work studies knowledge distillation (KD) and addresses its constraints for recurrent neural network transducer (RNN-T) models.
We show that a sequence-level KD, full-sum distillation, outperforms other distillation methods for RNN-T models.
We also propose a variant of full-sum distillation that distills the sequence discriminative knowledge of the teacher leading to further improvement in WER.
arXiv Detail & Related papers (2023-03-10T14:46:23Z)
- HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers [49.79405257763856]
This paper focuses on task-agnostic distillation.
It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints.
We propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning.
arXiv Detail & Related papers (2023-02-19T17:37:24Z)
- Unbiased Knowledge Distillation for Recommendation [66.82575287129728]
Knowledge distillation (KD) has been applied in recommender systems (RS) to reduce inference latency.
Traditional solutions first train a full teacher model from the training data, and then transfer its knowledge to supervise the learning of a compact student model.
We find that such a standard distillation paradigm incurs a serious bias issue: popular items are more heavily recommended after the distillation.
arXiv Detail & Related papers (2022-11-27T05:14:03Z)
- Distilling Object Detectors with Task Adaptive Regularization [97.52935611385179]
Current state-of-the-art object detectors come at the expense of high computational costs and are hard to deploy on low-end devices.
Knowledge distillation, which aims at training a smaller student network by transferring knowledge from a larger teacher model, is one of the promising solutions for model miniaturization.
arXiv Detail & Related papers (2020-06-23T15:58:22Z)
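As referenced in the FFG2M entry above (Extracting Low-/High- Frequency Knowledge from Graph Neural Networks), the following is a minimal sketch of one plausible low-/high-frequency split of node representations. It follows the standard graph-signal-processing reading, in which neighborhood smoothing acts as a low-pass filter, and is an assumption on our part rather than the exact operators used in that paper.

```python
import torch

def frequency_decompose(h, adj_norm):
    """Split node representations into low- and high-frequency components.

    h:        (N, D) node representations, e.g. the teacher GNN's outputs.
    adj_norm: (N, N) sparse normalized adjacency with self-loops.
    """
    low = torch.sparse.mm(adj_norm, h)   # low-pass: neighborhood-smoothed signal
    high = h - low                       # high-pass: residual removed by smoothing
    return low, high
```

A distillation objective in this spirit would then match the teacher's and student's low- and high-frequency components with separate loss terms, so that the typically weaker high-frequency signal is not drowned out by the dominant low-frequency one.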