Directed Acyclic Graph Factorization Machines for CTR Prediction via
Knowledge Distillation
- URL: http://arxiv.org/abs/2211.11159v1
- Date: Mon, 21 Nov 2022 03:09:42 GMT
- Title: Directed Acyclic Graph Factorization Machines for CTR Prediction via
Knowledge Distillation
- Authors: Zhen Tian, Ting Bai, Zibin Zhang, Zhiyuan Xu, Kangyi Lin, Ji-Rong Wen
and Wayne Xin Zhao
- Abstract summary: We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% of the FLOPs of the state-of-the-art method in both online and offline experiments.
- Score: 65.62538699160085
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: With the growth of high-dimensional sparse data in web-scale recommender
systems, the computational cost of learning high-order feature interactions in the
CTR prediction task increases substantially, which limits the use of high-order
interaction models in real industrial applications. Some recent knowledge
distillation based methods transfer knowledge from complex teacher models to
shallow student models to accelerate online model inference. However, they suffer
from a degradation of model accuracy during the knowledge distillation process,
and it is challenging to balance the efficiency and effectiveness of the shallow
student models. To address this problem, we propose a Directed Acyclic Graph
Factorization Machine (KD-DAGFM) to learn high-order feature interactions from
existing complex interaction models for CTR prediction via Knowledge Distillation.
The proposed lightweight student model, DAGFM, can learn arbitrary explicit
feature interactions from teacher networks and achieves approximately lossless
performance, as proved by a dynamic programming algorithm. In addition, an
improved general model, KD-DAGFM+, is shown to be effective in distilling both
explicit and implicit feature interactions from any complex teacher model.
Extensive experiments are conducted on four real-world datasets, including a
large-scale industrial dataset from the WeChat platform with billions of feature
dimensions. KD-DAGFM achieves the best performance with less than 21.5% of the
FLOPs of the state-of-the-art method in both online and offline experiments,
demonstrating the superiority of DAGFM in handling industrial-scale data in the
CTR prediction task. Our implementation code is available at:
https://github.com/RUCAIBox/DAGFM.
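
To make the distillation setup above concrete, the following is a minimal PyTorch sketch (not the authors' implementation) of the general teacher-student workflow the abstract describes: a lightweight factorization-machine-style student fits the click labels while matching the logits of a frozen, pre-trained teacher. The FMStudent class, the MSE-on-logits distillation term, and the alpha weighting are illustrative assumptions; the actual DAGFM student propagates feature interactions along a directed acyclic graph, and the paper's dynamic programming argument shows it can represent arbitrary explicit interactions.

    import torch
    import torch.nn as nn

    class FMStudent(nn.Module):
        """Toy factorization-machine-style student over embedded sparse fields.
        It models only second-order feature interactions; it is a stand-in for
        the DAGFM student, which propagates interactions along a DAG."""
        def __init__(self, num_features, embed_dim=16):
            super().__init__()
            self.embedding = nn.Embedding(num_features, embed_dim)
            self.linear = nn.Embedding(num_features, 1)
            self.bias = nn.Parameter(torch.zeros(1))

        def forward(self, x):                      # x: (batch, num_fields) feature ids
            emb = self.embedding(x)                # (batch, num_fields, embed_dim)
            square_of_sum = emb.sum(dim=1).pow(2)  # (batch, embed_dim)
            sum_of_square = emb.pow(2).sum(dim=1)  # (batch, embed_dim)
            pairwise = 0.5 * (square_of_sum - sum_of_square).sum(dim=1)
            first_order = self.linear(x).squeeze(-1).sum(dim=1)
            return self.bias + first_order + pairwise  # CTR logits, shape (batch,)

    def distillation_step(teacher, student, optimizer, x, y, alpha=0.5):
        """One training step: fit the click labels and mimic the frozen teacher."""
        teacher.eval()
        with torch.no_grad():
            t_logit = teacher(x)                   # teacher: any model producing logits
        s_logit = student(x)
        ctr_loss = nn.functional.binary_cross_entropy_with_logits(s_logit, y.float())
        kd_loss = nn.functional.mse_loss(s_logit, t_logit)  # match teacher predictions
        loss = alpha * ctr_loss + (1 - alpha) * kd_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

A full pipeline would first train a complex teacher (e.g., a deep interaction network) to convergence, run this distillation step over the CTR training data, and then serve only the small student online, which is where the FLOPs savings reported above come from.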
Related papers
- Feature Interaction Fusion Self-Distillation Network For CTR Prediction [14.12775753361368]
Click-Through Rate (CTR) prediction plays a vital role in recommender systems, online advertising, and search engines.
We propose FSDNet, a CTR prediction framework incorporating a plug-and-play fusion self-distillation module.
arXiv Detail & Related papers (2024-11-12T03:05:03Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods across various model architectures and sizes, while reducing training time by up to a factor of four.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence; a generic sketch of this standard loss is given after the list of related papers below.
We propose Robustness-Reinforced Knowledge Distillation (R2KD), which leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Robust Learning with Progressive Data Expansion Against Spurious Correlation [65.83104529677234]
We study the learning process of a two-layer nonlinear convolutional neural network in the presence of spurious features.
Our analysis suggests that imbalanced data groups and easily learnable spurious features can lead to the dominance of spurious features during the learning process.
We propose a new training algorithm called PDE that efficiently enhances the model's robustness for better worst-group performance.
arXiv Detail & Related papers (2023-06-08T05:44:06Z)
- Data-Free Adversarial Knowledge Distillation for Graph Neural Networks [62.71646916191515]
We propose DFAD-GNN, the first end-to-end framework for data-free adversarial knowledge distillation on graph-structured data.
Specifically, DFAD-GNN employs a generative adversarial network with three main components: a pre-trained teacher model and a student model act as two discriminators, while a generator produces training graphs used to distill knowledge from the teacher model into the student model.
Our DFAD-GNN significantly surpasses state-of-the-art data-free baselines in the graph classification task.
arXiv Detail & Related papers (2022-05-08T08:19:40Z)
- Distill2Vec: Dynamic Graph Representation Learning with Knowledge Distillation [4.568777157687959]
We propose Distill2Vec, a knowledge distillation strategy to train a compact model with a small number of trainable parameters.
Our experiments with publicly available datasets show the superiority of our proposed model over several state-of-the-art approaches.
arXiv Detail & Related papers (2020-11-11T09:49:24Z)
- Ensemble Knowledge Distillation for CTR Prediction [46.92149090885551]
We propose a new model training strategy based on knowledge distillation (KD).
KD is a teacher-student learning framework to transfer knowledge learned from a teacher model to a student model.
We also propose several novel techniques to facilitate ensemble-based CTR prediction, including teacher gating and early stopping by distillation loss.
arXiv Detail & Related papers (2020-11-08T23:37:58Z)
- MixKD: Towards Efficient Distillation of Large-scale Language Models [129.73786264834894]
We propose MixKD, a data-agnostic distillation framework, to endow the resulting model with stronger generalization ability.
We prove from a theoretical perspective that, under reasonable conditions, MixKD gives rise to a smaller gap between the generalization error and the empirical error.
Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
arXiv Detail & Related papers (2020-11-01T18:47:51Z)
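
Several of the entries above (e.g., the R2KD and Ensemble Knowledge Distillation papers) refer to the standard teacher-student distillation loss based on KL divergence. The snippet below is a generic sketch of that softened-softmax KL term for reference; the temperature value is an arbitrary illustrative choice and the formulation follows the classic recipe of Hinton et al., not the exact loss of any paper listed here.

    import torch.nn.functional as F

    def kl_distillation_loss(student_logits, teacher_logits, temperature=2.0):
        """Softened-softmax KL distillation term.
        Both logit tensors have shape (batch, num_classes)."""
        log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
        p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        # 'batchmean' averages the KL over the batch; the T^2 factor keeps
        # gradient magnitudes comparable across temperatures.
        return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

In practice this term is weighted against the task loss (e.g., binary cross-entropy for CTR prediction), in the same way as the distillation step sketched after the main abstract.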