Scene-adaptive Knowledge Distillation for Sequential Recommendation via
Differentiable Architecture Search
- URL: http://arxiv.org/abs/2107.07173v1
- Date: Thu, 15 Jul 2021 07:47:46 GMT
- Title: Scene-adaptive Knowledge Distillation for Sequential Recommendation via
Differentiable Architecture Search
- Authors: Lei Chen, Fajie Yuan, Jiaxi Yang, Min Yang, and Chengming Li
- Abstract summary: Sequential recommender systems (SRS) have become a research hotspot due to their power in modeling users' dynamic interests and sequential behavioral patterns.
To maximize model expressive ability, a default choice is to apply a larger and deeper network architecture.
We propose AdaRec, a framework which compresses knowledge of a teacher model into a student model adaptively according to its recommendation scene.
- Score: 19.798931417466456
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Sequential recommender systems (SRS) have become a research hotspot due to
their power in modeling users' dynamic interests and sequential behavioral
patterns. To maximize model expressive ability, a default choice is to apply a
larger and deeper network architecture, which, however, often brings high
network latency when generating online recommendations. Naturally, we argue
that compressing the heavy recommendation models into middle- or light-weight
neural networks is of great importance for practical production systems. To
realize such a goal, we propose AdaRec, a knowledge distillation (KD) framework
which compresses knowledge of a teacher model into a student model adaptively
according to its recommendation scene by using differentiable Neural
Architecture Search (NAS). Specifically, we introduce a target-oriented
distillation loss to guide the structure search process for finding the student
network architecture, and a cost-sensitive loss as constraints for model size,
which achieves a superior trade-off between recommendation effectiveness and
efficiency. In addition, we leverage Earth Mover's Distance (EMD) to realize
many-to-many layer mapping during knowledge distillation, which enables each
intermediate student layer to learn from other intermediate teacher layers
adaptively. Extensive experiments on real-world recommendation datasets
demonstrate that our model achieves competitive or better accuracy with notable
inference speedups compared to strong counterparts, while discovering diverse
neural architectures for sequential recommender models under different
recommendation scenes.
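To make the EMD-based many-to-many layer mapping concrete, here is a minimal sketch of how such a distillation term can be computed. The tensor shapes, the uniform layer weights, and the use of the POT library's ot.emd solver are illustrative assumptions, not details taken from the paper.
```python
import torch
import torch.nn.functional as F
import ot  # Python Optimal Transport (POT), assumed available

def emd_layer_distillation(teacher_states, student_states):
    """teacher_states / student_states: lists of [batch, seq_len, dim] hidden
    states, with student states already projected to the teacher dimension."""
    T, S = len(teacher_states), len(student_states)
    # Pairwise transfer cost between every teacher layer and every student layer.
    cost = torch.zeros(T, S)
    for i, h_t in enumerate(teacher_states):
        for j, h_s in enumerate(student_states):
            cost[i, j] = F.mse_loss(h_s, h_t)
    # Uniform mass over layers (an assumption; the weights could also be learned).
    a = torch.full((T,), 1.0 / T).double().numpy()
    b = torch.full((S,), 1.0 / S).double().numpy()
    # The optimal transport plan gives the many-to-many layer mapping weights.
    flow = torch.from_numpy(ot.emd(a, b, cost.detach().double().numpy()))
    # EMD-style distillation loss: transfer costs weighted by the flow.
    return (flow.to(cost) * cost).sum()
```
In AdaRec, such an intermediate-layer term would sit alongside the target-oriented distillation loss and the cost-sensitive model-size constraint that steer the differentiable architecture search.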
Related papers
- Visual Prompting Upgrades Neural Network Sparsification: A Data-Model Perspective [64.04617968947697]
We introduce a novel data-model co-design perspective to promote superior weight sparsity.
Specifically, customized visual prompts are mounted to upgrade neural network sparsification in our proposed VPNs framework.
arXiv Detail & Related papers (2023-12-03T13:50:24Z)
- One-for-All: Bridge the Gap Between Heterogeneous Architectures in Knowledge Distillation [69.65734716679925]
Knowledge distillation has proven to be a highly effective approach for enhancing model performance through a teacher-student training scheme.
Most existing distillation methods are designed under the assumption that the teacher and student models belong to the same model family.
We propose a simple yet effective one-for-all KD framework called OFA-KD, which significantly improves the distillation performance between heterogeneous architectures.
arXiv Detail & Related papers (2023-10-30T11:13:02Z)
- EmbedDistill: A Geometric Knowledge Distillation for Information Retrieval [83.79667141681418]
Large neural models (such as Transformers) achieve state-of-the-art performance for information retrieval (IR).
We propose a novel distillation approach that leverages the relative geometry among queries and documents learned by the large teacher model.
We show that our approach successfully distills from both dual-encoder (DE) and cross-encoder (CE) teacher models to 1/10th size asymmetric students that can retain 95-97% of the teacher performance.
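As a rough illustration of distilling "relative geometry", the sketch below trains a student to match the teacher's in-batch query-document score distribution; the function name, dot-product scoring, and KL objective are assumptions for illustration, not EmbedDistill's exact formulation.
```python
import torch
import torch.nn.functional as F

def geometry_distillation_loss(q_teacher, d_teacher, q_student, d_student, tau=1.0):
    """q_*: [batch, dim] query embeddings; d_*: [batch, dim] document embeddings."""
    # Relative geometry: every query scored against every in-batch document.
    scores_t = q_teacher @ d_teacher.T / tau
    scores_s = q_student @ d_student.T / tau
    # Push the student's score distribution toward the teacher's.
    return F.kl_div(F.log_softmax(scores_s, dim=-1),
                    F.softmax(scores_t, dim=-1),
                    reduction="batchmean")
```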
arXiv Detail & Related papers (2023-01-27T22:04:37Z)
- Directed Acyclic Graph Factorization Machines for CTR Prediction via Knowledge Distillation [65.62538699160085]
We propose a Directed Acyclic Graph Factorization Machine (KD-DAGFM) to learn the high-order feature interactions from existing complex interaction models for CTR prediction via Knowledge Distillation.
KD-DAGFM achieves the best performance with less than 21.5% FLOPs of the state-of-the-art method on both online and offline experiments.
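For background, the classic factorization machine pairwise-interaction term that such CTR models build on fits in a few lines; this is only the standard FM building block, not KD-DAGFM's DAG-structured architecture.
```python
import torch

def fm_pairwise_term(x, v):
    """x: [batch, n_fields] feature values; v: [n_fields, k] latent factors."""
    xv = x.unsqueeze(-1) * v                      # [batch, n_fields, k]
    square_of_sum = xv.sum(dim=1) ** 2            # (sum_i x_i v_i)^2
    sum_of_square = (xv ** 2).sum(dim=1)          # sum_i (x_i v_i)^2
    return 0.5 * (square_of_sum - sum_of_square).sum(dim=1)  # second-order term
```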
arXiv Detail & Related papers (2022-11-21T03:09:42Z)
- Rethinking Pareto Frontier for Performance Evaluation of Deep Neural Networks [2.167843405313757]
We re-define the efficiency measure using multi-objective optimization, combining competing variables of differing nature simultaneously in a single relative efficiency measure.
This allows ranking deep models that run efficiently on different computing hardware, and combines inference efficiency with training efficiency objectively.
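As a minimal illustration of the Pareto view of efficiency that this line of work re-examines (toy numbers and a two-objective setting assumed here, not the paper's full relative-efficiency measure):
```python
def pareto_frontier(models):
    """models: list of (name, accuracy, latency_ms); higher accuracy and lower
    latency are better."""
    frontier = []
    for name, acc, lat in models:
        # A model is dominated if some other model is at least as good on both
        # objectives and strictly better on at least one.
        dominated = any(a >= acc and l <= lat and (a > acc or l < lat)
                        for _, a, l in models)
        if not dominated:
            frontier.append((name, acc, lat))
    return frontier

print(pareto_frontier([("teacher", 0.91, 42.0),
                       ("student-A", 0.90, 11.0),
                       ("student-B", 0.87, 15.0)]))  # student-B is dominated
```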
arXiv Detail & Related papers (2022-02-18T15:58:17Z)
- Guided Sampling-based Evolutionary Deep Neural Network for Intelligent Fault Diagnosis [8.92307560991779]
We propose a novel evolutionary deep neural network framework that uses policy gradients to guide the evolution of the model architecture.
The effectiveness of the proposed framework has been validated on three datasets.
arXiv Detail & Related papers (2021-11-12T18:59:45Z)
- Follow Your Path: a Progressive Method for Knowledge Distillation [23.709919521355936]
We propose ProKT, a new model-agnostic method that projects the supervision signals of a teacher model into the student's parameter space.
Experiments on both image and text datasets show that our proposed ProKT consistently achieves superior performance compared to other existing knowledge distillation methods.
arXiv Detail & Related papers (2021-07-20T07:44:33Z)
- Hybrid Model with Time Modeling for Sequential Recommender Systems [0.15229257192293202]
Booking.com organized the WSDM WebTour 2021 Challenge, which aimed to benchmark models that recommend the final city in a trip.
We conducted several experiments to test different state-of-the-art deep learning architectures for recommender systems.
Our experimental results show that the improved NARM outperforms all other state-of-the-art benchmark methods.
arXiv Detail & Related papers (2021-03-07T19:28:22Z)
- Self-Supervised Reinforcement Learning for Recommender Systems [77.38665506495553]
We propose self-supervised reinforcement learning for sequential recommendation tasks.
Our approach augments standard recommendation models with two output layers: one for self-supervised learning and the other for RL.
Based on such an approach, we propose two frameworks, namely Self-Supervised Q-learning (SQN) and Self-Supervised Actor-Critic (SAC).
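A hedged sketch of the two-head design described above; the GRU encoder, layer sizes, and class name are illustrative assumptions, not the released SQN/SAC code.
```python
import torch
import torch.nn as nn

class TwoHeadRecommender(nn.Module):
    """Sequential encoder with a self-supervised head and an RL (Q-value) head."""
    def __init__(self, num_items, hidden=64):
        super().__init__()
        self.embed = nn.Embedding(num_items, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.ce_head = nn.Linear(hidden, num_items)  # next-item (self-supervised) logits
        self.q_head = nn.Linear(hidden, num_items)   # per-item Q-values for the RL loss

    def forward(self, item_seq):                     # item_seq: [batch, seq_len]
        h, _ = self.encoder(self.embed(item_seq))
        state = h[:, -1]                             # last hidden state as the user state
        return self.ce_head(state), self.q_head(state)
```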
arXiv Detail & Related papers (2020-06-10T11:18:57Z)
- Belief Propagation Reloaded: Learning BP-Layers for Labeling Problems [83.98774574197613]
We take one of the simplest inference methods, truncated max-product belief propagation, and add what is necessary to make it a proper component of a deep learning model.
This BP-Layer can be used as the final or an intermediate block in convolutional neural networks (CNNs).
The model is applicable to a range of dense prediction problems, is well-trainable and provides parameter-efficient and robust solutions in stereo, optical flow and semantic segmentation.
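For intuition, the classic min-sum (max-product in the log domain) recursion on a 1-D chain looks as follows; the BP-Layer wraps a truncated, differentiable variant of this kind of update inside a CNN, so this sketch is background rather than the paper's layer.
```python
import numpy as np

def max_product_chain(unary, pairwise):
    """unary: [n_nodes, n_labels] costs; pairwise: [n_labels, n_labels] transition
    costs. Returns the minimum-cost (MAP) labeling of the chain."""
    n, k = unary.shape
    msg = np.zeros((n, k))               # forward messages along the chain
    back = np.zeros((n, k), dtype=int)   # argmin pointers for backtracking
    for i in range(1, n):
        # total[m, l]: cost of label m at node i-1 followed by label l at node i
        total = (unary[i - 1] + msg[i - 1])[:, None] + pairwise
        msg[i] = total.min(axis=0)
        back[i] = total.argmin(axis=0)
    labels = np.empty(n, dtype=int)
    labels[-1] = int((unary[-1] + msg[-1]).argmin())
    for i in range(n - 1, 0, -1):        # backtrack the best labeling
        labels[i - 1] = back[i, labels[i]]
    return labels
```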
arXiv Detail & Related papers (2020-03-13T13:11:35Z)