Related papers: Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems

URL: http://arxiv.org/abs/2506.11421v2
Date: Tue, 17 Jun 2025 17:08:47 GMT
Title: Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems
Authors: Junli Shao, Jing Dong, Dingzhou Wang, Kowei Shih, Dannier Li, Chengrui Zhou,
Abstract summary: Key challenge for real-time recommendation systems is how to reduce inference latency and increase system throughput without sacrificing recommendation quality.<n>This paper proposes a combined set of modeling- and system-level acceleration and optimization strategies.<n> Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput.
Score: 1.9316786310787222
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rapid growth of Internet services, recommendation systems play a central role in delivering personalized content. Faced with massive user requests and complex model architectures, the key challenge for real-time recommendation systems is how to reduce inference latency and increase system throughput without sacrificing recommendation quality. This paper addresses the high computational cost and resource bottlenecks of deep learning models in real-time settings by proposing a combined set of modeling- and system-level acceleration and optimization strategies. At the model level, we dramatically reduce parameter counts and compute requirements through lightweight network design, structured pruning, and weight quantization. At the system level, we integrate multiple heterogeneous compute platforms and high-performance inference libraries, and we design elastic inference scheduling and load-balancing mechanisms based on real-time load characteristics. Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput, offering a practical solution for deploying large-scale online recommendation services.

Related papers

Scalability Optimization in Cloud-Based AI Inference Services: Strategies for Real-Time Load Balancing and Automated Scaling [1.3689475854650441]
This study proposes a comprehensive scalability optimization framework for cloud AI inference services.<n>The proposed model is a hybrid approach that combines reinforcement learning for adaptive load distribution and deep neural networks for accurate demand forecasting.<n> Experimental results demonstrate that the proposed model enhances load balancing efficiency by 35 and reduces response delay by 28.
arXiv Detail & Related papers (2025-04-16T04:00:04Z)
The Efficiency vs. Accuracy Trade-off: Optimizing RAG-Enhanced LLM Recommender Systems Using Multi-Head Early Exit [46.37267466656765]
This paper presents an optimization framework that combines Retrieval-Augmented Generation (RAG) with an innovative multi-head early exit architecture.<n>Our experiments demonstrate how this architecture effectively decreases time without sacrificing the accuracy needed for reliable recommendation delivery.
arXiv Detail & Related papers (2025-01-04T03:26:46Z)
A Survey on Inference Optimization Techniques for Mixture of Experts Models [50.40325411764262]
Large-scale Mixture of Experts (MoE) models offer enhanced model capacity and computational efficiency through conditional computation.<n> deploying and running inference on these models presents significant challenges in computational resources, latency, and energy efficiency.<n>This survey analyzes optimization techniques for MoE models across the entire system stack.
arXiv Detail & Related papers (2024-12-18T14:11:15Z)
Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
Online Client Scheduling and Resource Allocation for Efficient Federated Edge Learning [9.451084740123198]
Federated learning (FL) enables edge devices to collaboratively train a machine learning model without sharing their raw data. However, deploying FL over mobile edge networks with constrained resources such as power, bandwidth, and suffers from high training latency and low model accuracy. This paper investigates the optimal client scheduling and resource allocation for FL over mobile edge networks under resource constraints and uncertainty.
arXiv Detail & Related papers (2024-09-29T01:56:45Z)
Dynamic Sparse Learning: A Novel Paradigm for Efficient Recommendation [20.851925464903804]
This paper introduces a novel learning paradigm, Dynamic Sparse Learning, tailored for recommendation models. DSL innovatively trains a lightweight sparse model from scratch, periodically evaluating and dynamically adjusting each weight's significance. Our experimental results underline DSL's effectiveness, significantly reducing training and inference costs while delivering comparable recommendation performance.
arXiv Detail & Related papers (2024-02-05T10:16:20Z)
Efficient Parallel Split Learning over Resource-constrained Wireless Edge Networks [44.37047471448793]
In this paper, we advocate the integration of edge computing paradigm and parallel split learning (PSL) We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training. We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
COMET: A Comprehensive Cluster Design Methodology for Distributed Deep Learning Training [42.514897110537596]
Modern Deep Learning (DL) models have grown to sizes requiring massive clusters of specialized, high-end nodes to train. designing such clusters to maximize both performance and utilization--to amortize their steep cost--is a challenging task. We introduce COMET, a holistic cluster design methodology and workflow to jointly study the impact of parallelization strategies and key cluster resource provisioning on the performance of distributed DL training.
arXiv Detail & Related papers (2022-11-30T00:32:37Z)
Online Convolutional Re-parameterization [51.97831675242173]
We present online convolutional re- parameterization (OREPA), a two-stage pipeline, aiming to reduce the huge training overhead by squeezing the complex training-time block into a single convolution. Compared with the state-of-the-art re-param models, OREPA is able to save the training-time memory cost by about 70% and accelerate the training speed by around 2x. We also conduct experiments on object detection and semantic segmentation and show consistent improvements on the downstream tasks.
arXiv Detail & Related papers (2022-04-02T09:50:19Z)
Online Learning for Orchestration of Inference in Multi-User End-Edge-Cloud Networks [3.6076391721440633]
Collaborative end-edge-cloud computing for deep learning provides a range of performance and efficiency. We propose a reinforcement-learning-based computation offloading solution that learns optimal offloading policy. Our solution provides 35% speedup in the average response time compared to the state-of-the-art with less than 0.9% accuracy reduction.
arXiv Detail & Related papers (2022-02-21T21:41:29Z)
Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems. First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch. Second, for diverse system scales and structures, we use graph neural networks to embed system state information. Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z)
Optimization-driven Machine Learning for Intelligent Reflecting Surfaces Assisted Wireless Networks [82.33619654835348]
Intelligent surface (IRS) has been employed to reshape the wireless channels by controlling individual scattering elements' phase shifts. Due to the large size of scattering elements, the passive beamforming is typically challenged by the high computational complexity. In this article, we focus on machine learning (ML) approaches for performance in IRS-assisted wireless networks.
arXiv Detail & Related papers (2020-08-29T08:39:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.