RecPipe: Co-designing Models and Hardware to Jointly Optimize
Recommendation Quality and Performance
- URL: http://arxiv.org/abs/2105.08820v1
- Date: Tue, 18 May 2021 20:44:04 GMT
- Title: RecPipe: Co-designing Models and Hardware to Jointly Optimize
Recommendation Quality and Performance
- Authors: Udit Gupta, Samuel Hsia, Jeff (Jun) Zhang, Mark Wilkening, Javin
Pombra, Hsien-Hsin S. Lee, Gu-Yeon Wei, Carole-Jean Wu, David Brooks
- Abstract summary: RecPipe is a system to jointly optimize recommendation quality and inference performance.
RPAccel is a custom accelerator that jointly optimizes quality, tail latency, and system throughput.
- Score: 6.489720534548981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation systems must provide high quality, personalized
content under strict tail-latency targets and high system loads. This paper
presents RecPipe, a system to jointly optimize recommendation quality and
inference performance. Central to RecPipe is decomposing recommendation models
into multi-stage pipelines to maintain quality while reducing compute
complexity and exposing distinct parallelism opportunities. RecPipe implements
an inference scheduler to map multi-stage recommendation engines onto
commodity, heterogeneous platforms (e.g., CPUs, GPUs). While the hardware-aware
scheduling improves ranking efficiency, the commodity platforms suffer from
limitations that motivate specialized hardware. Thus, we design RecPipeAccel
(RPAccel), a custom accelerator that jointly optimizes quality, tail-latency,
and system throughput. RPAccel is designed specifically to exploit the
distinct design space opened via RecPipe. In particular, RPAccel processes
queries in sub-batches to pipeline recommendation stages, implements dual
static and dynamic embedding caches, a set of top-k filtering units, and a
reconfigurable systolic array. Compared to the prior art and at iso-quality, we
demonstrate that RPAccel improves latency and throughput by 3x and 6x, respectively.
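As a rough illustration of the multi-stage idea in the abstract above (a cheap model filters a large candidate set, a heavier model re-ranks the survivors, with top-k filtering between stages), here is a minimal sketch. The scoring functions, candidate counts, and k values are illustrative assumptions, not the paper's actual models or implementation:

```python
def cheap_score(user, item):
    # Stand-in for a lightweight stage-1 model (e.g., an embedding dot product).
    return (user * item) % 97

def heavy_score(user, item):
    # Stand-in for a heavier stage-2 ranking model.
    return (user * item) % 97 + (user + item) % 13

def topk(scored, k):
    # Keep the k highest-scoring (item, score) pairs.
    return sorted(scored, key=lambda p: p[1], reverse=True)[:k]

def multistage_rank(user, candidates, k1=100, k2=10):
    # Stage 1: score every candidate cheaply, keep the top k1.
    stage1 = topk([(i, cheap_score(user, i)) for i in candidates], k1)
    # Stage 2: re-rank only the survivors with the heavy model, keep the top k2.
    stage2 = topk([(i, heavy_score(user, i)) for i, _ in stage1], k2)
    return [i for i, _ in stage2]

recs = multistage_rank(user=7, candidates=range(10_000))
```

Because the heavy model only ever sees the k1 survivors rather than all candidates, the expensive compute shrinks while the final top-k2 list can still be high quality; RecPipe's scheduler and RPAccel's sub-batch pipelining exploit exactly this stage structure.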
Related papers
- Optimas: Optimizing Compound AI Systems with Globally Aligned Local Rewards [95.19837878559456]
We propose Optimas, a unified framework for effective optimization of compound systems. In each iteration, Optimas efficiently adapts the Local Reward Function (LRF) of each component to stay aligned with the global objective while simultaneously maximizing each component's local reward. We present extensive evaluations across five real-world compound systems, demonstrating that Optimas outperforms strong baselines by an average improvement of 11.92%.
arXiv Detail & Related papers (2025-07-03T07:12:48Z) - Deep Learning Model Acceleration and Optimization Strategies for Real-Time Recommendation Systems [1.9316786310787222]
A key challenge for real-time recommendation systems is reducing inference latency and increasing system throughput without sacrificing recommendation quality. This paper proposes a combined set of model- and system-level acceleration and optimization strategies. Experiments show that, while maintaining the original recommendation accuracy, our methods cut latency to less than 30% of the baseline and more than double system throughput.
arXiv Detail & Related papers (2025-06-13T02:39:21Z) - AI-Driven Optimization of Hardware Overlay Configurations [0.0]
This paper presents an AI-driven approach to optimizing FPGA overlay configurations.
By leveraging machine learning techniques, we predict the feasibility and efficiency of different configurations before hardware compilation.
arXiv Detail & Related papers (2025-03-08T22:34:47Z) - HEPPO: Hardware-Efficient Proximal Policy Optimization -- A Universal Pipelined Architecture for Generalized Advantage Estimation [0.0]
HEPPO is an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation stage in Proximal Policy Optimization.
The key innovation is our strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization.
Our single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency.
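The standardize-then-quantize step described above can be sketched as follows. The clipping range, the use of population statistics, and the function names are illustrative assumptions, not HEPPO's actual fixed-point scheme:

```python
import statistics

def standardize(values):
    # Shift to zero mean and scale to unit variance.
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values) or 1.0
    return [(v - mu) / sigma for v in values]

def quantize_u8(values, clip=4.0):
    # Clip to [-clip, clip], then map that range linearly onto 0..255.
    scale = 255.0 / (2 * clip)
    return [int(round((max(-clip, min(clip, v)) + clip) * scale)) for v in values]

rewards = [1.0, 2.0, 3.0, 4.0, 100.0]
q = quantize_u8(standardize(rewards))
```

Standardizing first bounds the dynamic range of the values, which is what makes a uniform 8-bit grid tolerable: outliers are clipped rather than forcing a coarse scale onto the bulk of the data.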
arXiv Detail & Related papers (2025-01-22T08:18:56Z) - Towards Automated Model Design on Recommender Systems [21.421326082345136]
We introduce a novel paradigm that utilizes weight sharing to explore abundant solution spaces.
From a co-design perspective, we achieve 2x FLOPs efficiency, 1.8x energy efficiency, and 1.5x performance improvements in recommender models.
arXiv Detail & Related papers (2024-11-12T06:03:47Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - DiSK: Differentially Private Optimizer with Simplified Kalman Filter for Noise Reduction [57.83978915843095]
This paper introduces DiSK, a novel framework designed to significantly enhance the performance of differentially private gradients.
To ensure practicality for large-scale training, we simplify the Kalman filtering process, minimizing its memory and computational demands.
arXiv Detail & Related papers (2024-10-04T19:30:39Z) - Generative Recommender with End-to-End Learnable Item Tokenization [51.82768744368208]
We introduce ETEGRec, a novel End-To-End Generative Recommender that unifies item tokenization and generative recommendation into a cohesive framework. ETEGRec consists of an item tokenizer and a generative recommender built on a dual encoder-decoder architecture. We develop an alternating optimization technique to ensure stable and efficient end-to-end training of the entire framework.
arXiv Detail & Related papers (2024-09-09T12:11:53Z) - Analyzing and Enhancing the Backward-Pass Convergence of Unrolled
Optimization [50.38518771642365]
The integration of constrained optimization models as components in deep networks has led to promising advances on many specialized learning tasks.
A central challenge in this setting is backpropagation through the solution of an optimization problem, which often lacks a closed form.
This paper provides theoretical insights into the backward pass of unrolled optimization, showing that it is equivalent to the solution of a linear system by a particular iterative method.
A system called Folded Optimization is proposed to construct more efficient backpropagation rules from unrolled solver implementations.
arXiv Detail & Related papers (2023-12-28T23:15:18Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on lowpower embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation [8.070008246742681]
State-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences.
We show how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic- and system performance.
arXiv Detail & Related papers (2023-02-21T18:38:45Z) - Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud
System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z) - MLComp: A Methodology for Machine Learning-based Performance Estimation
and Adaptive Selection of Pareto-Optimal Compiler Optimization Sequences [10.200899224740871]
We propose a novel Reinforcement Learning-based policy methodology for embedded software optimization.
We show that different Machine Learning models are automatically tested to choose the best-fitting one.
We also show that our framework can be trained efficiently for any target platform and application domain.
arXiv Detail & Related papers (2020-12-09T19:13:39Z) - Sapphire: Automatic Configuration Recommendation for Distributed Storage
Systems [11.713288567936875]
Tuning parameters can provide significant performance gains but is a difficult task requiring profound experience and expertise.
We propose an automatic simulation-based approach, Sapphire, to recommend optimal configurations.
Results show that Sapphire significantly boosts Ceph performance to 2.2x compared to the default configuration.
arXiv Detail & Related papers (2020-07-07T06:17:07Z) - A Generic Network Compression Framework for Sequential Recommender
Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing user's dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve 4 to 8 times compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.