RecPipe: Co-designing Models and Hardware to Jointly Optimize
Recommendation Quality and Performance
- URL: http://arxiv.org/abs/2105.08820v1
- Date: Tue, 18 May 2021 20:44:04 GMT
- Title: RecPipe: Co-designing Models and Hardware to Jointly Optimize
Recommendation Quality and Performance
- Authors: Udit Gupta, Samuel Hsia, Jeff (Jun) Zhang, Mark Wilkening, Javin
Pombra, Hsien-Hsin S. Lee, Gu-Yeon Wei, Carole-Jean Wu, David Brooks
- Abstract summary: RecPipe is a system to jointly optimize recommendation quality and inference performance.
RPAccel is a custom accelerator that jointly optimizes quality, tail-latency, and system throughput.
- Score: 6.489720534548981
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation systems must provide high quality, personalized
content under strict tail-latency targets and high system loads. This paper
presents RecPipe, a system to jointly optimize recommendation quality and
inference performance. Central to RecPipe is decomposing recommendation models
into multi-stage pipelines to maintain quality while reducing compute
complexity and exposing distinct parallelism opportunities. RecPipe implements
an inference scheduler to map multi-stage recommendation engines onto
commodity, heterogeneous platforms (e.g., CPUs, GPUs). While the hardware-aware
scheduling improves ranking efficiency, commodity platforms suffer from many
limitations, motivating specialized hardware. Thus, we design RecPipeAccel
(RPAccel), a custom accelerator that jointly optimizes quality, tail-latency,
and system throughput. RPAccel is designed specifically to exploit the
distinct design space opened via RecPipe. In particular, RPAccel processes
queries in sub-batches to pipeline recommendation stages, implements dual
static and dynamic embedding caches, a set of top-k filtering units, and a
reconfigurable systolic array. Compared to prior art and at iso-quality, we
demonstrate that RPAccel improves latency and throughput by 3x and 6x, respectively.
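The multi-stage decomposition at the heart of RecPipe can be illustrated with a minimal sketch: a cheap front-end stage scores all candidates and keeps only its top-k, which a heavier back-end stage re-ranks. The scoring functions and stage sizes below are made up for illustration; they are not the paper's models.

```python
import heapq

def cheap_score(user, item):
    # Stand-in for a lightweight first-stage model (e.g., a small embedding dot product)
    return (user * item) % 97

def heavy_score(user, item):
    # Stand-in for an expensive final-stage ranking model
    return (user * item) % 89 + 0.5 * ((user + item) % 7)

def multi_stage_rank(user, candidates, k_front=100, k_final=10):
    """Two-stage pipeline: the cheap model filters to k_front candidates,
    then the heavy model ranks only those to produce the final k_final."""
    front = heapq.nlargest(k_front, candidates, key=lambda it: cheap_score(user, it))
    return heapq.nlargest(k_final, front, key=lambda it: heavy_score(user, it))

top = multi_stage_rank(user=7, candidates=range(10_000))
print(len(top))  # 10
```

The point of the decomposition is that the expensive model runs on only k_front items instead of the full candidate set, which is what lets quality and compute cost be traded off per stage.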
Related papers
- Controllable Prompt Tuning For Balancing Group Distributional Robustness [53.336515056479705]
We introduce an optimization scheme that achieves good performance across all groups without severely sacrificing performance on any of them.
We propose Controllable Prompt Tuning (CPT), which couples our approach with prompt-tuning techniques.
On spurious correlation benchmarks, our procedures achieve state-of-the-art results across both transformer and non-transformer architectures, as well as unimodal and multimodal data.
arXiv Detail & Related papers (2024-03-05T06:23:55Z) - Analyzing and Enhancing the Backward-Pass Convergence of Unrolled
Optimization [50.38518771642365]
The integration of constrained optimization models as components in deep networks has led to promising advances on many specialized learning tasks.
A central challenge in this setting is backpropagation through the solution of an optimization problem, which often lacks a closed form.
This paper provides theoretical insights into the backward pass of unrolled optimization, showing that it is equivalent to the solution of a linear system by a particular iterative method.
A system called Folded Optimization is proposed to construct more efficient backpropagation rules from unrolled solver implementations.
arXiv Detail & Related papers (2023-12-28T23:15:18Z) - Reconfigurable Distributed FPGA Cluster Design for Deep Learning
Accelerators [59.11160990637615]
We propose a distributed system based on low-power embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
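The idea of manually allocating greater resources to the most compute-intensive layers can be sketched as a simple proportional split. The layer costs and unit counts here are hypothetical, not the paper's allocator.

```python
def allocate(layer_costs, total_units):
    """Give each pipeline stage a share of compute units proportional to its cost."""
    total_cost = sum(layer_costs)
    # Start with the floor of each proportional share, guaranteeing at least 1 unit
    shares = [max(1, int(total_units * c / total_cost)) for c in layer_costs]
    # Hand out any remaining units to the most expensive layers first
    leftover = total_units - sum(shares)
    for i in sorted(range(len(layer_costs)), key=lambda i: layer_costs[i], reverse=True):
        if leftover <= 0:
            break
        shares[i] += 1
        leftover -= 1
    return shares

print(allocate([10, 40, 30, 20], total_units=8))  # → [1, 4, 2, 1]
```

A real allocator would also respect per-device memory limits and inter-stage bandwidth, but the proportional split captures the basic pipeline-balancing intuition.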
arXiv Detail & Related papers (2023-05-24T16:08:55Z) - MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation [8.070008246742681]
State-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences.
We show how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic- and system performance.
arXiv Detail & Related papers (2023-02-21T18:38:45Z) - Tailored Learning-Based Scheduling for Kubernetes-Oriented Edge-Cloud
System [54.588242387136376]
We introduce KaiS, a learning-based scheduling framework for edge-cloud systems.
First, we design a coordinated multi-agent actor-critic algorithm to cater to decentralized request dispatch.
Second, for diverse system scales and structures, we use graph neural networks to embed system state information.
Third, we adopt a two-time-scale scheduling mechanism to harmonize request dispatch and service orchestration.
arXiv Detail & Related papers (2021-01-17T03:45:25Z) - MLComp: A Methodology for Machine Learning-based Performance Estimation
and Adaptive Selection of Pareto-Optimal Compiler Optimization Sequences [10.200899224740871]
We propose a novel Reinforcement Learning-based policy methodology for embedded software optimization.
We show that different Machine Learning models are automatically tested to choose the best-fitting one.
We also show that our framework can be trained efficiently for any target platform and application domain.
arXiv Detail & Related papers (2020-12-09T19:13:39Z) - Adaptive pruning-based optimization of parameterized quantum circuits [62.997667081978825]
Variational hybrid quantum-classical algorithms are powerful tools to maximize the use of Noisy Intermediate-Scale Quantum devices.
We propose an adaptive pruning strategy for the ansatze used in variational quantum algorithms, which we call Parameter-Efficient Circuit Training (PECT).
Instead of optimizing all of the ansatz parameters at once, PECT launches a sequence of variational algorithms.
arXiv Detail & Related papers (2020-10-01T18:14:11Z) - Multi-level Training and Bayesian Optimization for Economical
Hyperparameter Optimization [12.92634461859467]
In this paper, we develop an effective approach to reducing the total amount of required training time for Hyperparameter Optimization.
We propose a truncated additive Gaussian process model to calibrate approximate performance measurements generated by light training.
Based on the model, a sequential model-based algorithm is developed to generate the performance profile of the configuration space as well as find optimal ones.
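The screen-cheaply-then-train-fully idea can be illustrated with a generic toy loop: many configurations are ranked by a noisy light-training proxy, and only the most promising few receive a full training run. This is a simplified sketch, not the paper's truncated additive Gaussian process model; all functions and constants below are made up.

```python
import random

def light_eval(cfg):
    # Cheap proxy: a short, truncated training run (hypothetical noisy estimate)
    return -(cfg - 3.0) ** 2 + random.uniform(-0.5, 0.5)

def full_eval(cfg):
    # Expensive full training run (exact objective for this toy problem)
    return -(cfg - 3.0) ** 2

def screen_then_train(n_candidates=200, n_full=5, seed=0):
    """Screen many configs with light training; spend the full-training
    budget only on the top few ranked by the cheap proxy."""
    random.seed(seed)
    candidates = [random.uniform(0.0, 6.0) for _ in range(n_candidates)]
    shortlist = sorted(candidates, key=light_eval, reverse=True)[:n_full]
    return max(shortlist, key=full_eval)

best = screen_then_train()
```

In this toy setup the true optimum is at 3.0, so the loop should return a configuration close to it while running only five full trainings instead of two hundred.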
arXiv Detail & Related papers (2020-07-20T09:03:02Z) - Sapphire: Automatic Configuration Recommendation for Distributed Storage
Systems [11.713288567936875]
Tuning parameters can provide significant performance gains, but it is a difficult task requiring profound experience and expertise.
We propose an automatic simulation-based approach, Sapphire, to recommend optimal configurations.
Results show that Sapphire significantly boosts Ceph performance to 2.2x compared to the default configuration.
arXiv Detail & Related papers (2020-07-07T06:17:07Z) - A Generic Network Compression Framework for Sequential Recommender
Systems [71.81962915192022]
Sequential recommender systems (SRS) have become the key technology in capturing users' dynamic interests and generating high-quality recommendations.
We propose a compressed sequential recommendation framework, termed as CpRec, where two generic model shrinking techniques are employed.
Through extensive ablation studies, we demonstrate that the proposed CpRec can achieve 4x to 8x compression rates on real-world SRS datasets.
arXiv Detail & Related papers (2020-04-21T08:40:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.