Related papers: MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation

URL: http://arxiv.org/abs/2302.10872v1
Date: Tue, 21 Feb 2023 18:38:45 GMT
Title: MP-Rec: Hardware-Software Co-Design to Enable Multi-Path Recommendation
Authors: Samuel Hsia, Udit Gupta, Bilge Acun, Newsha Ardalani, Pan Zhong, Gu-Yeon Wei, David Brooks, Carole-Jean Wu
Abstract summary: State-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences. We show how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic and system performance.
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Deep learning recommendation systems serve personalized content under diverse tail-latency targets and input-query loads. In order to do so, state-of-the-art recommendation models rely on terabyte-scale embedding tables to learn user preferences over large bodies of content. The reliance on a fixed embedding representation of embedding tables not only imposes significant memory capacity and bandwidth requirements but also limits the scope of compatible system solutions. This paper challenges the assumption of fixed embedding representations by showing how synergies between embedding representations and hardware platforms can lead to improvements in both algorithmic and system performance. Based on our characterization of various embedding representations, we propose a hybrid embedding representation that achieves higher-quality embeddings at the cost of increased memory and compute requirements. To address the system performance challenges of the hybrid representation, we propose MP-Rec, a co-design technique that exploits heterogeneity and dynamic selection of embedding representations and underlying hardware platforms. On real system hardware, we demonstrate how matching custom accelerators, i.e., GPUs, TPUs, and IPUs, with compatible embedding representations can lead to a 16.65x speedup. Additionally, in query-serving scenarios, MP-Rec achieves 2.49x and 3.76x higher correct prediction throughput and 0.19% and 0.22% better model quality on a CPU-GPU system for the Kaggle and Terabyte datasets, respectively.
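The abstract describes MP-Rec as dynamically selecting among pairings of embedding representation and hardware platform, and it reports results as correct prediction throughput. The following is a minimal Python sketch of that kind of selection, not the paper's scheduler. The path names, the linear latency model, the profiled numbers, and the definition of correct prediction throughput as throughput multiplied by accuracy are all assumptions made for illustration. For each query, the sketch keeps only the paths that meet the latency target at the query's batch size and picks the most accurate of them.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ExecutionPath:
    """One (embedding representation, hardware platform) pairing.

    All numbers are hypothetical profiled values, not results from MP-Rec.
    """
    name: str
    accuracy: float          # fraction of correct predictions, in [0, 1]
    base_latency_ms: float   # fixed per-query overhead (e.g., launch, transfer)
    per_item_ms: float       # incremental latency per item in the query

    def latency_ms(self, batch_size: int) -> float:
        # Assumed linear latency model: overhead plus per-item cost.
        return self.base_latency_ms + self.per_item_ms * batch_size

    def throughput(self, batch_size: int) -> float:
        # Items per second when this path serves one query of batch_size items.
        return batch_size / (self.latency_ms(batch_size) / 1000.0)


def correct_prediction_throughput(path: ExecutionPath, batch_size: int) -> float:
    # Assumed definition: throughput weighted by the fraction of correct predictions.
    return path.throughput(batch_size) * path.accuracy


def select_path(
    paths: Sequence[ExecutionPath], batch_size: int, latency_target_ms: float
) -> Optional[ExecutionPath]:
    """Return the most accurate path that meets the latency target, or None."""
    feasible = [p for p in paths if p.latency_ms(batch_size) <= latency_target_ms]
    if not feasible:
        return None
    # Ties on accuracy are broken by higher correct-prediction throughput.
    return max(
        feasible,
        key=lambda p: (p.accuracy, correct_prediction_throughput(p, batch_size)),
    )


if __name__ == "__main__":
    paths = [
        ExecutionPath("table-on-cpu", accuracy=0.785, base_latency_ms=1.0, per_item_ms=0.02),
        ExecutionPath("hybrid-on-gpu", accuracy=0.790, base_latency_ms=5.0, per_item_ms=0.005),
    ]
    print(select_path(paths, batch_size=64, latency_target_ms=4.0).name)     # table-on-cpu
    print(select_path(paths, batch_size=1024, latency_target_ms=15.0).name)  # hybrid-on-gpu
```

With these made-up numbers, small queries under a tight target fall back to the cheaper table-on-CPU path, while large queries favor the hybrid-on-GPU path, whose per-item cost is lower. In toy form, this shows why a single fixed representation-hardware pairing can be suboptimal across varying query loads.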

Related papers

Image-GS: Content-Adaptive Image Representation via 2D Gaussians
We propose Image-GS, a content-adaptive image representation. Using anisotropic 2D Gaussians as the basis, Image-GS shows high memory efficiency, supports fast random access, and offers a natural level-of-detail stack. The general efficiency and fidelity of Image-GS are validated against several recent neural image representations and industry-standard texture compressors. We hope this research offers insights for developing new applications that require adaptive quality and resource control, such as machine perception, asset streaming, and content generation.
arXiv (2024-07-02T00:45:21Z)
Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
We propose Compound Text-Guided Prompt Tuning (TGP-T). It significantly reduces resource demand while achieving superior performance. It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet.
arXiv (2023-12-11T14:17:02Z)
Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation
Current prevailing Video Object Segmentation (VOS) methods usually perform dense matching between the current and reference frames after extracting features. We propose a unified VOS framework, named JointFormer, for jointly modeling three elements: features, correspondence, and a compressed memory.
arXiv (2023-08-25T17:30:08Z)
Unified Embedding: Battle-Tested Feature Representations for Web-Scale ML Systems
Learning high-quality feature embeddings efficiently and effectively is critical for the performance of web-scale machine learning systems. This work introduces a simple yet highly effective framework, Feature Multiplexing, where one single representation space is used across many different categorical features. We propose a highly practical approach called Unified Embedding with three major benefits: simplified feature configuration, strong adaptation to dynamic data distributions, and compatibility with modern hardware.
arXiv (2023-05-20T05:35:40Z)
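Feature Multiplexing, as summarized above, uses one representation space across many categorical features. A minimal Python sketch of that idea follows, under assumptions that are not taken from the Unified Embedding paper: a single randomly initialized table, a BLAKE2b hash salted with the feature name so that each feature gets its own mapping into the shared rows, and concatenation of the per-feature vectors. The class and method names are hypothetical.

```python
import hashlib
import random
from typing import Dict, List


class MultiplexedEmbedding:
    """One shared embedding table used by many categorical features.

    Simplified, hypothetical sketch: every (feature, value) pair is hashed
    into the same table, with the feature name acting as a salt.
    """

    def __init__(self, num_rows: int, dim: int, seed: int = 0) -> None:
        rng = random.Random(seed)  # deterministic initialization for the sketch
        self.num_rows = num_rows
        self.dim = dim
        self.table: List[List[float]] = [
            [rng.uniform(-0.05, 0.05) for _ in range(dim)] for _ in range(num_rows)
        ]

    def row_index(self, feature: str, value: str) -> int:
        # Salting with the feature name gives each feature its own hash function
        # over the shared rows, so the same raw value appearing in two different
        # features usually lands on different rows.
        key = f"{feature}\x1f{value}".encode("utf-8")
        digest = hashlib.blake2b(key, digest_size=8).digest()
        return int.from_bytes(digest, "little") % self.num_rows

    def lookup(self, feature: str, value: str) -> List[float]:
        # Collisions are expected: many (feature, value) pairs share fewer rows.
        return self.table[self.row_index(feature, value)]

    def embed_example(self, example: Dict[str, str]) -> List[float]:
        # Concatenate per-feature vectors in a fixed (sorted) feature order.
        out: List[float] = []
        for feature in sorted(example):
            out.extend(self.lookup(feature, example[feature]))
        return out


if __name__ == "__main__":
    emb = MultiplexedEmbedding(num_rows=1000, dim=4)
    vec = emb.embed_example({"country": "US", "device": "ios"})
    print(len(vec))  # 8
```

A keyed hash from `hashlib` is used instead of Python's built-in `hash()` because CPython randomizes string hashing per process, which would change row assignments between runs.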
Binarized Spectral Compressive Imaging
Existing deep learning models for hyperspectral image (HSI) reconstruction achieve good performance but require powerful hardware with enormous memory and computational resources. We propose a novel method, Binarized Spectral-Redistribution Network (BiSRNet). BiSRNet is derived by using the proposed techniques to binarize the base model.
arXiv (2023-05-17T15:36:08Z)
Mem-Rec: Memory Efficient Recommendation System using Alternative Representation
MEM-REC is a novel alternative representation approach for embedding tables. We show that MEM-REC can not only maintain the recommendation quality but can also improve the embedding latency.
arXiv (2023-05-12T02:36:07Z)
Energy-efficient Task Adaptation for NLP Edge Inference Leveraging Heterogeneous Memory Architectures
adapter-ALBERT is an efficient model optimization for maximal data reuse across different tasks. We demonstrate the advantage of mapping the model to a heterogeneous on-chip memory architecture by performing simulations on a validated NLP edge accelerator.
arXiv (2023-03-25T14:40:59Z)
A GPU-specialized Inference Parameter Server for Large-Scale Deep Recommendation Models
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, and search. To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data. Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv (2022-10-17T07:36:18Z)
RecPipe: Co-designing Models and Hardware to Jointly Optimize Recommendation Quality and Performance
RecPipe is a system to jointly optimize recommendation quality and inference performance. RPAccel is a custom accelerator that jointly optimizes quality, tail latency, and system throughput.
arXiv (2021-05-18T20:44:04Z)
Compatibility-aware Heterogeneous Visual Search
Existing systems use the same embedding model to compute representations (embeddings) for the query and gallery images. We address two forms of compatibility: one enforced by modifying the parameters of each model that computes the embeddings, the other by modifying the architectures that compute the embeddings. Compared to ordinary (homogeneous) visual search using the largest embedding model (paragon), CMP-NAS achieves 80-fold and 23-fold cost reductions.
arXiv (2021-05-13T02:30:50Z)
ResNeSt: Split-Attention Networks
We present a modularized architecture that applies channel-wise attention to different network branches to leverage their success in capturing cross-feature interactions and learning diverse representations. Our model, named ResNeSt, outperforms EfficientNet in accuracy and latency trade-off on image classification.
arXiv (2020-04-19T20:40:31Z)
