Low-Precision Hardware Architectures Meet Recommendation Model Inference
at Scale
- URL: http://arxiv.org/abs/2105.12676v1
- Date: Wed, 26 May 2021 16:42:33 GMT
- Title: Low-Precision Hardware Architectures Meet Recommendation Model Inference
at Scale
- Authors: Zhaoxia (Summer) Deng, Jongsoo Park, Ping Tak Peter Tang, Haixin Liu,
Jie (Amy) Yang, Hector Yuen, Jianyu Huang, Daya Khudia, Xiaohan Wei, Ellie
Wen, Dhruv Choudhary, Raghuraman Krishnamoorthi, Carole-Jean Wu, Satish
Nadathur, Changkyu Kim, Maxim Naumov, Sam Naghshineh, Mikhail Smelyanskiy
- Abstract summary: We share in this paper our search strategies to adapt reference recommendation models to low-precision hardware.
We also discuss the design and development of a tool chain so as to maintain our models' accuracy throughout their lifespan.
We believe these lessons from the trenches promote better co-design between hardware architecture and software engineering.
- Score: 11.121380180647769
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The tremendous success of machine learning (ML) and the unabated growth in ML
model complexity have motivated many ML-specific designs in both CPU and accelerator
architectures to speed up model inference. While these architectures are
diverse, highly optimized low-precision arithmetic is a component shared by
most. Impressive compute throughputs are indeed often exhibited by these
architectures on benchmark ML models. Nevertheless, production models such as
recommendation systems important to Facebook's personalization services are
demanding and complex: These systems must serve billions of users per month
responsively with low latency while maintaining high prediction accuracy,
notwithstanding computations with many tens of billions of parameters per
inference. Do these low-precision architectures work well with our production
recommendation systems? They do. But not without significant effort. We share
in this paper our search strategies to adapt reference recommendation models to
low-precision hardware, our optimization of low-precision compute kernels, and
the design and development of a tool chain so as to maintain our models' accuracy
throughout their lifespan during which topic trends and users' interests
inevitably evolve. Practicing these low-precision technologies helped us save
datacenter capacities while deploying models with up to 5X complexity that
would otherwise not be deployed on traditional general-purpose CPUs. We believe
these lessons from the trenches promote better co-design between hardware
architecture and software engineering and advance the state of the art of ML in
industry.
Related papers
- Adaptable Embeddings Network (AEN) [49.1574468325115]
We introduce Adaptable Embeddings Networks (AEN), a novel dual-encoder architecture using Kernel Density Estimation (KDE).
AEN allows for runtime adaptation of classification criteria without retraining and is non-autoregressive.
The architecture's ability to preprocess and cache condition embeddings makes it ideal for edge computing applications and real-time monitoring systems.
arXiv Detail & Related papers (2024-11-21T02:15:52Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- Inference Optimization of Foundation Models on AI Accelerators [68.24450520773688]
Powerful foundation models, including large language models (LLMs), with Transformer architectures have ushered in a new era of Generative AI.
As the number of model parameters reaches hundreds of billions, their deployment incurs prohibitive inference costs and high latency in real-world scenarios.
This tutorial offers a comprehensive discussion on complementary inference optimization techniques using AI accelerators.
arXiv Detail & Related papers (2024-07-12T09:24:34Z)
- LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z)
- Mechanistic Design and Scaling of Hybrid Architectures [114.3129802943915]
We identify and test new hybrid architectures constructed from a variety of computational primitives.
We experimentally validate the resulting architectures via an extensive compute-optimal and a new state-optimal scaling law analysis.
We find MAD synthetics to correlate with compute-optimal perplexity, enabling accurate evaluation of new architectures.
arXiv Detail & Related papers (2024-03-26T16:33:12Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Model-to-Circuit Cross-Approximation For Printed Machine Learning Classifiers [4.865819809855699]
Printed electronics (PE) promises on-demand fabrication, low non-recurring engineering costs, and sub-cent fabrication costs.
Large feature sizes in PE prohibit the realization of complex ML models in PE, even with bespoke architectures.
We present an automated, cross-layer approximation framework tailored to bespoke architectures that enable complex ML models in PE.
arXiv Detail & Related papers (2023-03-14T22:11:34Z)
- Statistical Hardware Design With Multi-model Active Learning [1.7596501992526474]
We propose a model-based active learning approach to solve the problem of designing efficient hardware.
Our proposed method provides hardware models that are sufficiently accurate to perform design space exploration as well as performance prediction simultaneously.
arXiv Detail & Related papers (2023-03-14T16:37:38Z)
- Cross-Layer Approximation For Printed Machine Learning Circuits [4.865819809855699]
We propose and implement a cross-layer approximation, tailored for bespoke machine learning (ML) architectures in printed electronics (PE).
Our results demonstrate that our cross approximation delivers optimal designs that, compared to the state-of-the-art exact designs, feature 47% and 44% average area and power reduction, respectively, and less than 1% accuracy loss.
arXiv Detail & Related papers (2022-03-11T13:41:15Z)
- Hardware Acceleration of Sparse and Irregular Tensor Computations of ML Models: A Survey and Insights [18.04657939198617]
This paper provides a comprehensive survey on the efficient execution of sparse and irregular tensor computations of machine learning models on hardware accelerators.
It examines different hardware designs and acceleration techniques and analyzes them in terms of hardware and execution costs.
The takeaways from this paper include: understanding the key challenges in accelerating sparse, irregular-shaped, and quantized tensors.
arXiv Detail & Related papers (2020-07-02T04:08:40Z)
- Tidying Deep Saliency Prediction Architectures [6.613005108411055]
In this paper, we identify four key components of saliency models, i.e., input features, multi-level integration, readout architecture, and loss functions.
We propose two novel end-to-end architectures called SimpleNet and MDNSal, which are neater, minimal, more interpretable and achieve state of the art performance on public saliency benchmarks.
arXiv Detail & Related papers (2020-03-10T19:34:49Z)