Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures
- URL: http://arxiv.org/abs/2005.04680v1
- Date: Sun, 10 May 2020 14:40:16 GMT
- Title: Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures
- Authors: Dhiraj Kalamkar, Evangelos Georganas, Sudarshan Srinivasan, Jianping Chen, Mikhail Shiryaev, Alexander Heinecke
- Abstract summary: We focus on Recommender Systems, which account for most of the AI cycles in cloud computing centers.
By enabling Facebook's DLRM benchmark to run on the latest CPU hardware and software tailored for HPC, we achieve more than two orders of magnitude improvement in performance.
- Score: 56.69373580921888
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: During the last two years, the goal of many researchers has been to squeeze
the last bit of performance out of HPC systems for AI tasks. Often this
discussion is held in the context of how fast ResNet50 can be trained.
Unfortunately, ResNet50 is no longer a representative workload in 2020. Thus,
we focus on Recommender Systems, which account for most of the AI cycles in
cloud computing centers. More specifically, we focus on Facebook's DLRM
benchmark. By enabling it to run on the latest CPU hardware and software
tailored for HPC, we achieve more than two orders of magnitude improvement in
performance (110x) on a single socket compared to the reference CPU
implementation, and high scaling efficiency up to 64 sockets, while fitting
ultra-large datasets. This paper discusses the optimization techniques for the
various operators in DLRM and which components of the system are stressed by
these different operators. The presented techniques are applicable to a broader
set of DL workloads that pose the same scaling challenges/characteristics as
DLRM.
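To make concrete why DLRM's operators stress different system components, here is a minimal NumPy sketch (an illustration only, not the authors' optimized kernels): the embedding-bag lookup performs random gathers and is bound by memory bandwidth, while the MLP is a chain of dense GEMMs bound by compute.

```python
import numpy as np

def embedding_bag(table, indices, offsets):
    """Sum-reduce variable-length bags of rows from an embedding table.
    The random row gathers stress memory bandwidth, not the FMA units."""
    ends = list(offsets[1:]) + [len(indices)]
    out = np.empty((len(offsets), table.shape[1]), dtype=table.dtype)
    for b, (lo, hi) in enumerate(zip(offsets, ends)):
        out[b] = table[indices[lo:hi]].sum(axis=0)
    return out

def mlp(x, weights):
    """A chain of dense GEMMs with ReLU: compute-bound on wide layers."""
    for w in weights:
        x = np.maximum(x @ w, 0.0)
    return x

# Tiny usage example with made-up shapes:
table = np.random.rand(1000, 16).astype(np.float32)
bags = embedding_bag(table, indices=np.array([3, 7, 7, 42]), offsets=[0, 2])
dense = mlp(bags, [np.random.rand(16, 8).astype(np.float32)])
```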
Related papers
- Distributed Inference and Fine-tuning of Large Language Models Over The Internet [91.00270820533272]
Large language models (LLMs) are useful in many NLP tasks and become more capable with size.
These models require high-end hardware, making them inaccessible to most researchers.
We develop fault-tolerant inference algorithms and load-balancing protocols that automatically assign devices to maximize the total system throughput.
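As a hedged illustration of the load-balancing idea (the greedy policy and names below are assumptions, not the paper's actual protocol), a joining device can be assigned to the pipeline stage with the lowest aggregate throughput, since the slowest stage bounds end-to-end throughput:

```python
def assign_device(stage_throughput, device_speed):
    """Greedy placement: send the new device to the bottleneck stage."""
    bottleneck = min(range(len(stage_throughput)),
                     key=lambda s: stage_throughput[s])
    stage_throughput[bottleneck] += device_speed
    return bottleneck

# Example: three pipeline stages; a device with speed 4.0 joins stage 1.
stages = [10.0, 6.0, 9.0]
print(assign_device(stages, 4.0))  # -> 1; stages becomes [10.0, 10.0, 9.0]
```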
arXiv Detail & Related papers (2023-12-13T18:52:49Z)
- Spreeze: High-Throughput Parallel Reinforcement Learning Framework [19.3019166138232]
Spreeze is a lightweight parallel framework for reinforcement learning.
It efficiently utilizes the hardware resources of a single desktop machine to approach the throughput limit.
It can achieve up to a 15,000 Hz experience-sampling rate and a 370,000 Hz network-update frame rate.
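A plausible pattern behind such sampling rates is fully asynchronous actor processes feeding a single learner; the sketch below is hypothetical and not Spreeze's API:

```python
import multiprocessing as mp

def sampler(queue):
    """Stand-in environment loop: push fake transitions forever."""
    step = 0
    while True:
        queue.put((step, step + 1))  # placeholder for (state, next_state)
        step += 1

if __name__ == "__main__":
    q = mp.Queue(maxsize=10_000)
    workers = [mp.Process(target=sampler, args=(q,), daemon=True)
               for _ in range(4)]
    for w in workers:
        w.start()
    batch = [q.get() for _ in range(256)]  # the learner consumes a batch
```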
arXiv Detail & Related papers (2023-12-11T05:25:01Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
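A hedged Python sketch of this two-step decomposition (stand-in names; the actual framework generates optimized CPU code rather than interpreting Python): a batch-reduce GEMM micro-kernel plays the role of the TPP, and an explicit loop nest stands in for the declarative loop specification:

```python
import numpy as np

def brgemm_tpp(a_blocks, b_blocks, c):
    """Step 1 stand-in: the computational core as a micro-kernel (TPP)."""
    for a, b in zip(a_blocks, b_blocks):
        c += a @ b  # in-place update of the C view

def blocked_matmul(A, B, bm=64, bn=64, bk=64):
    """Step 2 stand-in: logical loops around the TPP; the real framework
    would express order, blocking, and parallelism declaratively."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, bm):
        for j in range(0, N, bn):
            brgemm_tpp([A[i:i+bm, k:k+bk] for k in range(0, K, bk)],
                       [B[k:k+bk, j:j+bn] for k in range(0, K, bk)],
                       C[i:i+bm, j:j+bn])
    return C

# Sanity check against the reference GEMM:
A, B = np.random.rand(128, 96), np.random.rand(96, 80)
assert np.allclose(blocked_matmul(A, B), A @ B)
```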
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks:
specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
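The exit-policy idea can be sketched as follows (the fixed confidence threshold and helper names are hypothetical; the paper co-optimises the policy rather than hard-coding it): run the exits in depth order and stop once the prediction is confident enough.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mess_inference(stages, exit_heads, x, conf_thresh=0.9):
    """Run backbone stages in order; return the first confident exit."""
    logits = None
    for stage, head in zip(stages, exit_heads):
        x = stage(x)                    # next chunk of the backbone
        logits = head(x)                # per-pixel class logits (C, H, W)
        conf = softmax(logits, axis=0).max(axis=0)
        if conf.mean() >= conf_thresh:  # easy sample: exit early
            break
    return logits
```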
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- High-performance, Distributed Training of Large-scale Deep Learning Recommendation Models [18.63017668881868]
Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper, we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 trillion parameters and show that we can attain a 40X speedup in time to solution over previous systems.
arXiv Detail & Related papers (2021-04-12T02:15:55Z)
- Towards High Performance Java-based Deep Learning Frameworks [0.22940141855172028]
Modern cloud services have set the demand for fast and efficient data processing.
This demand is common among numerous application domains, such as deep learning, data mining, and computer vision.
In this paper, we employ TornadoVM, a state-of-the-art programming framework, to transparently accelerate Deep Netts, a Java-based deep learning framework.
arXiv Detail & Related papers (2020-01-13T13:03:13Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.