High-performance, Distributed Training of Large-scale Deep Learning
Recommendation Models
- URL: http://arxiv.org/abs/2104.05158v2
- Date: Tue, 13 Apr 2021 01:30:23 GMT
- Title: High-performance, Distributed Training of Large-scale Deep Learning
Recommendation Models
- Authors: Dheevatsa Mudigere, Yuchen Hao, Jianyu Huang, Andrew Tulloch, Srinivas
Sridharan, Xing Liu, Mustafa Ozdal, Jade Nie, Jongsoo Park, Liang Luo, Jie
Amy Yang, Leon Gao, Dmytro Ivchenko, Aarti Basant, Yuxi Hu, Jiyan Yang, Ehsan
K. Ardestani, Xiaodong Wang, Rakesh Komuravelli, Ching-Hsiang Chu, Serhat
Yilmaz, Huayu Li, Jiyuan Qian, Zhuobo Feng, Yinbin Ma, Junjie Yang, Ellie
Wen, Hong Li, Lin Yang, Chonglin Sun, Whitney Zhao, Krishna Dhulipala, KR
Kishore, Tyler Graf, Assaf Eisenman, Kiran Kumar Matam, Adi Gangidi, Pallab
Bhattacharya, Guoqiang Jerry Chen, Manoj Krishnan, Krishnakumar Nair, Petr
Lapukhov, Maxim Naumov, Lin Qiao, Mikhail Smelyanskiy, Bill Jia, Vijay Rao
- Abstract summary: Deep learning recommendation models (DLRMs) are used across many business-critical services at Facebook.
In this paper we discuss the SW/HW co-designed solution for high-performance distributed training of large-scale DLRMs.
We demonstrate the capability to train very large DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup in terms of time to solution over previous systems.
- Score: 18.63017668881868
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning recommendation models (DLRMs) are used across many
business-critical services at Facebook and are the single largest AI
application in terms of infrastructure demand in its data-centers. In this
paper we discuss the SW/HW co-designed solution for high-performance
distributed training of large-scale DLRMs. We introduce a high-performance
scalable software stack based on PyTorch and pair it with the new evolution of
Zion platform, namely ZionEX. We demonstrate the capability to train very large
DLRMs with up to 12 Trillion parameters and show that we can attain 40X speedup
in terms of time to solution over previous systems. We achieve this by (i)
designing the ZionEX platform with dedicated scale-out network, provisioned
with high bandwidth, optimal topology and efficient transport (ii) implementing
an optimized PyTorch-based training stack supporting both model and data
parallelism (iii) developing sharding algorithms capable of hierarchical
partitioning of the embedding tables along row, column dimensions and load
balancing them across multiple workers; (iv) adding high-performance core
operators while retaining flexibility to support optimizers with fully
deterministic updates (v) leveraging reduced precision communications,
multi-level memory hierarchy (HBM+DDR+SSD) and pipelining. Furthermore, we
develop and briefly comment on distributed data ingestion and other supporting
services that are required for the robust and efficient end-to-end training in
production environments.
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs)
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - DPOT: Auto-Regressive Denoising Operator Transformer for Large-Scale PDE Pre-Training [87.90342423839876]
We present a new auto-regressive denoising pre-training strategy, which allows for more stable and efficient pre-training on PDE data.
We train our PDE foundation model with up to 0.5B parameters on 10+ PDE datasets with more than 100k trajectories.
arXiv Detail & Related papers (2024-03-06T08:38:34Z) - MAD Max Beyond Single-Node: Enabling Large Machine Learning Model Acceleration on Distributed Systems [6.8519529064678375]
Training and deploying large-scale machine learning models is time-consuming, requires significant distributed computing infrastructures, and incurs high operational costs.
To minimize this outstanding communication latency and other inherent at-scale inefficiencies, we introduce an agile performance modeling framework, MAD-Max.
This framework is designed to optimize parallelization strategies and facilitate hardware-software co-design opportunities.
arXiv Detail & Related papers (2023-10-04T13:00:53Z) - PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel [19.24542340170026]
We introduce PyTorch Fully Sharded Data Parallel (FSDP) as an industry-grade solution for large model training.
FSDP provides support for significantly larger models with near-linear scalability in terms of TFLOPS.
arXiv Detail & Related papers (2023-04-21T23:52:27Z) - Slapo: A Schedule Language for Progressive Optimization of Large Deep
Learning Model Training [17.556432199389615]
Slapo is a schedule language that decouples the execution of a tensor-level operator from its arithmetic definition.
We show that Slapo can improve training throughput by up to 2.92x on a single machine with 8 NVIDIA V100 GPUs.
arXiv Detail & Related papers (2023-02-16T00:34:53Z) - Partitioning Distributed Compute Jobs with Reinforcement Learning and
Graph Neural Networks [58.720142291102135]
Large-scale machine learning models are bringing advances to a broad range of fields.
Many of these models are too large to be trained on a single machine, and must be distributed across multiple devices.
We show that maximum parallelisation is sub-optimal in relation to user-critical metrics such as throughput and blocking rate.
arXiv Detail & Related papers (2023-01-31T17:41:07Z) - A GPU-specialized Inference Parameter Server for Large-Scale Deep
Recommendation Models [6.823233135936128]
Recommendation systems are crucial for a variety of modern apps and web services, such as news feeds, social networks, e-commerce, search, etc.
To achieve peak prediction accuracy, modern recommendation models combine deep learning with terabyte-scale embedding tables to obtain a fine-grained representation of the underlying data.
Traditional inference serving architectures require deploying the whole model to standalone servers, which is infeasible at such massive scale.
arXiv Detail & Related papers (2022-10-17T07:36:18Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turning-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Deep Generative Models that Solve PDEs: Distributed Computing for
Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs)
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out of the box functionality including (a) loss integrity independent of number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z) - Large-Scale Gradient-Free Deep Learning with Recursive Local
Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources.
Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize.
We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.