Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to
100 Trillion Parameters
- URL: http://arxiv.org/abs/2111.05897v1
- Date: Wed, 10 Nov 2021 19:40:25 GMT
- Title: Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to
100 Trillion Parameters
- Authors: Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He,
Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao,
Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang,
Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei
Yuan, Hai Yu, Sen Yang, Ce Zhang, Ji Liu
- Abstract summary: Deep learning models have dominated the current landscape of production recommender systems.
Recent years have witnessed an exponential growth of the model scale--from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters.
However, the training of such models is challenging even within industrial scale data centers.
- Score: 36.1028179125367
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Deep learning based models have dominated the current landscape of production
recommender systems. Furthermore, recent years have witnessed an exponential
growth of the model scale--from Google's 2016 model with 1 billion parameters
to Facebook's latest model with 12 trillion parameters. A significant quality
boost has come with each jump in model capacity, which makes us believe the
era of 100 trillion parameters is around the corner. However, the training of
such models is challenging even within industrial scale data centers. This
difficulty stems from the staggering heterogeneity of the training
computation: the model's embedding layer can account for more than 99.99% of the
total model size and is extremely memory-intensive, while the rest of the neural
network is increasingly computation-intensive. To support the training of such
huge models, an efficient distributed training system is urgently needed. In
this paper, we resolve this challenge by careful co-design of both the
optimization algorithm and the distributed system architecture. Specifically,
in order to ensure both the training efficiency and the training accuracy, we
design a novel hybrid training algorithm, where the embedding layer and the
dense neural network are handled by different synchronization mechanisms; then
we build a system called Persia (short for parallel recommendation training
system with hybrid acceleration) to support this hybrid training algorithm.
Both theoretical analysis and empirical studies at up to 100 trillion
parameters have been conducted to justify the system design and implementation of
Persia. We make Persia publicly available (at
https://github.com/PersiaML/Persia) so that anyone can easily
train a recommender model at the scale of 100 trillion parameters.
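The heterogeneity claim above is easy to quantify: at 100 trillion parameters with the embedding layer taking more than 99.99% of the model, the dense network is at most on the order of 10 billion parameters, while the embedding tables alone would occupy roughly 400 TB in fp32 (100e12 parameters x 4 bytes). The sketch below is a minimal, NumPy-only illustration of the hybrid synchronization idea described in the abstract: sparse embedding rows are updated asynchronously as each worker finishes its mini-batch, while dense-layer gradients are averaged across workers (as an all-reduce would) and applied synchronously. The model, shapes, and toy loss are illustrative assumptions, not Persia's actual implementation or API.

```python
# A minimal, self-contained NumPy sketch of the hybrid synchronization idea:
# the memory-heavy embedding table is updated asynchronously (parameter-server
# style, tolerating staleness), while the compute-heavy dense part is updated
# synchronously from gradients averaged across workers. All names, shapes, and
# the toy loss are illustrative assumptions, not Persia's implementation.
import numpy as np

rng = np.random.default_rng(0)

NUM_WORKERS = 4
EMB_ROWS, EMB_DIM = 1000, 8         # stand-in for a huge sparse embedding table
DENSE_DIM = 8                       # stand-in for the dense neural network
LR = 0.1

emb_table = rng.normal(size=(EMB_ROWS, EMB_DIM)) * 0.01   # memory-intensive part
dense_w = rng.normal(size=(EMB_DIM, DENSE_DIM)) * 0.01    # computation-intensive part

def worker_step():
    """One worker samples sparse IDs and computes toy gradients (loss = 0.5*||h||^2)."""
    ids = rng.integers(0, EMB_ROWS, size=16)
    x = emb_table[ids]                      # embedding lookup
    h = x @ dense_w                         # dense forward pass
    grad_h = h                              # dL/dh for the toy loss
    grad_dense = x.T @ grad_h / len(ids)    # gradient w.r.t. dense weights
    grad_emb = grad_h @ dense_w.T           # per-row gradients w.r.t. embeddings
    return ids, grad_emb, grad_dense

for step in range(10):
    dense_grads = []
    for _ in range(NUM_WORKERS):
        ids, grad_emb, grad_dense = worker_step()
        # Asynchronous path: each worker scatters its sparse embedding update
        # immediately, without waiting for the other workers.
        np.add.at(emb_table, ids, -LR * grad_emb)
        dense_grads.append(grad_dense)
    # Synchronous path: dense gradients are averaged across workers
    # (mimicking an all-reduce) and applied once per step.
    dense_w -= LR * np.mean(dense_grads, axis=0)

print("final dense weight norm:", np.linalg.norm(dense_w))
```

The asynchronous path tolerates staleness in the huge, memory-bound embeddings, while the synchronous path keeps the compute-bound dense part statistically equivalent to large-batch SGD, which is the efficiency/accuracy trade-off the hybrid algorithm targets.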
Related papers
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- Training Deep Surrogate Models with Large Scale Online Learning [48.7576911714538]
Deep learning algorithms have emerged as a viable alternative for obtaining fast solutions for PDEs.
Models are usually trained on synthetic data generated by solvers, stored on disk and read back for training.
The paper proposes an open-source online training framework for deep surrogate models.
arXiv Detail & Related papers (2023-06-28T12:02:27Z)
- Bayesian Generational Population-Based Training [35.70338636901159]
Population-Based Training (PBT) has led to impressive performance in several large scale settings.
We introduce two new innovations in PBT-style methods.
We show that these innovations lead to large performance gains.
arXiv Detail & Related papers (2022-07-19T16:57:38Z)
- Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel Training [23.633810934134065]
Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
arXiv Detail & Related papers (2021-10-28T04:45:55Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion Parameter Pretraining [55.16088793437898]
Training extreme-scale models requires an enormous amount of compute and memory.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory footprint requirements.
arXiv Detail & Related papers (2021-10-08T04:24:51Z)
- Large-Scale Training System for 100-Million Classification at Alibaba [43.58719630882661]
Extreme classification has become an essential topic for deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
First, we build a hybrid parallel training framework to make the training process feasible.
Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs.
arXiv Detail & Related papers (2021-02-09T06:53:31Z)
- Deep Generative Models that Solve PDEs: Distributed Computing for Training Large Data-Free Models [25.33147292369218]
Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box features, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization (sketched below), and (c) distributed higher-order optimization methods.
arXiv Detail & Related papers (2020-07-24T22:42:35Z)
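For the last entry above, a loose, framework-agnostic illustration of feature (b), and of why it yields the process-count-independent behavior described in (a), is sketched below in plain NumPy: each simulated data-parallel worker contributes its per-shard sums, the sums are combined as an all-reduce would combine them, and every worker normalizes with the resulting global statistics. Shapes and data are assumptions for illustration, not the paper's actual framework or API.

```python
# A minimal NumPy sketch of synchronized batch normalization in a data-parallel
# setting: each simulated worker computes local sums over its shard, the sums are
# "all-reduced" (summed across workers), and every worker normalizes with the
# resulting global statistics. Shapes and data are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
EPS = 1e-5

x = rng.normal(size=(32, 4))                 # full global batch, 4 features

def batch_norm(batch):
    """Plain batch normalization over a single (global) batch."""
    mean = batch.mean(axis=0)
    var = batch.var(axis=0)
    return (batch - mean) / np.sqrt(var + EPS)

def sync_batch_norm(shards):
    """Synchronized batch norm: combine per-shard sums to get global statistics."""
    total = sum(s.shape[0] for s in shards)
    # Simulated all-reduce of per-shard sums and squared sums.
    global_sum = sum(s.sum(axis=0) for s in shards)
    global_sq_sum = sum((s ** 2).sum(axis=0) for s in shards)
    mean = global_sum / total
    var = global_sq_sum / total - mean ** 2
    return [(s - mean) / np.sqrt(var + EPS) for s in shards]

shards = np.array_split(x, 4)                # 4 data-parallel workers
synced = np.concatenate(sync_batch_norm(shards))
reference = batch_norm(x)
print("max difference vs. single-process result:",
      np.abs(synced - reference).max())      # floating-point noise only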