Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to
  100 Trillion Parameters
- URL: http://arxiv.org/abs/2111.05897v1
- Date: Wed, 10 Nov 2021 19:40:25 GMT
- Title: Persia: A Hybrid System Scaling Deep Learning Based Recommenders up to
  100 Trillion Parameters
- Authors: Xiangru Lian, Binhang Yuan, Xuefeng Zhu, Yulong Wang, Yongjun He,
  Honghuan Wu, Lei Sun, Haodong Lyu, Chengjun Liu, Xing Dong, Yiqiao Liao,
  Mingnan Luo, Congfei Zhang, Jingru Xie, Haonan Li, Lei Chen, Renjie Huang,
  Jianying Lin, Chengchun Shu, Xuezhong Qiu, Zhishan Liu, Dongying Kong, Lei
  Yuan, Hai Yu, Sen Yang, Ce Zhang, Ji Liu
- Abstract summary: Deep learning models have dominated the current landscape of production recommender systems.
Recent years have witnessed exponential growth in model scale, from Google's 2016 model with 1 billion parameters to Facebook's latest model with 12 trillion parameters.
However, training such models is challenging even within industrial-scale data centers.
- Score: 36.1028179125367
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract:   Deep learning based models have dominated the current landscape of production
recommender systems. Furthermore, recent years have witnessed an exponential
growth of the model scale--from Google's 2016 model with 1 billion parameters
to Facebook's latest model with 12 trillion parameters. A significant quality
boost has come with each jump in model capacity, which makes us believe the
era of 100 trillion parameters is around the corner. However, the training of
such models is challenging even within industrial-scale data centers. This
difficulty stems from the staggering heterogeneity of the training
computation: the model's embedding layer can account for more than 99.99% of the
total model size and is extremely memory-intensive, while the rest of the neural
network is increasingly computation-intensive. To support the training of such
huge models, an efficient distributed training system is urgently needed. In
this paper, we resolve this challenge by careful co-design of both the
optimization algorithm and the distributed system architecture. Specifically,
in order to ensure both the training efficiency and the training accuracy, we
design a novel hybrid training algorithm, where the embedding layer and the
dense neural network are handled by different synchronization mechanisms; then
we build a system called Persia (short for parallel recommendation training
system with hybrid acceleration) to support this hybrid training algorithm.
Both theoretical demonstration and empirical study up to 100 trillion
parameters have been conducted to justify the system design and implementation of
Persia. We make Persia publicly available (at
https://github.com/PersiaML/Persia) so that anyone would be able to easily
train a recommender model at the scale of 100 trillion parameters.
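
The abstract says only that the embedding layer and the dense network use different synchronization mechanisms. The sketch below is a toy illustration of that idea, not Persia's implementation or API. It assumes, for illustration, that dense weights are updated synchronously (gradients averaged over all workers each step) while embedding rows are updated asynchronously (gradients applied a few steps late). The function `hybrid_train`, its parameters, and the toy regression task are all hypothetical.

```python
import random
from collections import deque


def hybrid_train(steps=300, workers=4, dim=4, n_ids=8, lr=0.05,
                 staleness=2, seed=0):
    """Toy hybrid training loop (illustrative assumption, not Persia's code).

    Model: prediction = dot(w, emb[id]); loss = 0.5 * (prediction - target)^2.
    Dense weights `w` get a synchronous, averaged update every step.
    Embedding rows get their gradients `staleness` steps late.
    """
    rng = random.Random(seed)
    targets = [rng.uniform(-1.0, 1.0) for _ in range(n_ids)]
    emb = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(n_ids)]
    w = [rng.uniform(0.5, 1.0) for _ in range(dim)]
    pending = deque()  # embedding gradients waiting to be applied
    losses = []

    for _ in range(steps):
        dense_grads, emb_grads, step_loss = [], [], 0.0
        for _ in range(workers):
            i = rng.randrange(n_ids)          # one sparse ID per sample
            e = emb[i]
            err = sum(wk * ek for wk, ek in zip(w, e)) - targets[i]
            step_loss += 0.5 * err * err
            dense_grads.append([err * ek for ek in e])
            emb_grads.append((i, [err * wk for wk in w]))

        # Synchronous part: average dense gradients across workers
        # (stands in for an all-reduce), then apply one update.
        for k in range(dim):
            w[k] -= lr * sum(g[k] for g in dense_grads) / workers

        # Asynchronous part: queue embedding gradients and apply the
        # batch that was produced `staleness` steps earlier.
        pending.append(emb_grads)
        if len(pending) > staleness:
            for i, g in pending.popleft():
                for k in range(dim):
                    emb[i][k] -= lr * g[k]

        losses.append(step_loss / workers)
    return losses


if __name__ == "__main__":
    history = hybrid_train()
    print("mean loss, first 20 steps:", sum(history[:20]) / 20)
    print("mean loss, last 20 steps: ", sum(history[-20:]) / 20)
```

Setting `staleness=0` makes the embedding update synchronous as well, which gives a simple baseline for comparing the two modes on this toy task.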
 
      
Related papers
        - OmniBal: Towards Fast Instruct-tuning for Vision-Language Models via   Omniverse Computation Balance [65.48009829137824]
 Large-scale 3D parallel training on vision-language instruct-tuning models leads to an imbalanced computation load across different devices.
We rebalanced the computational loads from data, model, and memory perspectives to address this issue.
Our method's efficacy and generalizability were further demonstrated across various models and datasets.
 (2024-07-30T12:02:58Z)
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
 GPU memory constraints have become a notable bottleneck in training sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
 (2024-03-17T13:06:29Z)
- Training Deep Surrogate Models with Large Scale Online Learning [48.7576911714538]
 Deep learning algorithms have emerged as a viable alternative for obtaining fast solutions for PDEs.
Models are usually trained on synthetic data generated by solvers, stored on disk and read back for training.
This work proposes an open-source online training framework for deep surrogate models.
 (2023-06-28T12:02:27Z)
- Bayesian Generational Population-Based Training [35.70338636901159]
 Population-Based Training (PBT) has led to impressive performance in several large scale settings.
We introduce two new innovations in PBT-style methods.
We show that these innovations lead to large performance gains.
 (2022-07-19T16:57:38Z)
- Decentralized Training of Foundation Models in Heterogeneous
  Environments [77.47261769795992]
 Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
 (2022-06-02T20:19:51Z)
- Colossal-AI: A Unified Deep Learning System For Large-Scale Parallel
  Training [23.633810934134065]
 Colossal-AI can achieve up to 2.76 times training speedup on large-scale models.
The system supports parallel training methods such as data, pipeline, tensor, and sequence parallelism.
 (2021-10-28T04:45:55Z)
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at
  Inference Time [57.52251547365967]
 We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks.
 (2021-10-08T17:03:34Z)
- M6-10T: A Sharing-Delinking Paradigm for Efficient Multi-Trillion
  Parameter Pretraining [55.16088793437898]
 Training extreme-scale models requires enormous amounts of compute and a large memory footprint.
We propose a simple training strategy called "Pseudo-to-Real" for large models with high memory-footprint requirements.
 (2021-10-08T04:24:51Z)
- Large-Scale Training System for 100-Million Classification at Alibaba [43.58719630882661]
 Extreme classification has become an essential topic for deep learning.
It is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer.
We build a hybrid parallel training framework to make the training process feasible.
We also propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs.
 (2021-02-09T06:53:31Z)
- Deep Generative Models that Solve PDEs: Distributed Computing for
  Training Large Data-Free Models [25.33147292369218]
 Recent progress in scientific machine learning (SciML) has opened up the possibility of training novel neural network architectures that solve complex partial differential equations (PDEs).
Here we report on a software framework for data parallel distributed deep learning that resolves the twin challenges of training these large SciML models.
Our framework provides several out-of-the-box features, including (a) loss integrity independent of the number of processes, (b) synchronized batch normalization, and (c) distributed higher-order optimization methods.
 (2020-07-24T22:42:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
       
     