ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table
- URL: http://arxiv.org/abs/2104.08542v1
- Date: Sat, 17 Apr 2021 13:36:19 GMT
- Title: ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table
- Authors: Huifeng Guo, Wei Guo, Yong Gao, Ruiming Tang, Xiuqiang He, Wenzhi Liu
- Abstract summary: Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train these deep CTR models on a huge volume of training data efficiently.
We propose ScaleFreeCTR: a MixCache-based distributed training system for CTR models.
- Score: 23.264897780201316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because of the superior feature representation ability of deep learning,
various deep Click-Through Rate (CTR) models are deployed in commercial
systems by industrial companies. To achieve better performance, it is necessary
to train these deep CTR models on a huge volume of training data efficiently, which
makes speeding up the training process an essential problem. Different from the
models with dense training data, the training data for CTR models is usually
high-dimensional and sparse. To transform the high-dimensional sparse input
into low-dimensional dense real-value vectors, almost all deep CTR models adopt
the embedding layer, which easily reaches hundreds of GB or even TB. Since a
single GPU cannot accommodate all the embedding parameters, it is not
reasonable to rely on data-parallelism alone when performing distributed
training. Therefore, existing distributed training platforms for
recommendation adopt model-parallelism. Specifically, they use CPU (Host)
memory of servers to maintain and update the embedding parameters and utilize
GPU workers to conduct the forward and backward computations. Unfortunately, these
platforms suffer from two bottlenecks: (1) the latency of pull & push
operations between Host and GPU; (2) parameter update and synchronization on
the CPU servers. To address these bottlenecks, in this paper, we propose
ScaleFreeCTR (SFCTR): a MixCache-based distributed training system for CTR models.
Specifically, in SFCTR, we also store the huge embedding table in CPU (Host) memory
but utilize the GPU instead of the CPU to conduct embedding synchronization
efficiently. To reduce the latency of data transfer between GPU and Host as well as
among GPUs, the MixCache mechanism and the Virtual Sparse Id operation are proposed. Comprehensive
experiments and ablation studies are conducted to demonstrate the effectiveness
and efficiency of SFCTR.
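The abstract only sketches how MixCache and the Virtual Sparse Id operation work. The snippet below is a minimal, hypothetical illustration of the general idea, not the paper's implementation: keep the full embedding table in host memory, copy only the rows touched by the current batch to the device, and remap the global sparse ids to compact cache-local ids so the worker indexes a small dense table. All identifiers (`host_table`, `build_batch_cache`, `push_gradients`) are invented for illustration.

```python
import numpy as np

# Hypothetical sketch: the full embedding table lives in host (CPU) memory;
# only the rows touched by the current batch are gathered into a small cache
# table (resident on the GPU in the real system), and the global sparse ids
# are remapped to compact cache-local "virtual" ids.
VOCAB_SIZE, EMB_DIM = 1_000_000, 16
host_table = np.random.randn(VOCAB_SIZE, EMB_DIM).astype(np.float32)  # huge table in host memory

def build_batch_cache(batch_ids: np.ndarray):
    """Gather the unique rows needed by this batch and remap ids to cache-local ids."""
    unique_ids, virtual_ids = np.unique(batch_ids, return_inverse=True)
    cache_table = host_table[unique_ids]              # host -> device copy of only the needed rows
    return cache_table, virtual_ids.reshape(batch_ids.shape), unique_ids

def push_gradients(unique_ids: np.ndarray, cache_grad: np.ndarray, lr: float = 0.01):
    """Scatter the gradients of the cached rows back into the host table."""
    host_table[unique_ids] -= lr * cache_grad

# Toy usage with a skewed (Zipf-like) id distribution, as is typical for CTR data.
batch = np.random.zipf(1.3, size=(256, 8)) % VOCAB_SIZE
cache, vids, uids = build_batch_cache(batch)
embeddings = cache[vids]                              # lookup with virtual ids: shape (256, 8, 16)
push_gradients(uids, np.zeros_like(cache))            # placeholder update; shapes line up
print(cache.shape, embeddings.shape)
```

Because CTR traffic is heavily skewed, the set of unique ids per batch is far smaller than the vocabulary, which is what makes transferring only the cached rows worthwhile.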
Related papers
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
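The entry above mentions TopK compression of activations and gradients. The following is a generic magnitude-based TopK sparsifier, written as a hedged sketch rather than that paper's exact scheme; the function names and the 1% ratio are assumptions.

```python
import numpy as np

# Generic magnitude-based TopK sparsification: keep only the k entries with the
# largest absolute value and transmit them as (index, value) pairs.
def topk_compress(tensor: np.ndarray, ratio: float = 0.01):
    flat = tensor.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    return idx, flat[idx], tensor.shape            # sparse representation to transmit

def topk_decompress(idx, values, shape):
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[idx] = values
    return out.reshape(shape)

grad = np.random.randn(1024, 512).astype(np.float32)
idx, vals, shape = topk_compress(grad, ratio=0.01)
approx = topk_decompress(idx, vals, shape)
print(f"kept {vals.size} of {grad.size} entries")
```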
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- FeatureBox: Feature Engineering on GPUs for Massive-Scale Ads Systems [15.622358361804343]
We propose a novel end-to-end training framework that pipelines the feature extraction and the training on GPU servers to save the intermediate I/O of the feature extraction.
We present a light-weight GPU memory management algorithm that supports dynamic GPU memory allocation with minimal overhead.
arXiv Detail & Related papers (2022-09-26T02:31:13Z)
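FeatureBox is described above as pipelining feature extraction with training to avoid intermediate I/O. The CPU-only sketch below illustrates that producer-consumer pattern with a bounded queue; the real system runs on GPU servers and manages GPU memory, which this toy example does not attempt.

```python
import queue, threading, time

# Pipelining sketch: feature extraction for the next batch overlaps with
# training on the current one, connected by a bounded in-memory queue instead
# of intermediate files on disk.
batches = queue.Queue(maxsize=4)

def extract_features(num_batches: int):
    for i in range(num_batches):
        time.sleep(0.01)               # pretend to parse raw logs into features
        batches.put({"batch_id": i, "features": [i] * 8})
    batches.put(None)                  # sentinel: no more data

def train():
    while (batch := batches.get()) is not None:
        time.sleep(0.01)               # pretend to run forward/backward on the GPU
        print("trained on batch", batch["batch_id"])

producer = threading.Thread(target=extract_features, args=(5,))
consumer = threading.Thread(target=train)
producer.start(); consumer.start()
producer.join(); consumer.join()
```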
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model.
Merak can speed up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning [141.58588761593955]
We present a library ElegantRL-podracer for cloud-native deep reinforcement learning.
It efficiently supports millions of cores to carry out massively parallel training at multiple levels.
At a low level, each pod simulates agent-environment interactions in parallel by fully utilizing nearly 7,000 GPU cores in a single GPU.
arXiv Detail & Related papers (2021-12-11T06:31:21Z)
- PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management [19.341284825473558]
Pre-trained models (PTMs) are revolutionizing Artificial Intelligence (AI) technology.
A PTM learns general language features from vast amounts of text and is then fine-tuned using a task-specific dataset.
PatrickStar reduces memory requirements of computing platforms by using heterogeneous memory space.
arXiv Detail & Related papers (2021-08-12T15:58:12Z)
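PatrickStar's summary above refers to chunk-based management of a heterogeneous (CPU + GPU) memory space. The sketch below is a toy interpretation of that idea using PyTorch, not PatrickStar's actual API; the `ChunkManager` class and its methods are invented for illustration.

```python
import torch

# Toy sketch of chunk-based heterogeneous memory management: parameters are
# grouped into fixed-size chunks that stay in CPU memory by default and are
# moved to the GPU only while a layer needs them.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class ChunkManager:
    def __init__(self, num_params: int, chunk_size: int = 1 << 20):
        # split a flat parameter buffer into CPU-resident chunks
        self.chunks = [torch.zeros(min(chunk_size, num_params - start))
                       for start in range(0, num_params, chunk_size)]

    def fetch(self, chunk_id: int) -> torch.Tensor:
        """Move one chunk to the compute device right before it is used."""
        self.chunks[chunk_id] = self.chunks[chunk_id].to(DEVICE)
        return self.chunks[chunk_id]

    def release(self, chunk_id: int) -> None:
        """Evict the chunk back to CPU memory once the computation is done."""
        self.chunks[chunk_id] = self.chunks[chunk_id].to("cpu")

mgr = ChunkManager(num_params=3_000_000)
w = mgr.fetch(0)      # chunk 0 is on the GPU (if one is available) for this step
w.add_(0.1)           # pretend the forward/backward pass touches it
mgr.release(0)        # free device memory for the next chunk
print(len(mgr.chunks), mgr.chunks[0].device)
```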
- High-Performance Training by Exploiting Hot-Embeddings in Recommendation Systems [2.708848417398231]
Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications.
These models use massive embedding tables to store numerical representations of items' and users' categorical variables.
Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU.
This paper tries to leverage skewed embedding table accesses to efficiently use the GPU resources during training.
arXiv Detail & Related papers (2021-03-01T01:43:26Z)
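The hot-embeddings paper above builds on the observation that embedding-table accesses are heavily skewed. The snippet below is a small, assumption-laden illustration of how one might measure that skew and pick which rows to keep within a limited GPU budget; the Zipf-distributed id stream and the budget of 1,000 rows are made up for the example.

```python
import numpy as np
from collections import Counter

# Count how often each embedding row is touched over a window of traffic and
# keep only the hottest rows within a small GPU budget, leaving the long tail
# of rarely-accessed rows in host memory.
rng = np.random.default_rng(0)
access_log = rng.zipf(1.2, size=100_000) % 50_000   # made-up skewed id stream

counts = Counter(access_log.tolist())
GPU_BUDGET_ROWS = 1_000                             # made-up cache capacity
hot_ids = {eid for eid, _ in counts.most_common(GPU_BUDGET_ROWS)}

hot_hits = sum(counts[i] for i in hot_ids)
print(f"{len(hot_ids)} hot rows cover {hot_hits / access_log.size:.1%} of accesses")
```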
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
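DCT is summarized above as reducing communication during data and model parallelism; its actual thresholding rule is not reproduced here. The sketch below shows a generic magnitude-threshold message compressor in the same spirit, where the `target_ratio` parameter and the quantile-based threshold are assumptions.

```python
import numpy as np

# Generic threshold-based communication compression: only gradient entries whose
# magnitude exceeds a threshold are transmitted between workers, cutting the
# bytes exchanged during distributed training.
def threshold_message(grad: np.ndarray, target_ratio: float = 0.01):
    """Pick a threshold so roughly `target_ratio` of entries survive, then send (index, value) pairs."""
    thresh = np.quantile(np.abs(grad), 1.0 - target_ratio)
    keep = np.flatnonzero(np.abs(grad) >= thresh)
    return keep.astype(np.int32), grad.ravel()[keep], grad.shape

grad = np.random.randn(4096, 256).astype(np.float32)
idx, vals, shape = threshold_message(grad)
print(f"sent {idx.nbytes + vals.nbytes} bytes instead of {grad.nbytes}")
```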
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)