ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table
- URL: http://arxiv.org/abs/2104.08542v1
- Date: Sat, 17 Apr 2021 13:36:19 GMT
- Title: ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table
- Authors: Huifeng Guo, Wei Guo, Yong Gao, Ruiming Tang, Xiuqiang He, Wenzhi Liu
- Abstract summary: Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train these deep CTR models on a huge volume of training data efficiently.
We propose ScaleFreeCTR: a MixCache-based distributed training system for CTR models.
- Score: 23.264897780201316
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Because of the superior feature representation ability of deep learning,
various deep Click-Through Rate (CTR) models are deployed in commercial
systems by industrial companies. To achieve better performance, it is necessary
to train these deep CTR models on a huge volume of training data efficiently, which
makes speeding up the training process an essential problem. Different from the
models with dense training data, the training data for CTR models is usually
high-dimensional and sparse. To transform the high-dimensional sparse input
into low-dimensional dense real-value vectors, almost all deep CTR models adopt
the embedding layer, which easily reaches hundreds of GB or even TB. Since a
single GPU cannot accommodate all the embedding parameters, it is not
reasonable to rely on data-parallelism alone when performing distributed
training. Therefore, existing distributed training platforms for
recommendation adopt model-parallelism. Specifically, they use CPU (Host)
memory of servers to maintain and update the embedding parameters and utilize
GPU workers to conduct the forward and backward computations. Unfortunately, these
platforms suffer from two bottlenecks: (1) the latency of pull & push
operations between Host and GPU; (2) parameter update and synchronization on
the CPU servers. To address these bottlenecks, in this paper, we propose
ScaleFreeCTR (SFCTR): a MixCache-based distributed training system for CTR models.
Specifically, in SFCTR, we also store the huge embedding table in CPU (Host) memory
but utilize the GPU instead of the CPU to conduct embedding synchronization
efficiently. To reduce the latency of data transfer between GPU and Host as well as
among GPUs, the MixCache mechanism and the Virtual Sparse Id operation are proposed. Comprehensive
experiments and ablation studies are conducted to demonstrate the effectiveness
and efficiency of SFCTR.
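The abstract only sketches how MixCache and the Virtual Sparse Id operation work. The snippet below is a minimal, hypothetical illustration of the general idea, not the paper's implementation: keep the full embedding table in host memory, copy only the rows touched by the current batch to the device, and remap the global sparse ids to compact cache-local ids so the worker indexes a small dense table. All identifiers (`host_table`, `build_batch_cache`, `push_gradients`) are invented for illustration.

```python
import numpy as np

# Hypothetical sketch: the full embedding table lives in host (CPU) memory;
# only the rows touched by the current batch are gathered into a small cache
# table (resident on the GPU in the real system), and the global sparse ids
# are remapped to compact cache-local "virtual" ids.
VOCAB_SIZE, EMB_DIM = 1_000_000, 16
host_table = np.random.randn(VOCAB_SIZE, EMB_DIM).astype(np.float32)  # huge table in host memory

def build_batch_cache(batch_ids: np.ndarray):
    """Gather the unique rows needed by this batch and remap ids to cache-local ids."""
    unique_ids, virtual_ids = np.unique(batch_ids, return_inverse=True)
    cache_table = host_table[unique_ids]              # host -> device copy of only the needed rows
    return cache_table, virtual_ids.reshape(batch_ids.shape), unique_ids

def push_gradients(unique_ids: np.ndarray, cache_grad: np.ndarray, lr: float = 0.01):
    """Scatter the gradients of the cached rows back into the host table."""
    host_table[unique_ids] -= lr * cache_grad

# Toy usage with a skewed (Zipf-like) id distribution, as is typical for CTR data.
batch = np.random.zipf(1.3, size=(256, 8)) % VOCAB_SIZE
cache, vids, uids = build_batch_cache(batch)
embeddings = cache[vids]                              # lookup with virtual ids: shape (256, 8, 16)
push_gradients(uids, np.zeros_like(cache))            # placeholder update; shapes line up
print(cache.shape, embeddings.shape)
```

Because CTR traffic is heavily skewed, the set of unique ids per batch is far smaller than the vocabulary, which is what makes transferring only the cached rows worthwhile.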
Related papers
- Activations and Gradients Compression for Model-Parallel Training [85.99744701008802]
We study how simultaneous compression of activations and gradients in a model-parallel distributed training setup affects convergence.
We find that gradients require milder compression rates than activations.
Experiments also show that models trained with TopK perform well only when compression is also applied during inference.
arXiv Detail & Related papers (2024-01-15T15:54:54Z)
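The entry above mentions TopK compression of activations and gradients. The following is a generic magnitude-based TopK sparsifier, written as a hedged sketch rather than that paper's exact scheme; the function names and the 1% ratio are assumptions.

```python
import numpy as np

# Generic magnitude-based TopK sparsification: keep only the k entries with the
# largest absolute value and transmit them as (index, value) pairs.
def topk_compress(tensor: np.ndarray, ratio: float = 0.01):
    flat = tensor.ravel()
    k = max(1, int(flat.size * ratio))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    return idx, flat[idx], tensor.shape            # sparse representation to transmit

def topk_decompress(idx, values, shape):
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[idx] = values
    return out.reshape(shape)

grad = np.random.randn(1024, 512).astype(np.float32)
idx, vals, shape = topk_compress(grad, ratio=0.01)
approx = topk_decompress(idx, vals, shape)
print(f"kept {vals.size} of {grad.size} entries")
```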
- Efficient Asynchronous Federated Learning with Sparsification and Quantization [55.6801207905772]
Federated Learning (FL) is attracting increasing attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally relies on a parameter server and a large number of edge devices throughout the model training process.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- FeatureBox: Feature Engineering on GPUs for Massive-Scale Ads Systems [15.622358361804343]
We propose a novel end-to-end training framework that pipelines the feature extraction and the training on GPU servers to save the intermediate I/O of the feature extraction.
We present a light-weight GPU memory management algorithm that supports dynamic GPU memory allocation with minimal overhead.
arXiv Detail & Related papers (2022-09-26T02:31:13Z)
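FeatureBox is described above as pipelining feature extraction with training to avoid intermediate I/O. The CPU-only sketch below illustrates that producer-consumer pattern with a bounded queue; the real system runs on GPU servers and manages GPU memory, which this toy example does not attempt.

```python
import queue, threading, time

# Pipelining sketch: feature extraction for the next batch overlaps with
# training on the current one, connected by a bounded in-memory queue instead
# of intermediate files on disk.
batches = queue.Queue(maxsize=4)

def extract_features(num_batches: int):
    for i in range(num_batches):
        time.sleep(0.01)               # pretend to parse raw logs into features
        batches.put({"batch_id": i, "features": [i] * 8})
    batches.put(None)                  # sentinel: no more data

def train():
    while (batch := batches.get()) is not None:
        time.sleep(0.01)               # pretend to run forward/backward on the GPU
        print("trained on batch", batch["batch_id"])

producer = threading.Thread(target=extract_features, args=(5,))
consumer = threading.Thread(target=train)
producer.start(); consumer.start()
producer.join(); consumer.join()
```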
- Merak: An Efficient Distributed DNN Training Framework with Automated 3D Parallelism for Giant Foundation Models [14.903847751841221]
We propose Merak, an automated 3D parallelism deep learning training framework with high resource utilization.
Merak automatically deploys with an automatic model partitioner, which uses a graph sharding algorithm on a proxy representation of the model.
Merak can speed up training over state-of-the-art 3D parallelism frameworks for models with 1.5, 2.5, 8.3, and 20 billion parameters by up to 1.42X, 1.39X, 1.43X, and 1.61X, respectively.
arXiv Detail & Related papers (2022-06-10T09:15:48Z)
- Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising [32.5337643852876]
Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
arXiv Detail & Related papers (2022-01-05T18:09:11Z)
- ElegantRL-Podracer: Scalable and Elastic Library for Cloud-Native Deep Reinforcement Learning [141.58588761593955]
We present a library ElegantRL-podracer for cloud-native deep reinforcement learning.
It efficiently supports millions of cores to carry out massively parallel training at multiple levels.
At a low level, each pod simulates agent-environment interactions in parallel by fully utilizing nearly 7,000 GPU cores in a single GPU.
arXiv Detail & Related papers (2021-12-11T06:31:21Z)
- PatrickStar: Parallel Training of Pre-trained Models via a Chunk-based Memory Management [19.341284825473558]
Pre-trained models (PTMs) are revolutionizing Artificial Intelligence (AI) technology.
A PTM learns general language features from vast amounts of text and is then fine-tuned using a task-specific dataset.
PatrickStar reduces memory requirements of computing platforms by using heterogeneous memory space.
arXiv Detail & Related papers (2021-08-12T15:58:12Z)
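PatrickStar's summary above refers to chunk-based management of a heterogeneous (CPU + GPU) memory space. The sketch below is a toy interpretation of that idea using PyTorch, not PatrickStar's actual API; the `ChunkManager` class and its methods are invented for illustration.

```python
import torch

# Toy sketch of chunk-based heterogeneous memory management: parameters are
# grouped into fixed-size chunks that stay in CPU memory by default and are
# moved to the GPU only while a layer needs them.
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

class ChunkManager:
    def __init__(self, num_params: int, chunk_size: int = 1 << 20):
        # split a flat parameter buffer into CPU-resident chunks
        self.chunks = [torch.zeros(min(chunk_size, num_params - start))
                       for start in range(0, num_params, chunk_size)]

    def fetch(self, chunk_id: int) -> torch.Tensor:
        """Move one chunk to the compute device right before it is used."""
        self.chunks[chunk_id] = self.chunks[chunk_id].to(DEVICE)
        return self.chunks[chunk_id]

    def release(self, chunk_id: int) -> None:
        """Evict the chunk back to CPU memory once the computation is done."""
        self.chunks[chunk_id] = self.chunks[chunk_id].to("cpu")

mgr = ChunkManager(num_params=3_000_000)
w = mgr.fetch(0)      # chunk 0 is on the GPU (if one is available) for this step
w.add_(0.1)           # pretend the forward/backward pass touches it
mgr.release(0)        # free device memory for the next chunk
print(len(mgr.chunks), mgr.chunks[0].device)
```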
- High-Performance Training by Exploiting Hot-Embeddings in Recommendation Systems [2.708848417398231]
Recommendation models are commonly used learning models that suggest relevant items to a user for e-commerce and online advertisement-based applications.
These models use massive embedding tables to store numerical representations of items' and users' categorical variables.
Due to these conflicting compute and memory requirements, the training process for recommendation models is divided across CPU and GPU.
This paper tries to leverage skewed embedding table accesses to efficiently use the GPU resources during training.
arXiv Detail & Related papers (2021-03-01T01:43:26Z)
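The hot-embeddings paper above builds on the observation that embedding-table accesses are heavily skewed. The snippet below is a small, assumption-laden illustration of how one might measure that skew and pick which rows to keep within a limited GPU budget; the Zipf-distributed id stream and the budget of 1,000 rows are made up for the example.

```python
import numpy as np
from collections import Counter

# Count how often each embedding row is touched over a window of traffic and
# keep only the hottest rows within a small GPU budget, leaving the long tail
# of rarely-accessed rows in host memory.
rng = np.random.default_rng(0)
access_log = rng.zipf(1.2, size=100_000) % 50_000   # made-up skewed id stream

counts = Counter(access_log.tolist())
GPU_BUDGET_ROWS = 1_000                             # made-up cache capacity
hot_ids = {eid for eid, _ in counts.most_common(GPU_BUDGET_ROWS)}

hot_hits = sum(counts[i] for i in hot_ids)
print(f"{len(hot_ids)} hot rows cover {hot_hits / access_log.size:.1%} of accesses")
```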
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
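DCT is summarized above as reducing communication during data and model parallelism; its actual thresholding rule is not reproduced here. The sketch below shows a generic magnitude-threshold message compressor in the same spirit, where the `target_ratio` parameter and the quantile-based threshold are assumptions.

```python
import numpy as np

# Generic threshold-based communication compression: only gradient entries whose
# magnitude exceeds a threshold are transmitted between workers, cutting the
# bytes exchanged during distributed training.
def threshold_message(grad: np.ndarray, target_ratio: float = 0.01):
    """Pick a threshold so roughly `target_ratio` of entries survive, then send (index, value) pairs."""
    thresh = np.quantile(np.abs(grad), 1.0 - target_ratio)
    keep = np.flatnonzero(np.abs(grad) >= thresh)
    return keep.astype(np.int32), grad.ravel()[keep], grad.shape

grad = np.random.randn(4096, 256).astype(np.float32)
idx, vals, shape = threshold_message(grad)
print(f"sent {idx.nbytes + vals.nbytes} bytes instead of {grad.nbytes}")
```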
- Joint Parameter-and-Bandwidth Allocation for Improving the Efficiency of Partitioned Edge Learning [73.82875010696849]
Machine learning algorithms are deployed at the network edge for training artificial intelligence (AI) models.
This paper focuses on the novel joint design of parameter (computation load) allocation and bandwidth allocation.
arXiv Detail & Related papers (2020-03-10T05:52:15Z)