Communication-Efficient TeraByte-Scale Model Training Framework for
Online Advertising
- URL: http://arxiv.org/abs/2201.05500v1
- Date: Wed, 5 Jan 2022 18:09:11 GMT
- Title: Communication-Efficient TeraByte-Scale Model Training Framework for
Online Advertising
- Authors: Weijie Zhao, Xuewu Jiao, Mingqing Hu, Xiaoyun Li, Xiangyu Zhang, Ping
Li
- Abstract summary: Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry.
We identify two major challenges in existing GPU training for massive-scale ad models.
We propose a hardware-aware training workflow that couples the hardware topology into the algorithm design.
- Score: 32.5337643852876
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Click-Through Rate (CTR) prediction is a crucial component in the online
advertising industry. In order to produce a personalized CTR prediction, an
industry-level CTR prediction model commonly takes a high-dimensional (e.g.,
100 or 1000 billion features) sparse vector (that is encoded from query
keywords, user portraits, etc.) as input. As a result, the model requires
terabyte-scale parameters to embed the high-dimensional input. The hierarchical
distributed GPU parameter server has been proposed to enable GPUs with limited
memory to train such a massive network by leveraging CPU main memory and SSDs as
secondary storage. We identify two major challenges in the existing GPU training
framework for massive-scale ad models and propose a collection of optimizations
to tackle them: (a) the GPU, CPU, and SSD communicate with each other rapidly
during training, and since the connections between GPUs and CPUs are non-uniform
due to the hardware topology, the data communication routes should be optimized
according to that topology; (b) GPUs in different computing nodes frequently
communicate to synchronize parameters, and these communications must be
optimized so that the distributed system scales.
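To make challenge (a) concrete, here is a minimal illustrative sketch (not the paper's implementation; the GPU-to-NUMA hop costs are hypothetical) of routing host-device parameter transfers through the topologically closest CPU memory node:

```python
# Illustrative sketch only: choose, for each GPU, the CPU/NUMA node with the
# cheapest link so host<->device embedding transfers avoid cross-socket hops.
# The hop-cost table below is hypothetical, not measured from real hardware.

# hop_cost[gpu_id][numa_node]: smaller means closer in the hardware topology
hop_cost = {
    0: {0: 1, 1: 3},  # GPUs 0 and 1 sit under NUMA node 0
    1: {0: 1, 1: 3},
    2: {0: 3, 1: 1},  # GPUs 2 and 3 sit under NUMA node 1
    3: {0: 3, 1: 1},
}

def nearest_numa(gpu_id: int) -> int:
    """Return the NUMA node with the lowest hop cost for this GPU."""
    costs = hop_cost[gpu_id]
    return min(costs, key=costs.get)

# Route each GPU's pull/push of embedding parameters through its local node.
routing = {gpu: nearest_numa(gpu) for gpu in hop_cost}
print(routing)  # {0: 0, 1: 0, 2: 1, 3: 1}
```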
In this paper, we propose a hardware-aware training workflow that couples the
hardware topology into the algorithm design. To reduce the extensive
communication between computing nodes, we introduce a $k$-step model merging
algorithm for the popular Adam optimizer and provide its convergence rate in
non-convex optimization. To the best of our knowledge, this is the first
application of a $k$-step adaptive optimization method in industrial-level CTR
model training. The numerical results on real-world data confirm that the
optimized system design considerably reduces the training time of the massive
model, with essentially no loss in accuracy.
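As a rough illustration of the $k$-step idea, the sketch below assumes plain parameter averaging across workers every $k$ local Adam steps on dense parameters; the paper's actual merging rule, treatment of optimizer state, and convergence analysis are given in the full text, and all names here are illustrative.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update performed locally on a worker."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def train_k_step_merge(grads_per_worker, w0, k=8):
    """grads_per_worker[i][t]: worker i's gradient at local step t."""
    workers = [{"w": w0.copy(), "m": np.zeros_like(w0),
                "v": np.zeros_like(w0), "t": 0} for _ in grads_per_worker]
    for t in range(len(grads_per_worker[0])):
        for i, st in enumerate(workers):   # one local Adam step per worker, no communication
            st["t"] += 1
            st["w"], st["m"], st["v"] = adam_step(
                st["w"], grads_per_worker[i][t], st["m"], st["v"], st["t"])
        if (t + 1) % k == 0:               # merge (average) parameters only every k steps
            w_avg = np.mean([st["w"] for st in workers], axis=0)
            for st in workers:
                st["w"] = w_avg.copy()
    return np.mean([st["w"] for st in workers], axis=0)
```

Merging once every $k$ steps instead of after every step reduces inter-node synchronization traffic roughly by a factor of $k$, which is where the communication saving comes from.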
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve a 1.45-9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Partitioned Neural Network Training via Synthetic Intermediate Labels [0.0]
GPU memory constraints have become a notable bottleneck in training such sizable models.
This study advocates partitioning the model across GPUs and generating synthetic intermediate labels to train individual segments.
This approach results in a more efficient training process that minimizes data communication while maintaining model accuracy.
arXiv Detail & Related papers (2024-03-17T13:06:29Z)
- CORE: Common Random Reconstruction for Distributed Optimization with Provable Low Communication Complexity [110.50364486645852]
Communication complexity has become a major bottleneck for speeding up training and scaling up the number of machines.
We propose CORE (Common Random Reconstruction), which can be used to compress information transmitted between machines.
arXiv Detail & Related papers (2023-09-23T08:45:27Z)
- Towards a Better Theoretical Understanding of Independent Subnetwork Training [56.24689348875711]
We take a closer theoretical look at Independent Subnetwork Training (IST).
IST is a recently proposed and highly effective technique for solving the aforementioned problems.
We identify fundamental differences between IST and alternative approaches, such as distributed methods with compressed communication.
arXiv Detail & Related papers (2023-06-28T18:14:22Z)
- ParaGraph: Weighted Graph Representation for Performance Optimization of HPC Kernels [1.304892050913381]
We introduce a new graph-based program representation for parallel applications that extends the Abstract Syntax Tree.
We evaluate our proposed representation by training a Graph Neural Network (GNN) to predict the runtime of an OpenMP code region.
Results show that our approach is effective, with a normalized RMSE between 0.004 and 0.01 in its runtime predictions.
arXiv Detail & Related papers (2023-04-07T05:52:59Z)
- Scalable Graph Convolutional Network Training on Distributed-Memory Systems [5.169989177779801]
Graph Convolutional Networks (GCNs) are extensively utilized for deep learning on graphs.
Since the convolution operation on graphs induces irregular memory access patterns, designing a memory- and communication-efficient parallel algorithm for GCN training poses unique challenges.
We propose a highly parallel training algorithm that scales to large processor counts.
arXiv Detail & Related papers (2022-12-09T17:51:13Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
With a latency- and accuracy-aware reward design, such a framework can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting 5G URLLC.
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers [65.60007071024629]
We show experimentally that Adaptive SGD outperforms four state-of-the-art solutions in time-to-accuracy.
arXiv Detail & Related papers (2021-10-13T20:58:15Z)
- ScaleFreeCTR: MixCache-based Distributed Training System for CTR Models with Huge Embedding Table [23.264897780201316]
Various deep Click-Through Rate (CTR) models are deployed in commercial systems by industrial companies.
To achieve better performance, it is necessary to train the deep CTR models on huge volume of training data efficiently.
We propose the ScaleFreeCTR: a MixCache-based distributed training system for CTR models.
arXiv Detail & Related papers (2021-04-17T13:36:19Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least $100\times$ and $20\times$ during DP (data parallelism) and MP (model parallelism), respectively; a rough thresholding sketch follows this list.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- Heterogeneous CPU+GPU Stochastic Gradient Descent Algorithms [1.3249453757295084]
We study training algorithms for deep learning on heterogeneous CPU+GPU architectures.
Our two-fold objective -- maximize convergence rate and resource utilization simultaneously -- makes the problem challenging.
We show that the implementation of these algorithms achieves both faster convergence and higher resource utilization on several real datasets.
arXiv Detail & Related papers (2020-04-19T05:21:20Z)
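For the DCT entry above, the following minimal sketch shows generic threshold-based gradient compression (an assumed mechanism for illustration only; the published DCT algorithm, its threshold schedule, and its integration with DP/MP differ):

```python
import numpy as np

def compress(grad: np.ndarray, threshold: float):
    """Keep only large-magnitude entries; only (indices, values) are sent."""
    idx = np.nonzero(np.abs(grad) >= threshold)[0]
    return idx, grad[idx]

def decompress(idx: np.ndarray, values: np.ndarray, size: int) -> np.ndarray:
    """Rebuild a dense gradient on the receiver, with zeros elsewhere."""
    out = np.zeros(size, dtype=values.dtype)
    out[idx] = values
    return out

g = np.random.randn(1_000_000).astype(np.float32)
idx, vals = compress(g, threshold=2.5)
g_hat = decompress(idx, vals, g.size)
print(f"sent {idx.size}/{g.size} entries "
      f"({g.size / max(idx.size, 1):.0f}x fewer values)")
```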