GBA: A Tuning-free Approach to Switch between Synchronous and
Asynchronous Training for Recommendation Model
- URL: http://arxiv.org/abs/2205.11048v1
- Date: Mon, 23 May 2022 05:22:42 GMT
- Title: GBA: A Tuning-free Approach to Switch between Synchronous and
Asynchronous Training for Recommendation Model
- Authors: Wenbo Su, Yuanxing Zhang, Yufeng Cai, Kaixu Ren, Pengjie Wang, Huimin
Yi, Yue Song, Jing Chen, Hongbo Deng, Jian Xu, Lin Qu, Bo Zheng
- Abstract summary: We propose Global Batch gradients Aggregation (GBA) over the parameter server (PS) architecture.
A token-control process assembles the gradients and decays those with severe staleness.
Experiments on three industrial-scale recommendation tasks show that GBA is an effective tuning-free approach for switching between the two training modes.
- Score: 19.65557684234458
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-concurrency asynchronous training upon parameter server (PS)
architecture and high-performance synchronous training upon all-reduce (AR)
architecture are the most commonly deployed distributed training modes for
recommender systems. Although synchronous AR training is designed for higher
training efficiency, asynchronous PS training can be the better choice in
terms of training speed when there are stragglers (slow workers) in the shared
cluster, especially under limited computing resources. To take full advantage
of these two training modes, an ideal approach is to switch between them
according to the cluster status. We identify two obstacles to a tuning-free
switch: the different distributions of the gradient values and the stale
gradients from the stragglers. In this paper, we propose Global Batch
gradients Aggregation (GBA)
over PS, which aggregates and applies gradients with the same global batch size
as the synchronous training. A token-control process is implemented to assemble
the gradients and decay those with severe staleness. We provide a
convergence analysis to demonstrate the robustness of GBA on
recommendation models against gradient staleness. Experiments on three
industrial-scale recommendation tasks show that GBA is an effective tuning-free
approach for switching. Compared to the state-of-the-art derived asynchronous
training, GBA achieves up to 0.2% improvement on the AUC metric, which is
significant for recommendation models. Meanwhile, under strained hardware
resources, GBA achieves at least a 2.4x speedup over synchronous training.
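The abstract describes the core mechanism in prose: worker gradients are buffered on the PS until they cover the same global batch size as synchronous training, and a token-control process decays gradients with severe staleness before the averaged update is applied. The following is a minimal, illustrative sketch of that idea under stated assumptions; the class name, the exponential decay rule, and the threshold/decay hyperparameters are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np


class GBAAggregator:
    """Illustrative PS-side aggregator (not the authors' code): buffer worker
    gradients until they cover one synchronous-sized global batch, decay
    severely stale ones, then apply a single averaged update."""

    def __init__(self, dim, global_batch_size, lr=0.01,
                 staleness_threshold=2, decay_factor=0.5):
        self.params = np.zeros(dim)            # model parameters held on the PS
        self.global_batch_size = global_batch_size
        self.lr = lr
        self.staleness_threshold = staleness_threshold  # tolerated staleness (assumed)
        self.decay_factor = decay_factor                # decay per extra stale step (assumed)
        self.version = 0                       # global step / token counter
        self._buffer = []                      # list of (gradient, weight) pairs
        self._samples = 0                      # samples covered by buffered gradients

    def push(self, grad, local_batch_size, worker_version):
        """A worker pushes a gradient computed on parameters of `worker_version`."""
        staleness = self.version - worker_version
        # Gradients within the tolerated staleness keep full weight; beyond that
        # they are decayed exponentially (an assumed decay rule, for illustration).
        decay = 1.0 if staleness <= self.staleness_threshold else \
            self.decay_factor ** (staleness - self.staleness_threshold)
        self._buffer.append((grad, decay * local_batch_size))
        self._samples += local_batch_size
        if self._samples >= self.global_batch_size:
            self._apply_global_batch()

    def _apply_global_batch(self):
        # Weighted average over the assembled global batch, then one SGD step.
        total_weight = sum(w for _, w in self._buffer)
        aggregated = sum(w * g for g, w in self._buffer) / max(total_weight, 1e-12)
        self.params -= self.lr * aggregated
        self.version += 1                      # advance the global step
        self._buffer.clear()
        self._samples = 0


# Toy usage: two workers each contribute half of a global batch of 8 samples.
ps = GBAAggregator(dim=4, global_batch_size=8)
ps.push(np.random.randn(4), local_batch_size=4, worker_version=ps.version)
ps.push(np.random.randn(4), local_batch_size=4, worker_version=ps.version)
```

Because gradients are only applied once a full synchronous-sized global batch has been assembled, the same sketch behaves like synchronous training when workers are healthy and degrades gracefully into decay-weighted asynchronous aggregation when stragglers appear.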
Related papers
- Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout the whole process of model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Accelerating Distributed ML Training via Selective Synchronization [0.0]
SelSync is a practical, low-overhead method for DNN training that dynamically chooses to incur or avoid communication at each step (a minimal sketch of this selective-synchronization idea appears after this list).
Our system converges to the same or better accuracy than BSP while reducing training time by up to 14x.
arXiv Detail & Related papers (2023-07-16T05:28:59Z) - TimelyFL: Heterogeneity-aware Asynchronous Federated Learning with
Adaptive Partial Training [17.84692242938424]
TimelyFL is a heterogeneity-aware asynchronous Federated Learning framework with adaptive partial training.
We show that TimelyFL improves participation rate by 21.13%, harvests 1.28x - 2.89x more efficiency on convergence rate, and provides a 6.25% increment on test accuracy.
arXiv Detail & Related papers (2023-04-14T06:26:08Z) - Stochastic Coded Federated Learning: Theoretical Analysis and Incentive
Mechanism Design [18.675244280002428]
We propose a novel FL framework named stochastic coded federated learning (SCFL) that leverages coded computing techniques.
In SCFL, each edge device uploads a privacy-preserving coded dataset to the server, which is generated by adding noise to the projected local dataset.
We show that SCFL learns a better model within the given time and achieves a better privacy-performance tradeoff than the baseline methods.
arXiv Detail & Related papers (2022-11-08T09:58:36Z) - Semi-Synchronous Personalized Federated Learning over Mobile Edge
Networks [88.50555581186799]
We propose a semi-synchronous PFL algorithm, termed Semi-Synchronous Personalized Federated Averaging (PerFedS2), over mobile edge networks.
We derive an upper bound of the convergence rate of PerFedS2 in terms of the number of participants per global round and the number of rounds.
Experimental results verify the effectiveness of PerFedS2 in saving training time as well as guaranteeing the convergence of training loss.
arXiv Detail & Related papers (2022-09-27T02:12:43Z) - Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep
Learning [10.196574441542646]
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters.
A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol.
In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP.
arXiv Detail & Related papers (2021-04-16T20:49:28Z) - High-Throughput Synchronous Deep RL [132.43861715707905]
We propose High-Throughput Synchronous Deep Reinforcement Learning (HTS-RL).
We perform learning and rollouts concurrently and devise a system design which avoids 'stale policies'.
We evaluate our approach on Atari games and the Google Research Football environment.
arXiv Detail & Related papers (2020-12-17T18:59:01Z) - An Efficient Asynchronous Method for Integrating Evolutionary and
Gradient-based Policy Search [76.73477450555046]
We introduce an Asynchronous Evolution Strategy-Reinforcement Learning (AES-RL) that maximizes the parallel efficiency of ES and integrates it with policy gradient methods.
Specifically, we propose 1) a novel framework to merge ES and DRL asynchronously and 2) various asynchronous update methods that can take full advantage of asynchronism, ES, and DRL.
arXiv Detail & Related papers (2020-12-10T02:30:48Z) - Sparse Communication for Training Deep Networks [56.441077560085475]
Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models.
In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers.
We study several compression schemes and identify how three key parameters affect the performance.
arXiv Detail & Related papers (2020-09-19T17:28:11Z) - Scaling Distributed Deep Learning Workloads beyond the Memory Capacity
with KARMA [58.040931661693925]
We propose a strategy that combines redundant recomputing and out-of-core methods.
We achieve an average of 1.52x speedup in six different models over the state-of-the-art out-of-core methods.
Our data parallel out-of-core solution can outperform complex hybrid model parallelism in training large models, e.g. Megatron-LM and Turing-NLG.
arXiv Detail & Related papers (2020-08-26T07:24:34Z) - Adaptive Braking for Mitigating Gradient Delay [0.8602553195689513]
We introduce Adaptive Braking, a modification for momentum-based gradients that mitigates the effects of gradient delay.
We show that applying AB on top of SGD with momentum enables training ResNets on CIFAR-10 and ImageNet-1k under gradient delays with minimal drop in final test accuracy.
arXiv Detail & Related papers (2020-07-02T21:26:27Z)
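The SelSync entry above describes deciding, at every step, whether to synchronize gradients across workers or to apply them locally. Below is a minimal sketch of that general selective-synchronization idea; the significance test (relative update magnitude), the function name, and the hyperparameters are illustrative assumptions, not the criterion used in the SelSync paper.

```python
import numpy as np


def selective_sync_step(params, local_grad, all_reduce, lr=0.01, threshold=1e-3):
    """One training step that communicates only when the local update is significant."""
    # Relative magnitude of the would-be local update (illustrative criterion,
    # not necessarily the rule used by SelSync).
    significance = np.linalg.norm(lr * local_grad) / (np.linalg.norm(params) + 1e-12)
    if significance > threshold:
        grad = all_reduce(local_grad)   # synchronize: average gradients across workers
    else:
        grad = local_grad               # skip communication: purely local update
    return params - lr * grad


# Single-process usage with a stand-in all_reduce (identity here).
params = np.ones(4)
grad = np.array([0.1, -0.2, 0.05, 0.0])
params = selective_sync_step(params, grad, all_reduce=lambda g: g)
```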
This list is automatically generated from the titles and abstracts of the papers on this site.