ShadowSync: Performing Synchronization in the Background for Highly
Scalable Distributed Training
- URL: http://arxiv.org/abs/2003.03477v3
- Date: Tue, 23 Feb 2021 18:23:31 GMT
- Title: ShadowSync: Performing Synchronization in the Background for Highly
Scalable Distributed Training
- Authors: Qinqing Zheng, Bor-Yiing Su, Jiyan Yang, Alisson Azzolini, Qiang Wu,
Ou Jin, Shri Karandikar, Hagay Lupesko, Liang Xiong, Eric Zhou
- Abstract summary: We present ShadowSync, a distributed framework specifically tailored to modern-scale recommendation system training.
In contrast to previous works where synchronization happens as part of the training process, ShadowSync separates the synchronization from training and runs it in the background.
- Score: 10.73956838502053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recommendation systems are often trained with a tremendous amount of data,
and distributed training is the workhorse to shorten the training time. While
the training throughput can be increased by simply adding more workers, it is
also increasingly challenging to preserve the model quality. In this paper, we
present ShadowSync, a distributed framework specifically tailored to modern-scale
recommendation system training. In contrast to previous works where
synchronization happens as part of the training process, ShadowSync separates
the synchronization from training and runs it in the background. Such isolation
significantly reduces the synchronization overhead and increases the
synchronization frequency, so that we are able to obtain both high throughput
and excellent model quality when training at scale. The superiority of our
procedure is confirmed by experiments on training deep neural networks for
click-through-rate prediction tasks. Our framework can express data
parallelism and/or model parallelism, is generic enough to host various types of
synchronization algorithms, and is readily applicable to large-scale problems in
other areas.
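The key mechanism described in the abstract, synchronization decoupled from the training loop and run in the background, can be illustrated with a minimal single-machine sketch. This is not the paper's implementation: the parameter-averaging rule, the sync interval, and all names (Worker, background_sync, SYNC_INTERVAL_S) are illustrative assumptions. The point of the design is that trainer threads never block on communication; the sync thread pulls workers toward a shared average at its own cadence.

```python
# Minimal sketch of "synchronization in the background": trainer threads update
# local parameters continuously while a separate thread periodically averages
# them, off the training critical path. All names and the averaging rule are
# illustrative assumptions, not the paper's actual API or algorithm.
import threading
import time

import numpy as np

DIM = 8                 # toy model size
SYNC_INTERVAL_S = 0.05  # hypothetical background sync period


class Worker:
    def __init__(self, seed):
        self.params = np.zeros(DIM)
        self.lock = threading.Lock()
        self.rng = np.random.default_rng(seed)

    def train_step(self):
        # Stand-in for a real SGD step on a minibatch.
        grad = self.rng.normal(size=DIM)
        with self.lock:
            self.params -= 0.01 * grad


def background_sync(workers, stop):
    # Runs concurrently with training: average parameters across workers and
    # pull each worker toward the average, without pausing the training loops.
    while not stop.is_set():
        avg = np.mean([w.params for w in workers], axis=0)
        for w in workers:
            with w.lock:
                w.params += 0.5 * (avg - w.params)
        time.sleep(SYNC_INTERVAL_S)


if __name__ == "__main__":
    workers = [Worker(seed=i) for i in range(4)]
    stop = threading.Event()

    def train(w):
        for _ in range(2000):
            w.train_step()

    syncer = threading.Thread(target=background_sync, args=(workers, stop))
    syncer.start()
    trainers = [threading.Thread(target=train, args=(w,)) for w in workers]
    for t in trainers:
        t.start()
    for t in trainers:
        t.join()
    stop.set()
    syncer.join()
    print("max parameter divergence across workers:",
          np.max(np.std([w.params for w in workers], axis=0)))
```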
Related papers
- Streaming DiLoCo with overlapping communication: Towards a Distributed Free Lunch [66.84195842685459]
Training of large language models (LLMs) is typically distributed across a large number of accelerators to reduce training time.
Recently, distributed algorithms like DiLoCo have relaxed the constraint that these accelerators be co-located.
We show experimentally that we can distribute training of billion-parameter models this way and reach quality similar to the co-located setting.
arXiv Detail & Related papers (2025-01-30T17:23:50Z) - Synchformer: Efficient Synchronization from Sparse Cues [100.89656994681934]
Our contributions include a novel audio-visual synchronization model, and training that decouples feature extraction from synchronization modelling.
This approach achieves state-of-the-art performance in both dense and sparse settings.
We also extend synchronization model training to AudioSet, a million-scale 'in-the-wild' dataset, investigate evidence attribution techniques for interpretability, and explore a new capability for synchronization models: audio-visual synchronizability.
arXiv Detail & Related papers (2024-01-29T18:59:55Z) - Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention as a way to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices throughout model training.
We propose TEASQ-Fed, which lets edge devices asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z) - Empowering Distributed Training with Sparsity-driven Data Synchronization [33.95040042348349]
Distributed training is the de facto standard to scale up the training of deep learning models with multiple GPUs.
We first analyze the characteristics of sparse tensors in popular models to understand the fundamentals of sparsity.
We then systematically explore the design space of communication schemes for sparse tensors and find the optimal ones.
We demonstrate that Zen can achieve up to 5.09x speedup in communication time and up to 2.48x speedup in training throughput.
arXiv Detail & Related papers (2023-09-23T04:32:48Z) - Accelerating Distributed ML Training via Selective Synchronization [0.0]
SelSync is a practical, low-overhead method for DNN training that dynamically chooses whether to incur or avoid communication at each step (a rough sketch of this per-step decision appears after this list).
Our system converges to the same or better accuracy than BSP while reducing training time by up to 14x.
arXiv Detail & Related papers (2023-07-16T05:28:59Z) - Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z) - Efficient and Light-Weight Federated Learning via Asynchronous
Distributed Dropout [22.584080337157168]
Asynchronous learning protocols have regained attention lately, especially in the Federated Learning (FL) setup.
We propose AsyncDrop, a novel asynchronous FL framework that utilizes dropout regularization to handle device heterogeneity in distributed settings.
Overall, AsyncDrop achieves better performance compared to state-of-the-art asynchronous methodologies.
arXiv Detail & Related papers (2022-10-28T13:00:29Z) - How Well Self-Supervised Pre-Training Performs with Streaming Data? [73.5362286533602]
In real-world scenarios where data are collected in a streaming fashion, the joint training scheme is usually storage-heavy and time-consuming.
It is unclear how well sequential self-supervised pre-training performs with streaming data.
We find sequential self-supervised learning exhibits almost the same performance as the joint training when the distribution shifts within streaming data are mild.
arXiv Detail & Related papers (2021-04-25T06:56:48Z) - Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep
Learning [10.196574441542646]
Stochastic Gradient Descent (SGD) has become the de facto way to train deep neural networks in distributed clusters.
A critical factor in determining the training throughput and model accuracy is the choice of the parameter synchronization protocol.
In this paper, we design a hybrid synchronization approach that exploits the benefits of both BSP and ASP.
arXiv Detail & Related papers (2021-04-16T20:49:28Z) - Event-based Asynchronous Sparse Convolutional Networks [54.094244806123235]
Event cameras are bio-inspired sensors that respond to per-pixel brightness changes in the form of asynchronous and sparse "events".
We present a general framework for converting models trained on synchronous image-like event representations into asynchronous models with identical output.
We show both theoretically and experimentally that this drastically reduces the computational complexity and latency of high-capacity, synchronous neural networks.
arXiv Detail & Related papers (2020-03-20T08:39:49Z)
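Referring back to the SelSync entry above: the idea of deciding, at every step, whether to synchronize or apply updates locally can be sketched roughly as follows. The threshold on the relative size of the update used here is an assumed stand-in for the paper's actual decision criterion, and all names (selective_step, DELTA) are hypothetical.

```python
# Rough illustration of step-wise selective synchronization (cf. the SelSync
# entry above): at each step, either all-reduce gradients across workers or
# apply them locally, based on how significant the local update is. The
# threshold rule below is an assumption, not the paper's actual criterion.
import numpy as np

DELTA = 0.05  # hypothetical significance threshold


def selective_step(params_per_worker, grads_per_worker, lr=0.1, delta=DELTA):
    """One training step for all workers; returns True if this step synchronized."""
    update_norm = np.mean([np.linalg.norm(lr * g) for g in grads_per_worker])
    weight_norm = np.mean([np.linalg.norm(p) for p in params_per_worker]) + 1e-12

    if update_norm / weight_norm > delta:
        # "Significant" step: synchronize by averaging gradients (BSP-like all-reduce).
        avg_grad = np.mean(grads_per_worker, axis=0)
        for p in params_per_worker:
            p -= lr * avg_grad
        return True

    # Otherwise skip communication: each worker applies its own gradient locally.
    for p, g in zip(params_per_worker, grads_per_worker):
        p -= lr * g
    return False


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    params = [np.ones(4) for _ in range(4)]
    synced = sum(
        selective_step(params, [rng.normal(size=4) for _ in params])
        for _ in range(100)
    )
    print(f"synchronized on {synced}/100 steps")
```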