Distributed Deep Learning Using Volunteer Computing-Like Paradigm
- URL: http://arxiv.org/abs/2103.08894v1
- Date: Tue, 16 Mar 2021 07:32:58 GMT
- Title: Distributed Deep Learning Using Volunteer Computing-Like Paradigm
- Authors: Medha Atre and Birendra Jha and Ashwini Rao
- Abstract summary: Training Deep Learning models with a large number of parameters and/or large datasets can become prohibitively expensive and slow.
The cost of current solutions, built predominantly for cluster computing systems, can still be an issue.
We design a distributed solution that can run DL training on a Volunteer Computing (VC) system by using a data parallel approach.
- Score: 0.09668407688201358
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Use of Deep Learning (DL) in commercial applications such as image
classification, sentiment analysis and speech recognition is increasing. When
training DL models with a large number of parameters and/or large datasets, the cost
and speed of training can become prohibitive. Distributed DL training solutions
that split a training job into subtasks and execute them over multiple nodes
can decrease training time. However, the cost of current solutions, built
predominantly for cluster computing systems, can still be an issue. In contrast
to cluster computing systems, Volunteer Computing (VC) systems can lower the
cost of computing, but applications running on VC systems have to handle fault
tolerance, variable network latency and heterogeneity of compute nodes, and the
current solutions are not designed to do so. We design a distributed solution
that can run DL training on a VC system by using a data parallel approach. We
implement a novel asynchronous SGD scheme called VC-ASGD suited for VC systems.
In contrast to traditional VC systems that lower cost by using untrustworthy
volunteer devices, we lower cost by leveraging preemptible computing instances
on commercial cloud platforms. By using preemptible instances that require
applications to be fault tolerant, we lower cost by 70-90% and improve data
security.
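As a rough illustration of the data parallel, fault-tolerant training idea described in the abstract, the sketch below shows a staleness-aware asynchronous SGD loop in which simulated preemptible workers may be lost before reporting a gradient. It is a minimal sketch under assumed class names, hyperparameters, and a toy least-squares objective, not the paper's VC-ASGD implementation.

```python
# Minimal illustrative sketch (not the paper's VC-ASGD): staleness-aware
# asynchronous SGD with data-parallel workers on simulated preemptible nodes.
import random
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr=0.05):
        self.w = np.zeros(dim)
        self.lr = lr
        self.version = 0                      # bumped on every applied update

    def pull(self):
        return self.w.copy(), self.version

    def push(self, grad, worker_version):
        # Discount stale gradients: a worker that was slow, or restarted after
        # preemption, may have computed its gradient against old parameters.
        staleness = self.version - worker_version
        self.w -= self.lr * grad / (1.0 + staleness)
        self.version += 1

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 5))
    true_w = rng.normal(size=5)
    y = X @ true_w
    server = ParameterServer(dim=5)

    for _ in range(200):
        # One "round": 4 workers pull the current parameters; some are
        # preempted and never report back; the rest push asynchronously,
        # so later pushes arrive with nonzero staleness.
        pending = []
        for _ in range(4):
            if random.random() < 0.3:         # simulated preemption: update lost
                continue
            w, version = server.pull()
            idx = rng.choice(len(X), size=8, replace=False)
            grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / 8
            pending.append((grad, version))
        for grad, version in pending:
            server.push(grad, version)

    print("parameter error:", float(np.linalg.norm(server.w - true_w)))
```

The 1/(1 + staleness) damping used here is just one common way to tolerate delayed updates in asynchronous SGD; the scheme actually used by VC-ASGD may differ.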
Related papers
- Exploring the Impact of Serverless Computing on Peer To Peer Training
Machine Learning [0.3441021278275805]
We introduce a novel architecture that combines serverless computing with P2P networks for distributed training.
Our findings show a significant reduction in computation time, with up to a 97.34% improvement compared to conventional P2P distributed training methods.
Despite the cost-time trade-off, the serverless approach still holds promise due to its pay-as-you-go model.
arXiv Detail & Related papers (2023-09-25T13:51:07Z) - How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, trumping both more centralized and powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - Analysis of Distributed Deep Learning in the Cloud [17.91202259637393]
We introduce a comprehensive distributed deep learning (DDL) profiler, which can determine the various execution "stalls" that DDL suffers from while running on a public cloud.
We estimate two types of communication stalls - interconnect and network stalls.
We train popular DNN models using the profiler to characterize various AWS GPU instances and list their advantages and shortcomings for users to make an informed decision.
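As a rough illustration of the kind of breakdown such a profiler reports, the hedged sketch below times the compute phase and the gradient-synchronization phase of a training step separately and reports the fraction of each iteration spent stalled on communication; the sleep-based phases are stand-ins and assumptions, not the paper's tool.

```python
# Illustrative sketch only: per-iteration decomposition of training time into
# compute and communication, the kind of "stall" breakdown a DDL profiler reports.
import time

def compute_gradients():
    time.sleep(0.010)            # stand-in for the forward/backward pass

def synchronize_gradients():
    time.sleep(0.004)            # stand-in for an all-reduce over the network

if __name__ == "__main__":
    compute_t = comm_t = 0.0
    for _ in range(20):
        t0 = time.perf_counter()
        compute_gradients()
        t1 = time.perf_counter()
        synchronize_gradients()
        t2 = time.perf_counter()
        compute_t += t1 - t0
        comm_t += t2 - t1
    total = compute_t + comm_t
    print(f"compute: {compute_t / total:.0%}, communication stall: {comm_t / total:.0%}")
```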
arXiv Detail & Related papers (2022-08-30T15:42:36Z) - Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms are becoming extremely difficult to design.
It is currently not possible to deploy AI models on network devices due to their limited computational capabilities.
We build a computationally-light solution based on a recent reinforcement learning CC algorithm.
arXiv Detail & Related papers (2022-07-05T20:42:24Z) - Asynchronous Parallel Incremental Block-Coordinate Descent for
Decentralized Machine Learning [55.198301429316125]
Machine learning (ML) is a key technique for big-data-driven modelling and analysis in massive Internet of Things (IoT)-based intelligent and ubiquitous computing.
As applications and data volumes grow rapidly, distributed learning is a promising emerging paradigm, since it is often impractical or inefficient to share/aggregate data centrally.
This paper studies the problem of training an ML model over decentralized systems, where data are distributed over many user devices.
arXiv Detail & Related papers (2022-02-07T15:04:15Z) - ProgFed: Effective, Communication, and Computation Efficient Federated Learning by Progressive Training [65.68511423300812]
We propose ProgFed, a progressive training framework for efficient and effective federated learning.
ProgFed inherently reduces computation and two-way communication costs while maintaining the strong performance of the final models.
Our results show that ProgFed converges at the same rate as standard training on full models.
arXiv Detail & Related papers (2021-10-11T14:45:00Z) - DANCE: DAta-Network Co-optimization for Efficient Segmentation Model
Training and Inference [85.02494022662505]
DANCE is an automated simultaneous data-network co-optimization for efficient segmentation model training and inference.
It integrates automated data slimming, which adaptively downsamples/drops input images and controls their contribution to the training loss, guided by the images' spatial complexity.
Experiments and ablation studies demonstrate that DANCE can achieve "all-win" towards efficient segmentation.
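As a rough illustration of complexity-guided data slimming, the hedged sketch below downsamples inputs whose spatial complexity falls below a threshold, using mean gradient magnitude as an assumed complexity proxy; DANCE's actual data slimming is learned jointly with the network and coupled to the training loss.

```python
# Illustrative sketch only: adaptive input downsampling driven by a simple
# spatial-complexity proxy (mean finite-difference gradient magnitude).
import numpy as np

def spatial_complexity(img):
    """Mean absolute finite-difference gradient as a crude complexity measure."""
    gy = np.abs(np.diff(img, axis=0)).mean()
    gx = np.abs(np.diff(img, axis=1)).mean()
    return (gx + gy) / 2.0

def adaptive_downsample(img, threshold=0.05):
    """Downsample low-complexity images by 2x (simple striding); keep complex ones."""
    if spatial_complexity(img) < threshold:
        return img[::2, ::2]
    return img

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    flat = np.full((64, 64), 0.5)        # low complexity -> gets downsampled
    textured = rng.random((64, 64))      # high complexity -> kept at full size
    print(adaptive_downsample(flat).shape, adaptive_downsample(textured).shape)
```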
arXiv Detail & Related papers (2021-07-16T04:58:58Z) - Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed
Learning over Directed & Time-Varying Graphs with non-IID Datasets [2.518955020930418]
We propose Sparse-Push, a communication efficient decentralized distributed training algorithm.
The proposed algorithm enables a 466x reduction in communication with only 1% degradation in performance.
We demonstrate how communication compression can lead to significant performance degradation in the case of non-IID datasets.
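As a rough illustration of communication compression in this setting, the hedged sketch below applies generic top-k gradient sparsification with error feedback; it is not Sparse-Push's exact algorithm, and the class and parameter names are assumptions.

```python
# Illustrative sketch only: top-k gradient sparsification with local error
# feedback, a common communication-compression technique for distributed training.
import numpy as np

def topk_compress(grad, k):
    """Keep the k largest-magnitude entries of grad; zero out the rest."""
    idx = np.argpartition(np.abs(grad), -k)[-k:]
    sparse = np.zeros_like(grad)
    sparse[idx] = grad[idx]
    return sparse

class CompressedWorker:
    def __init__(self, dim, k):
        self.residual = np.zeros(dim)    # error-feedback accumulator
        self.k = k

    def message(self, grad):
        # Add back previously dropped mass, then transmit only the top-k entries.
        corrected = grad + self.residual
        sparse = topk_compress(corrected, self.k)
        self.residual = corrected - sparse
        return sparse

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    worker = CompressedWorker(dim=1000, k=10)   # ~100x fewer entries transmitted
    msg = worker.message(rng.normal(size=1000))
    print("nonzero entries sent:", np.count_nonzero(msg))
```

Error feedback (re-adding the dropped gradient mass on the next round) is a standard companion to top-k compression and helps keep aggressive sparsification from stalling convergence.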
arXiv Detail & Related papers (2021-02-10T19:41:11Z) - Characterizing and Modeling Distributed Training with Transient Cloud
GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z) - Machine Learning on Volatile Instances [40.19551148721116]
This work is the first to quantify how variations in the number of active worker nodes (as a result of preemption) affect SGD convergence and the time to train the model.
We propose cost-effective strategies to exploit volatile cloud instances that are cheaper than standard instances.
arXiv Detail & Related papers (2020-03-12T07:47:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.