Boosting Distributed Machine Learning Training Through Loss-tolerant
Transmission Protocol
- URL: http://arxiv.org/abs/2305.04279v1
- Date: Sun, 7 May 2023 14:01:52 GMT
- Title: Boosting Distributed Machine Learning Training Through Loss-tolerant
Transmission Protocol
- Authors: Zixuan Chen, Lei Shi, Xuandong Liu, Xin Ai, Sen Liu, and Yang Xu
- Abstract summary: Distributed Machine Learning (DML) systems are utilized to enhance the speed of model training in data centers (DCs) and edge nodes.
The Parameter Server (PS) communication architecture faces severe long-tail latency caused by many-to-one "incast" traffic patterns, negatively impacting training throughput.
The Loss-tolerant Transmission Protocol (LTP) allows partial loss of gradients during synchronization to avoid unneeded retransmission.
Early Close adjusts the loss-tolerant threshold based on network conditions, and Bubble Filling performs data correction to maintain training accuracy.
- Score: 11.161913989794257
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Distributed Machine Learning (DML) systems are utilized to enhance the speed
of model training in data centers (DCs) and edge nodes. The Parameter Server
(PS) communication architecture is commonly employed, but it faces severe
long-tail latency caused by many-to-one "incast" traffic patterns, negatively
impacting training throughput. To address this challenge, we design the
\textbf{L}oss-tolerant \textbf{T}ransmission \textbf{P}rotocol (LTP), which
permits partial loss of gradients during synchronization to avoid unneeded
retransmission and contributes to faster synchronization per iteration. LTP
implements loss-tolerant transmission through \textit{out-of-order
transmission} and \textit{out-of-order Acknowledges (ACKs)}. LTP employs
\textit{Early Close} to adjust the loss-tolerant threshold based on network
conditions and \textit{Bubble Filling} for data correction to maintain training
accuracy. LTP is implemented in C++ and integrated into PyTorch. Evaluations on
a testbed of 8 worker nodes and one PS node demonstrate that LTP can
significantly improve DML training task throughput by up to 30x compared to
traditional TCP congestion controls, with no sacrifice to final accuracy.
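The abstract describes the mechanisms but not their implementation details (the authors implement LTP in C++ and integrate it with PyTorch). Below is a minimal Python/NumPy sketch of the general idea only: chunked, loss-tolerant gradient delivery with an Early Close-style threshold and zero-valued Bubble Filling. The chunk size, the threshold policy, the zero-filling rule, and all function names are illustrative assumptions, not the paper's actual design.
```python
# Illustrative sketch of loss-tolerant gradient synchronization.
# Assumptions (not from the paper): 1024-element chunks, a simple
# threshold policy, and zero-filling for missing data.
import numpy as np

CHUNK = 1024  # gradient elements per simulated "packet" (hypothetical size)


def send_as_chunks(grad, loss_rate, rng):
    """Split a flat gradient into chunks and randomly drop some of them,
    emulating packets lost under incast congestion."""
    chunks = {}
    n_chunks = (len(grad) + CHUNK - 1) // CHUNK
    for i in range(n_chunks):
        if rng.random() >= loss_rate:  # this chunk survives the network
            chunks[i] = grad[i * CHUNK:(i + 1) * CHUNK].copy()
    return chunks, n_chunks


def early_close_threshold(observed_loss_rate, base=0.02, cap=0.10):
    """Hypothetical Early Close-style policy: tolerate somewhat more loss on a
    lossier network, up to a cap. The paper's actual rule is not in the abstract."""
    return min(cap, max(base, 2.0 * observed_loss_rate))


def loss_tolerant_receive(chunks, n_chunks, grad_len, tolerance):
    """Stop waiting once the missing fraction is within the tolerance and fill
    the gaps ("bubbles"); zero-filling is an assumption, not the paper's rule."""
    missing_frac = (n_chunks - len(chunks)) / n_chunks
    if missing_frac > tolerance:
        # A real protocol would keep waiting or request retransmission here.
        print(f"warning: {missing_frac:.1%} lost exceeds tolerance {tolerance:.1%}")
    out = np.zeros(grad_len)            # Bubble Filling with zeros
    for idx, data in chunks.items():    # chunks may have arrived out of order
        out[idx * CHUNK: idx * CHUNK + len(data)] = data
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.standard_normal(50_000)
    received, n = send_as_chunks(grad, loss_rate=0.03, rng=rng)
    tol = early_close_threshold(observed_loss_rate=0.03)
    approx = loss_tolerant_receive(received, n, len(grad), tolerance=tol)
    # A small, bounded fraction of the gradient is zeroed instead of being
    # retransmitted, trading a little gradient noise for lower tail latency.
    print("relative error:", np.linalg.norm(approx - grad) / np.linalg.norm(grad))
```
The key trade-off the sketch shows is that the receiver closes the transfer as soon as the missing fraction is under the threshold, so tail latency no longer depends on retransmitting the last few packets.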
Related papers
- Communication Efficient ConFederated Learning: An Event-Triggered SAGA
Approach [67.27031215756121]
Federated learning (FL) is a machine learning paradigm that targets model training without gathering the local data scattered over various data sources.
Standard FL, which employs a single server, can only support a limited number of users, leading to degraded learning capability.
In this work, we consider a multi-server FL framework, referred to as Confederated Learning (CFL), in order to accommodate a larger number of users.
arXiv Detail & Related papers (2024-02-28T03:27:10Z)
- Efficient Asynchronous Federated Learning with Sparsification and
Quantization [55.6801207905772]
Federated Learning (FL) is attracting more and more attention to collaboratively train a machine learning model without transferring raw data.
FL generally exploits a parameter server and a large number of edge devices during the whole process of the model training.
We propose TEASQ-Fed to exploit edge devices to asynchronously participate in the training process by actively applying for tasks.
arXiv Detail & Related papers (2023-12-23T07:47:07Z)
- Adaptive Federated Pruning in Hierarchical Wireless Networks [69.6417645730093]
Federated Learning (FL) is a privacy-preserving distributed learning framework where a server aggregates models updated by multiple devices without accessing their private datasets.
In this paper, we introduce model pruning for hierarchical federated learning (HFL) in wireless networks to reduce the neural network scale.
We show that our proposed HFL with model pruning achieves learning accuracy similar to HFL without model pruning while reducing communication cost by about 50 percent.
arXiv Detail & Related papers (2023-05-15T22:04:49Z)
- Efficient Federated Learning with Enhanced Privacy via Lottery Ticket
Pruning in Edge Computing [19.896989498650207]
Federated learning (FL) is a collaborative learning paradigm for decentralized private data from mobile terminals (MTs).
Existing privacy-preserving methods usually adopt instance-level differential privacy (DP).
We propose a Fed-enhanced FL framework with the Lottery Ticket Hypothesis (LTH) and zero-concentrated DP (zCDP).
arXiv Detail & Related papers (2023-05-02T13:02:09Z)
- Efficient Parallel Split Learning over Resource-constrained Wireless
Edge Networks [44.37047471448793]
In this paper, we advocate the integration of the edge computing paradigm and parallel split learning (PSL).
We propose an innovative PSL framework, namely, efficient parallel split learning (EPSL) to accelerate model training.
We show that the proposed EPSL framework significantly decreases the training latency needed to achieve a target accuracy.
arXiv Detail & Related papers (2023-03-26T16:09:48Z)
- Fixed-Weight Difference Target Propagation [12.559727665706687]
We present Fixed-Weight Difference Target Propagation (FW-DTP) that keeps the feedback weights constant during training.
FW-DTP consistently achieves higher test performance than the baseline Difference Target Propagation (DTP) on four classification datasets.
We also present a novel propagation architecture that explains the exact form of the feedback function of DTP to analyze FW-DTP.
arXiv Detail & Related papers (2022-12-19T13:34:36Z)
- Teal: Learning-Accelerated Optimization of WAN Traffic Engineering [68.7863363109948]
We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control.
To reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand.
Compared with other TE acceleration schemes, Teal satisfies 6--32% more traffic demand and yields 197--625x speedups.
arXiv Detail & Related papers (2022-10-25T04:46:30Z)
- OFedQIT: Communication-Efficient Online Federated Learning via
Quantization and Intermittent Transmission [7.6058140480517356]
Online federated learning (OFL) is a promising framework to collaboratively learn a sequence of non-linear functions (or models) from distributed streaming data.
We propose a communication-efficient OFL algorithm (named OFedQIT) by means of quantization and intermittent transmission.
Our analysis reveals that OFedQIT successfully addresses the drawbacks of OFedAvg while maintaining superior learning accuracy.
arXiv Detail & Related papers (2022-05-13T07:46:43Z)
- Training Recommender Systems at Scale: Communication-Efficient Model and
Data Parallelism [56.78673028601739]
We propose a compression framework called Dynamic Communication Thresholding (DCT) for communication-efficient hybrid training.
DCT reduces communication by at least 100x and 20x during data parallelism (DP) and model parallelism (MP), respectively.
It improves end-to-end training time for a state-of-the-art industrial recommender model by 37%, without any loss in performance.
arXiv Detail & Related papers (2020-10-18T01:44:42Z)
- Domain-specific Communication Optimization for Distributed DNN Training [10.781867496460837]
We present DLCP, a novel solution exploiting the domain-specific properties of deep learning to optimize communication overhead of DNN training in a fine-grained manner.
It exploits the bounded loss tolerance of SGD-based training to improve tail communication latency, which cannot be avoided purely through gradient compression (a generic sketch of this idea appears after the list).
It then performs fine-grained packet-level prioritization and dropping, as opposed to flow-level scheduling, based on layers and magnitudes of gradients to further speedup model convergence without affecting accuracy.
arXiv Detail & Related papers (2020-08-16T09:53:21Z)
- Straggler-aware Distributed Learning: Communication Computation Latency
Trade-off [56.08535873173518]
Straggling workers can be tolerated by assigning redundant computations and coding across data and computations.
In most existing schemes, each non-straggling worker transmits one message per iteration to the parameter server (PS) after completing all its computations.
Imposing such a limitation results in two main drawbacks: over-computation due to inaccurate prediction of the straggling behaviour, and under-utilization due to treating workers strictly as stragglers or non-stragglers.
arXiv Detail & Related papers (2020-04-10T08:39:36Z)
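Several of the related papers above, notably DLCP's packet-level gradient dropping and DCT's communication thresholding, rely on the bounded loss tolerance of SGD referenced in the DLCP entry. The sketch below is a generic NumPy illustration of that shared idea, dropping a small, magnitude-ranked fraction of gradient entries under a fixed budget; the budget value, selection rule, and function names are assumptions for illustration and not those papers' actual algorithms.
```python
# Generic illustration of bounded-loss-tolerant gradient dropping.
# Assumptions (not from DLCP or DCT): a 5% drop budget and a
# smallest-magnitude-first selection rule.
import numpy as np


def drop_small_gradients(grad, drop_budget=0.05):
    """Zero out at most `drop_budget` of the gradient entries, choosing the
    smallest-magnitude ones, so the update stays close to the original."""
    k = int(len(grad) * drop_budget)
    if k == 0:
        return grad.copy()
    kept = grad.copy()
    # Indices of the k smallest-magnitude entries (order among them unimportant).
    idx = np.argpartition(np.abs(grad), k)[:k]
    kept[idx] = 0.0
    return kept


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    g = rng.standard_normal(10_000)
    g_dropped = drop_small_gradients(g, drop_budget=0.05)
    # Dropping the smallest 5% of entries perturbs the update only slightly,
    # which is why a bounded amount of loss can be tolerated without hurting accuracy.
    print("fraction zeroed:", np.mean(g_dropped == 0.0))
    print("relative error :", np.linalg.norm(g - g_dropped) / np.linalg.norm(g))
```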