Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
- URL: http://arxiv.org/abs/2207.10898v1
- Date: Fri, 22 Jul 2022 06:29:17 GMT
- Title: Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
- Authors: Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna
- Abstract summary: We analyze some of the SOTA RoCE congestion control schemes vs. PFC when running on distributed training platforms.
Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads.
- Score: 7.573461420853252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RDMA over Converged Ethernet (RoCE) has gained significant traction in
datacenter networks due to its compatibility with conventional Ethernet-based
fabrics. However, the RDMA protocol is efficient only on (nearly) lossless
networks, making congestion control vital for RoCE networks. Unfortunately, the
native RoCE congestion control scheme, based on Priority Flow Control (PFC),
suffers from drawbacks such as unfairness, head-of-line blocking, and deadlock.
In recent years, therefore, many schemes have been proposed to provide
additional congestion control for RoCE networks and minimize the drawbacks of
PFC. These schemes, however, target general datacenter environments. In contrast
to general datacenters, which are built from commodity hardware and run
general-purpose workloads, high-performance distributed training platforms
deploy high-end accelerators and network components and exclusively run training
workloads that communicate through collective operations (All-Reduce,
All-To-All) issued by communication libraries. Furthermore, these platforms
usually have a private network, separating their communication traffic from the
rest of the datacenter traffic. Scalable topology-aware collective algorithms
are inherently designed to avoid incast patterns and balance traffic optimally.
These distinct features necessitate revisiting congestion control schemes that
were proposed for general-purpose datacenter environments. In this paper, we
thoroughly analyze several state-of-the-art (SOTA) RoCE congestion control
schemes against PFC when running on distributed training platforms. Our results
indicate that previously proposed RoCE congestion control schemes have little
impact on the end-to-end performance of training workloads, motivating the need
for an optimized, yet low-overhead, congestion control scheme designed around
the characteristics of distributed training platforms and workloads.
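The abstract's claim that topology-aware collectives avoid incast and balance traffic can be made concrete with the textbook ring All-Reduce schedule: at every step each node sends to exactly one ring neighbor and receives from exactly one, so no link or NIC ever sees fan-in, and the buffer is split into equal chunks so per-link load stays uniform. The sketch below is illustrative only, assuming the standard ring algorithm; the function name and chunk indexing are not taken from the paper or from any specific collective library.

```python
# Minimal sketch of a ring All-Reduce schedule (standard textbook algorithm,
# not the paper's simulation setup). It illustrates why such collectives avoid
# incast: every step is a perfect sender/receiver permutation around the ring.

def ring_allreduce_schedule(num_nodes: int):
    """Return per-step (sender, receiver, chunk) transfers of a ring All-Reduce.

    The buffer is split into `num_nodes` equal chunks and the algorithm runs
    2 * (num_nodes - 1) steps: reduce-scatter followed by all-gather.
    """
    steps = []
    # Reduce-scatter: after these steps, rank r holds the fully reduced chunk (r + 1) % N.
    for step in range(num_nodes - 1):
        transfers = [(rank, (rank + 1) % num_nodes, (rank - step) % num_nodes)
                     for rank in range(num_nodes)]
        steps.append(("reduce-scatter", transfers))
    # All-gather: circulate the reduced chunks until every rank has all of them.
    for step in range(num_nodes - 1):
        transfers = [(rank, (rank + 1) % num_nodes, (rank + 1 - step) % num_nodes)
                     for rank in range(num_nodes)]
        steps.append(("all-gather", transfers))
    return steps


if __name__ == "__main__":
    for phase, transfers in ring_allreduce_schedule(4):
        receivers = [dst for _, dst, _ in transfers]
        # Each rank appears exactly once as a receiver per step: no incast.
        assert len(set(receivers)) == len(receivers)
        print(phase, transfers)
```

Under this schedule each node sends and receives one chunk of size (total size) / N per step for 2 * (N - 1) steps, so per-link traffic stays uniform and the fan-in patterns that general-purpose datacenter congestion control schemes are built to handle rarely arise, which is consistent with the paper's motivation for revisiting those schemes.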
Related papers
- Communication-Control Codesign for Large-Scale Wireless Networked Control Systems [80.30532872347668]
Wireless Networked Control Systems (WNCSs) are essential to Industry 4.0, enabling flexible control in applications such as drone swarms and autonomous robots.
We propose a practical WNCS model that captures correlated dynamics among multiple control loops with spatially distributed sensors and actuators sharing limited wireless resources over multi-state Markov block-fading channels.
We develop a Deep Reinforcement Learning (DRL) algorithm that efficiently handles the hybrid action space, captures communication-control correlations, and ensures robust training despite sparse cross-domain variables and floating control inputs.
arXiv Detail & Related papers (2024-10-15T06:28:21Z)
- Constrained Reinforcement Learning for Adaptive Controller Synchronization in Distributed SDN [7.277944770202078]
This work focuses on examining deep reinforcement learning (DRL) techniques, encompassing both value-based and policy-based methods, to guarantee an upper latency threshold for AR/VR task offloading.
Our evaluation results indicate that while value-based methods excel in optimizing individual network metrics such as latency or load balancing, policy-based approaches exhibit greater robustness in adapting to sudden network changes or reconfiguration.
arXiv Detail & Related papers (2024-01-21T21:57:22Z)
- Prioritising Interactive Flows in Data Center Networks With Central Control [0.0]
We deal with two problems relating to central-controller-assisted prioritization of interactive flows in data center networks.
In the first part of the thesis, we deal with the problem of congestion control in a software defined network.
We propose a framework, where the controller with its global view of the network actively participates in the congestion control decisions of the end TCP hosts.
arXiv Detail & Related papers (2023-10-27T07:15:15Z)
- GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters [6.47712691414707]
Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCNs).
This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization.
arXiv Detail & Related papers (2023-08-09T12:04:41Z)
- A Deep Reinforcement Learning Framework for Optimizing Congestion Control in Data Centers [2.310582065745938]
Various congestion control protocols have been designed to achieve high performance in different network environments.
Modern online learning solutions that delegate the congestion control actions to a machine cannot properly converge in the stringent time scales of data centers.
We leverage multiagent reinforcement learning to design a system for dynamic tuning of congestion control parameters at end-hosts in a data center.
arXiv Detail & Related papers (2023-01-29T22:08:35Z)
- Fair and Efficient Distributed Edge Learning with Hybrid Multipath TCP [62.81300791178381]
The bottleneck of distributed edge learning over wireless has shifted from computing to communication.
Existing TCP-based data networking schemes for DEL are application-agnostic and fail to deliver adjustments according to application layer requirements.
We develop a hybrid multipath TCP (MPTCP) for DEL by combining model-based and deep reinforcement learning (DRL)-based MPTCP.
arXiv Detail & Related papers (2022-11-03T09:08:30Z)
- Machine Learning-Based User Scheduling in Integrated Satellite-HAPS-Ground Networks [82.58968700765783]
Integrated space-air-ground networks promise to offer a valuable solution space for empowering the sixth generation of communication networks (6G).
This paper showcases the prospects of machine learning in the context of user scheduling in integrated space-air-ground communications.
arXiv Detail & Related papers (2022-05-27T13:09:29Z)
- Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput.
To date, no such learning-based algorithms have shown practical potential in this domain.
We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks.
We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
- Decentralized Control with Graph Neural Networks [147.84766857793247]
We propose a novel framework using graph neural networks (GNNs) to learn decentralized controllers.
GNNs are well-suited for the task since they are naturally distributed architectures and exhibit good scalability and transferability properties.
The problems of flocking and multi-agent path planning are explored to illustrate the potential of GNNs in learning decentralized controllers.
arXiv Detail & Related papers (2020-12-29T18:59:14Z)
- CFR-RL: Traffic Engineering with Reinforcement Learning in SDN [5.718975715943091]
We propose CFR-RL, a reinforcement learning-based scheme that automatically learns a policy to select critical flows for each given traffic matrix.
CFR-RL achieves near-optimal performance by rerouting only 10%-21.3% of total traffic.
arXiv Detail & Related papers (2020-04-24T20:46:54Z)
- Decentralized Learning for Channel Allocation in IoT Networks over Unlicensed Bandwidth as a Contextual Multi-player Multi-armed Bandit Game [134.88020946767404]
We study a decentralized channel allocation problem in an ad-hoc Internet of Things network underlaid on spectrum licensed to a primary cellular network.
Our study maps this problem into a contextual multi-player, multi-armed bandit game, and proposes a purely decentralized, three-stage policy learning algorithm through trial-and-error.
arXiv Detail & Related papers (2020-03-30T10:05:35Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.