Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
- URL: http://arxiv.org/abs/2207.10898v1
- Date: Fri, 22 Jul 2022 06:29:17 GMT
- Title: Impact of RoCE Congestion Control Policies on Distributed Training of DNNs
- Authors: Tarannum Khan, Saeed Rashidi, Srinivas Sridharan, Pallavi Shurpali, Aditya Akella, Tushar Krishna
- Abstract summary: We analyze some of the SOTA RoCE congestion control schemes vs. PFC when running on distributed training platforms.
Our results indicate that previously proposed RoCE congestion control schemes have little impact on the end-to-end performance of training workloads.
- Score: 7.573461420853252
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RDMA over Converged Ethernet (RoCE) has gained significant traction in
datacenter networks due to its compatibility with conventional Ethernet-based
fabrics. However, the RDMA protocol is efficient only on (nearly) lossless
networks, making congestion control vital for RoCE networks. Unfortunately, the
native RoCE congestion control scheme, based on Priority Flow Control (PFC),
suffers from drawbacks such as unfairness, head-of-line blocking, and deadlock.
In recent years, therefore, many schemes have been proposed to provide
additional congestion control for RoCE networks and minimize the drawbacks of
PFC. These schemes, however, target general datacenter environments. In contrast
to general datacenters, which are built from commodity hardware and run
general-purpose workloads, high-performance distributed training platforms
deploy high-end accelerators and network components and exclusively run training
workloads that communicate through collective operations (All-Reduce,
All-To-All) issued by communication libraries. Furthermore, these platforms
usually have a private network, separating their communication traffic from the
rest of the datacenter traffic. Scalable topology-aware collective algorithms
are inherently designed to avoid incast patterns and balance traffic optimally.
These distinct features necessitate revisiting congestion control schemes that
were proposed for general-purpose datacenter environments. In this paper, we
thoroughly analyze several state-of-the-art (SOTA) RoCE congestion control
schemes against PFC when running on distributed training platforms. Our results
indicate that previously proposed RoCE congestion control schemes have little
impact on the end-to-end performance of training workloads, motivating the need
for an optimized, yet low-overhead, congestion control scheme designed around
the characteristics of distributed training platforms and workloads.
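The abstract's claim that topology-aware collectives avoid incast and balance traffic can be made concrete with the textbook ring All-Reduce schedule: at every step each node sends to exactly one ring neighbor and receives from exactly one, so no link or NIC ever sees fan-in, and the buffer is split into equal chunks so per-link load stays uniform. The sketch below is illustrative only, assuming the standard ring algorithm; the function name and chunk indexing are not taken from the paper or from any specific collective library.

```python
# Minimal sketch of a ring All-Reduce schedule (standard textbook algorithm,
# not the paper's simulation setup). It illustrates why such collectives avoid
# incast: every step is a perfect sender/receiver permutation around the ring.

def ring_allreduce_schedule(num_nodes: int):
    """Return per-step (sender, receiver, chunk) transfers of a ring All-Reduce.

    The buffer is split into `num_nodes` equal chunks and the algorithm runs
    2 * (num_nodes - 1) steps: reduce-scatter followed by all-gather.
    """
    steps = []
    # Reduce-scatter: after these steps, rank r holds the fully reduced chunk (r + 1) % N.
    for step in range(num_nodes - 1):
        transfers = [(rank, (rank + 1) % num_nodes, (rank - step) % num_nodes)
                     for rank in range(num_nodes)]
        steps.append(("reduce-scatter", transfers))
    # All-gather: circulate the reduced chunks until every rank has all of them.
    for step in range(num_nodes - 1):
        transfers = [(rank, (rank + 1) % num_nodes, (rank + 1 - step) % num_nodes)
                     for rank in range(num_nodes)]
        steps.append(("all-gather", transfers))
    return steps


if __name__ == "__main__":
    for phase, transfers in ring_allreduce_schedule(4):
        receivers = [dst for _, dst, _ in transfers]
        # Each rank appears exactly once as a receiver per step: no incast.
        assert len(set(receivers)) == len(receivers)
        print(phase, transfers)
```

Under this schedule each node sends and receives one chunk of size (total size) / N per step for 2 * (N - 1) steps, so per-link traffic stays uniform and the fan-in patterns that general-purpose datacenter congestion control schemes are built to handle rarely arise, which is consistent with the paper's motivation for revisiting those schemes.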
Related papers
- Communication-Control Codesign for Large-Scale Wireless Networked Control Systems [80.30532872347668]
Wireless Networked Control Systems (WNCSs) are essential to Industry 4.0, enabling flexible control in applications such as drone swarms and autonomous robots.
We propose a practical WNCS model that captures correlated dynamics among multiple control loops with spatially distributed sensors and actuators sharing limited wireless resources over multi-state Markov block-fading channels.
We develop a Deep Reinforcement Learning (DRL) algorithm that efficiently handles the hybrid action space, captures communication-control correlations, and ensures robust training despite sparse cross-domain variables and floating control inputs.
arXiv Detail & Related papers (2024-10-15T06:28:21Z)
- Constrained Reinforcement Learning for Adaptive Controller Synchronization in Distributed SDN [7.277944770202078]
This work focuses on examining deep reinforcement learning (DRL) techniques, encompassing both value-based and policy-based methods, to guarantee an upper latency threshold for AR/VR task offloading.
Our evaluation results indicate that while value-based methods excel in optimizing individual network metrics such as latency or load balancing, policy-based approaches exhibit greater robustness in adapting to sudden network changes or reconfiguration.
arXiv Detail & Related papers (2024-01-21T21:57:22Z)
- Prioritising Interactive Flows in Data Center Networks With Central Control [0.0]
We deal with two problems relating to central-controller-assisted prioritization of interactive flows in data center networks.
In the first part of the thesis, we deal with the problem of congestion control in a software defined network.
We propose a framework, where the controller with its global view of the network actively participates in the congestion control decisions of the end TCP hosts.
arXiv Detail & Related papers (2023-10-27T07:15:15Z)
- GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters [6.47712691414707]
Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCNs).
This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization.
arXiv Detail & Related papers (2023-08-09T12:04:41Z)
- A Deep Reinforcement Learning Framework for Optimizing Congestion Control in Data Centers [2.310582065745938]
Various congestion control protocols have been designed to achieve high performance in different network environments.
Modern online learning solutions that delegate the congestion control actions to a machine cannot properly converge in the stringent time scales of data centers.
We leverage multiagent reinforcement learning to design a system for dynamic tuning of congestion control parameters at end-hosts in a data center.
arXiv Detail & Related papers (2023-01-29T22:08:35Z)
- Fair and Efficient Distributed Edge Learning with Hybrid Multipath TCP [62.81300791178381]
The bottleneck of distributed edge learning over wireless has shifted from computing to communication.
Existing TCP-based data networking schemes for DEL are application-agnostic and fail to deliver adjustments according to application layer requirements.
We develop a hybrid multipath TCP (MPTCP) for DEL by combining model-based and deep reinforcement learning (DRL)-based MPTCP.
arXiv Detail & Related papers (2022-11-03T09:08:30Z)
- Machine Learning-Based User Scheduling in Integrated Satellite-HAPS-Ground Networks [82.58968700765783]
Integrated space-air-ground networks promise to offer a valuable solution space for empowering the sixth generation of communication networks (6G).
This paper showcases the prospects of machine learning in the context of user scheduling in integrated space-air-ground communications.
arXiv Detail & Related papers (2022-05-27T13:09:29Z)
- Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput.
To date, no such learning-based algorithms have shown practical potential in this domain.
We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks.
We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
- Decentralized Control with Graph Neural Networks [147.84766857793247]
We propose a novel framework using graph neural networks (GNNs) to learn decentralized controllers.
GNNs are well-suited for the task since they are naturally distributed architectures and exhibit good scalability and transferability properties.
The problems of flocking and multi-agent path planning are explored to illustrate the potential of GNNs in learning decentralized controllers.
arXiv Detail & Related papers (2020-12-29T18:59:14Z)
- CFR-RL: Traffic Engineering with Reinforcement Learning in SDN [5.718975715943091]
We propose CFR-RL, a reinforcement learning-based scheme that automatically learns a policy to select critical flows for each given traffic matrix.
CFR-RL achieves near-optimal performance by rerouting only 10%-21.3% of total traffic.
arXiv Detail & Related papers (2020-04-24T20:46:54Z)
- Decentralized Learning for Channel Allocation in IoT Networks over Unlicensed Bandwidth as a Contextual Multi-player Multi-armed Bandit Game [134.88020946767404]
We study a decentralized channel allocation problem in an ad-hoc Internet of Things network underlaid on spectrum licensed to a primary cellular network.
Our study maps this problem into a contextual multi-player, multi-armed bandit game, and proposes a purely decentralized, three-stage policy learning algorithm through trial-and-error.
arXiv Detail & Related papers (2020-03-30T10:05:35Z)
This list is automatically generated from the titles and abstracts of the papers indexed on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.