GraphCC: A Practical Graph Learning-based Approach to Congestion Control
in Datacenters
- URL: http://arxiv.org/abs/2308.04905v1
- Date: Wed, 9 Aug 2023 12:04:41 GMT
- Title: GraphCC: A Practical Graph Learning-based Approach to Congestion Control
in Datacenters
- Authors: Guillermo Bernárdez, José Suárez-Varela, Xiang Shi, Shihan Xiao,
Xiangle Cheng, Pere Barlet-Ros, Albert Cabellos-Aparicio
- Abstract summary: Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCN).
This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization.
- Score: 6.47712691414707
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Congestion Control (CC) plays a fundamental role in optimizing traffic in
Data Center Networks (DCN). Currently, DCNs mainly implement two CC
protocols: DCTCP and DCQCN. Both protocols -- and their main variants -- are
based on Explicit Congestion Notification (ECN), where intermediate switches
mark packets when they detect congestion. The ECN configuration is thus a
crucial factor in the performance of CC protocols. Nowadays, network experts
set static ECN parameters carefully selected to optimize the average network
performance. However, today's high-speed DCNs experience quick and abrupt
changes that severely alter the network state (e.g., dynamic traffic
workloads, incast events, failures). This leads to under-utilization and
sub-optimal performance. This paper presents GraphCC, a novel Machine
Learning-based framework for in-network CC optimization. Our distributed
solution relies on a novel combination of Multi-agent Reinforcement Learning
(MARL) and Graph Neural Networks (GNN), and it is compatible with widely
deployed ECN-based CC protocols. GraphCC deploys distributed agents on switches
that communicate with their neighbors to cooperate and optimize the global ECN
configuration. In our evaluation, we test the performance of GraphCC under a
wide variety of scenarios, focusing on the capability of this solution to adapt
to new scenarios unseen during training (e.g., new traffic workloads, failures,
upgrades). We compare GraphCC with a state-of-the-art MARL-based solution for
ECN tuning -- ACC -- and observe that our proposed solution outperforms the
state-of-the-art baseline in all of the evaluation scenarios, showing
improvements of up to 20% in Flow Completion Time as well as significant
reductions in buffer occupancy (38.0-85.7%).
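To make "static ECN parameters" concrete, here is a minimal sketch of a RED-style ECN marking rule of the kind commonly paired with DCQCN. The abstract does not specify the marking function, so the function name, the parameter names (`k_min_kb`, `k_max_kb`, `p_max`), and the example values are illustrative assumptions, not details from the paper. With a static configuration these three values stay fixed; an approach such as GraphCC would instead adjust them per switch at runtime.

```python
def ecn_mark_probability(queue_len_kb: float,
                         k_min_kb: float,
                         k_max_kb: float,
                         p_max: float) -> float:
    """Probability that a switch ECN-marks an arriving packet.

    Assumed RED-style rule (hypothetical parameter names):
      - queue at or below k_min_kb: never mark
      - queue at or above k_max_kb: always mark
      - in between: probability rises linearly from 0 to p_max
    """
    if not (0 <= k_min_kb < k_max_kb):
        raise ValueError("require 0 <= k_min_kb < k_max_kb")
    if not (0.0 <= p_max <= 1.0):
        raise ValueError("p_max must be in [0, 1]")

    if queue_len_kb <= k_min_kb:
        return 0.0
    if queue_len_kb >= k_max_kb:
        return 1.0
    # Linear interpolation between the two thresholds.
    return p_max * (queue_len_kb - k_min_kb) / (k_max_kb - k_min_kb)


if __name__ == "__main__":
    # Illustrative static configuration (values are assumptions).
    config = {"k_min_kb": 10.0, "k_max_kb": 100.0, "p_max": 0.2}
    for queue_len in (5.0, 55.0, 120.0):
        print(queue_len, ecn_mark_probability(queue_len, **config))
```

Lower thresholds mark earlier, which keeps buffers short but can cost throughput. Higher thresholds do the opposite. This trade-off is why a single static setting tends to fit only the average traffic pattern.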
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- AdaRC: Mitigating Graph Structure Shifts during Test-Time [66.40525136929398]
Test-time adaptation (TTA) has attracted attention due to its ability to adapt a pre-trained model to a target domain without re-accessing the source domain.
We propose AdaRC, an innovative framework designed for effective and efficient adaptation to structure shifts in graphs.
arXiv Detail & Related papers (2024-10-09T15:15:40Z)
- FG-SAT: Efficient Flow Graph for Encrypted Traffic Classification under Environment Shifts [19.76017462160707]
Encrypted traffic classification plays a critical role in network security and management.
Existing methods fail to recognize the critical link between transport layer mechanisms and applications.
We propose FG-SAT, the first end-to-end method for encrypted traffic analysis under environment shifts.
arXiv Detail & Related papers (2024-08-26T09:11:36Z)
- ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions.
Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks.
We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
- MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms.
Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return.
We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
arXiv Detail & Related papers (2023-02-02T18:27:20Z)
- Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms become extremely difficult to design.
It is currently not possible to deploy AI models on network devices due to their limited computational capabilities.
We build a computationally-light solution based on a recent reinforcement learning CC algorithm.
arXiv Detail & Related papers (2022-07-05T20:42:24Z)
- IMDeception: Grouped Information Distilling Super-Resolution Network [7.6146285961466]
Single-Image-Super-Resolution (SISR) is a classical computer vision problem that has benefited from the recent advancements in deep learning methods.
In this work, we propose the Global Progressive Refinement Module (GPRM) as a less parameter-demanding alternative to the IIC module for feature aggregation.
We also propose Grouped Information Distilling Blocks (GIDB) to further decrease the number of parameters and floating point operations per second (FLOPS).
Experiments reveal that the proposed network performs on par with state-of-the-art models despite having a limited number of parameters and FLOPS.
arXiv Detail & Related papers (2022-04-25T06:43:45Z)
- Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput.
Until today, no such learning-based algorithms have shown practical potential in this domain.
We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks.
We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
- Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling [1.5785002371773138]
Caramel is a system that accelerates distributed deep learning through model-aware scheduling and communication optimizations for AllReduce.
Caramel maintains the correctness of the dataflow model, is hardware-independent, and does not require any user-level or framework-level changes.
arXiv Detail & Related papers (2020-04-29T08:32:33Z)
- Decentralized SGD with Over-the-Air Computation [13.159777131162961]
We study the performance of decentralized stochastic gradient descent (DSGD) in a wireless network.
We assume that transmissions are prone to additive noise and interference.
We show that the OAC-MAC scheme attains better convergence performance with fewer communication rounds.
arXiv Detail & Related papers (2020-03-06T15:33:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.