Related papers: GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters

URL: http://arxiv.org/abs/2308.04905v1
Date: Wed, 9 Aug 2023 12:04:41 GMT
Title: GraphCC: A Practical Graph Learning-based Approach to Congestion Control in Datacenters
Authors: Guillermo Bernárdez, José Suárez-Varela, Xiang Shi, Shihan Xiao, Xiangle Cheng, Pere Barlet-Ros, Albert Cabellos-Aparicio
Abstract summary: Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCN). This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization.
Score: 6.47712691414707
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Congestion Control (CC) plays a fundamental role in optimizing traffic in Data Center Networks (DCN). Currently, DCNs mainly implement two main CC protocols: DCTCP and DCQCN. Both protocols -- and their main variants -- are based on Explicit Congestion Notification (ECN), where intermediate switches mark packets when they detect congestion. The ECN configuration is thus a crucial aspect on the performance of CC protocols. Nowadays, network experts set static ECN parameters carefully selected to optimize the average network performance. However, today's high-speed DCNs experience quick and abrupt changes that severely change the network state (e.g., dynamic traffic workloads, incast events, failures). This leads to under-utilization and sub-optimal performance. This paper presents GraphCC, a novel Machine Learning-based framework for in-network CC optimization. Our distributed solution relies on a novel combination of Multi-agent Reinforcement Learning (MARL) and Graph Neural Networks (GNN), and it is compatible with widely deployed ECN-based CC protocols. GraphCC deploys distributed agents on switches that communicate with their neighbors to cooperate and optimize the global ECN configuration. In our evaluation, we test the performance of GraphCC under a wide variety of scenarios, focusing on the capability of this solution to adapt to new scenarios unseen during training (e.g., new traffic workloads, failures, upgrades). We compare GraphCC with a state-of-the-art MARL-based solution for ECN tuning -- ACC -- and observe that our proposed solution outperforms the state-of-the-art baseline in all of the evaluation scenarios, showing improvements up to $20\%$ in Flow Completion Time as well as significant reductions in buffer occupancy ($38.0-85.7\%$).

Related papers

TrafficKAN-GCN: Graph Convolutional-based Kolmogorov-Arnold Network for Traffic Flow Optimization [21.65543843942033]
TrafficKAN-GCN is a hybrid deep learning framework combining Kolmogorov-Arnold Networks (KAN) with Graph Convolutional Networks (GCN). We evaluate the proposed framework using real-world traffic data from the Baltimore Metropolitan area. Our experiments highlight the framework's ability to redistribute traffic flow, mitigate congestion, and adapt to disruptive events, such as the Francis Scott Key Bridge collapse.
arXiv Detail & Related papers (2025-03-05T08:59:06Z)
ReInc: Scaling Training of Dynamic Graph Neural Networks [6.1592549031654364]
ReInc is a system designed to enable efficient and scalable training of Dynamic Graph Neural Networks (DGNNs) on large-scale graphs. We introduce key innovations that capitalize on the unique combination of Graph Neural Networks (GNNs) and Recurrent Neural Networks (RNNs) inherent in DGNNs.
arXiv Detail & Related papers (2025-01-25T23:16:03Z)
FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency. We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs). We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
AdaRC: Mitigating Graph Structure Shifts during Test-Time [66.40525136929398]
Test-time adaptation (TTA) has attracted attention due to its ability to adapt a pre-trained model to a target domain without re-accessing the source domain. We propose AdaRC, an innovative framework designed for effective and efficient adaptation to structure shifts in graphs.
arXiv Detail & Related papers (2024-10-09T15:15:40Z)
FG-SAT: Efficient Flow Graph for Encrypted Traffic Classification under Environment Shifts [19.76017462160707]
Encrypted traffic classification plays a critical role in network security and management. Existing methods fail to recognize the critical link between transport layer mechanisms and applications. We propose FG-SAT, the first end-to-end method for encrypted traffic analysis under environment shifts.
arXiv Detail & Related papers (2024-08-26T09:11:36Z)
ELGC-Net: Efficient Local-Global Context Aggregation for Remote Sensing Change Detection [65.59969454655996]
We propose an efficient change detection framework, ELGC-Net, which leverages rich contextual information to precisely estimate change regions. Our proposed ELGC-Net sets a new state-of-the-art performance in remote sensing change detection benchmarks. We also introduce ELGC-Net-LW, a lighter variant with significantly reduced computational complexity, suitable for resource-constrained settings.
arXiv Detail & Related papers (2024-03-26T17:46:25Z)
Learning to Sail Dynamic Networks: The MARLIN Reinforcement Learning Framework for Congestion Control in Tactical Environments [53.08686495706487]
This paper proposes an RL framework that leverages an accurate and parallelizable emulation environment to reenact the conditions of a tactical network. We evaluate our RL learning framework by training a MARLIN agent in conditions replicating a bottleneck link transition between a Satellite Communication (SATCOM) and a UHF Wide Band (UHF) radio link.
arXiv Detail & Related papers (2023-06-27T16:15:15Z)
MARLIN: Soft Actor-Critic based Reinforcement Learning for Congestion Control in Real Networks [63.24965775030673]
We propose a novel Reinforcement Learning (RL) approach to design generic Congestion Control (CC) algorithms. Our solution, MARLIN, uses the Soft Actor-Critic algorithm to maximize both entropy and return. We trained MARLIN on a real network with varying background traffic patterns to overcome the sim-to-real mismatch.
arXiv Detail & Related papers (2023-02-02T18:27:20Z)
Implementing Reinforcement Learning Datacenter Congestion Control in NVIDIA NICs [64.26714148634228]
Congestion control (CC) algorithms become extremely difficult to design. It is currently not possible to deploy AI models on network devices due to their limited computational capabilities. We build a computationally-light solution based on a recent reinforcement learning CC algorithm.
arXiv Detail & Related papers (2022-07-05T20:42:24Z)
IMDeception: Grouped Information Distilling Super-Resolution Network [7.6146285961466]
Single-Image-Super-Resolution (SISR) is a classical computer vision problem that has benefited from the recent advancements in deep learning methods. In this work, we propose the Global Progressive Refinement Module (GPRM) as a less parameter-demanding alternative to the IIC module for feature aggregation. We also propose Grouped Information Distilling Blocks (GIDB) to further decrease the number of parameters and floating point operations per second (FLOPS). Experiments reveal that the proposed network performs on par with state-of-the-art models despite having a limited number of parameters and FLOPS.
arXiv Detail & Related papers (2022-04-25T06:43:45Z)
Reinforcement Learning for Datacenter Congestion Control [50.225885814524304]
Successful congestion control algorithms can dramatically improve latency and overall network throughput. Until today, no such learning-based algorithms have shown practical potential in this domain. We devise an RL-based algorithm with the aim of generalizing to different configurations of real-world datacenter networks. We show that this scheme outperforms alternative popular RL approaches, and generalizes to scenarios that were not seen during training.
arXiv Detail & Related papers (2021-02-18T13:49:28Z)
Caramel: Accelerating Decentralized Distributed Deep Learning with Computation Scheduling [1.5785002371773138]
Caramel is a system that accelerates distributed deep learning through model-aware scheduling and communication optimizations for AllReduce. Caramel maintains the correctness of the dataflow model, is hardware-independent, and does not require any user-level or framework-level changes.
arXiv Detail & Related papers (2020-04-29T08:32:33Z)
Decentralized SGD with Over-the-Air Computation [13.159777131162961]
We study the performance of decentralized stochastic gradient descent (DSGD) in a wireless network. We assume that transmissions are prone to additive noise and interference. We show that the OAC-MAC scheme attains better convergence performance with fewer communication rounds.
arXiv Detail & Related papers (2020-03-06T15:33:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.