I've Got 99 Problems But FLOPS Ain't One
- URL: http://arxiv.org/abs/2407.12819v2
- Date: Wed, 23 Oct 2024 14:00:36 GMT
- Title: I've Got 99 Problems But FLOPS Ain't One
- Authors: Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, Costin Raiciu
- Abstract summary: We take an unconventional approach to find relevant research directions, starting from public plans to build a $100 billion datacenter for machine learning applications.
We discover what workloads such a datacenter might carry and explore the challenges one may encounter in doing so, with a focus on networking research.
We conclude that building the datacenter and training such models is technically possible, but this requires novel wide-area transports for inter-DC communication, a multipath transport and novel datacenter topologies.
- Score: 70.3084616806354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Hyperscalers dominate the landscape of large network deployments, yet they rarely share data or insights about the challenges they face. In light of this supremacy, what problems can we find to solve in this space? We take an unconventional approach to find relevant research directions, starting from public plans to build a $100 billion datacenter for machine learning applications. Leveraging language model scaling laws, we discover what workloads such a datacenter might carry and explore the challenges one may encounter in doing so, with a focus on networking research. We conclude that building the datacenter and training such models is technically possible, but that it requires novel wide-area transports for inter-DC communication, a multipath transport and novel datacenter topologies for intra-datacenter communication, and high-speed scale-up networks and transports, outlining a rich research agenda for the networking community.
Related papers
- Generalizability of Graph Neural Networks for Decentralized Unlabeled Motion Planning [72.86540018081531]
Unlabeled motion planning involves assigning a set of robots to target locations while ensuring collision avoidance.
This problem forms an essential building block for multi-robot systems in applications such as exploration, surveillance, and transportation.
We address this problem in a decentralized setting where each robot knows only the positions of its $k$-nearest robots and $k$-nearest targets.
arXiv Detail & Related papers (2024-09-29T23:57:25Z)
- Prioritising Interactive Flows in Data Center Networks With Central Control [0.0]
We deal with two problems relating to central-controller-assisted prioritization of interactive flows in data center networks.
In the first part of the thesis, we deal with the problem of congestion control in a software defined network.
We propose a framework in which the controller, with its global view of the network, actively participates in the congestion control decisions of the end TCP hosts.
arXiv Detail & Related papers (2023-10-27T07:15:15Z)
- Machine Learning-Based User Scheduling in Integrated Satellite-HAPS-Ground Networks [82.58968700765783]
Integrated space-air-ground networks promise to offer a valuable solution space for empowering the sixth generation of communication networks (6G).
This paper showcases the prospects of machine learning in the context of user scheduling in integrated space-air-ground communications.
arXiv Detail & Related papers (2022-05-27T13:09:29Z)
- A review of Federated Learning in Intrusion Detection Systems for IoT [0.15469452301122172]
Intrusion detection systems are evolving into intelligent systems that perform data analysis searching for anomalies in their environment.
Deep learning technologies opened the door to build more complex and effective threat detection models.
Current approaches rely on powerful centralized servers that receive data from all their parties.
This paper focuses on the application of Federated Learning approaches in the field of Intrusion Detection.
arXiv Detail & Related papers (2022-04-26T17:00:07Z)
- Machine Learning Empowered Intelligent Data Center Networking: A Survey [35.55535885962517]
This paper comprehensively investigates the application of machine learning to data center networking.
It covers flow prediction, flow classification, load balancing, resource management, routing optimization, and congestion control.
We design a set of quality assessment criteria called REBEL-3S to impartially measure the strengths and weaknesses of these research works.
arXiv Detail & Related papers (2022-02-28T05:27:22Z)
- Learning Connectivity for Data Distribution in Robot Teams [96.39864514115136]
We propose a task-agnostic, decentralized, low-latency method for data distribution in ad-hoc networks using Graph Neural Networks (GNN).
Our approach enables multi-agent algorithms based on global state information to function by ensuring it is available at each robot.
We train the distributed GNN communication policies via reinforcement learning using the average Age of Information as the reward function and show that it improves training stability compared to task-specific reward functions.
arXiv Detail & Related papers (2021-03-08T21:48:55Z)
- Batch Exploration with Examples for Scalable Robotic Reinforcement Learning [63.552788688544254]
Batch Exploration with Examples (BEE) explores relevant regions of the state-space guided by a modest number of human-provided images of important states.
BEE is able to tackle challenging vision-based manipulation tasks both in simulation and on a real Franka robot.
arXiv Detail & Related papers (2020-10-22T17:49:25Z)
- Scalable Learning Paradigms for Data-Driven Wireless Communication [45.03425546213185]
We aim to provide a systematic discussion on the building blocks of scalable data-driven wireless networks.
On one hand, we discuss the forward-looking architecture and computing framework of scalable data-driven systems from a global perspective.
On the other hand, we discuss the learning algorithms and model training strategies performed at each individual node from a local perspective.
arXiv Detail & Related papers (2020-03-01T12:13:58Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address these open problems, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)