A Look Into Training Large Language Models on Next Generation Datacenters
- URL: http://arxiv.org/abs/2407.12819v1
- Date: Mon, 1 Jul 2024 10:33:46 GMT
- Title: A Look Into Training Large Language Models on Next Generation Datacenters
- Authors: Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, Costin Raiciu
- Abstract summary: We take an unconventional approach to finding relevant research directions, by starting from Microsoft's plans to build a $100 billion datacenter for ML.
Our goal is to understand what models could be trained in such a datacenter, as well as the high-level challenges one may encounter in doing so.
We conclude that building the datacenter and training such models is technically possible, but this requires a novel NIC-based multipath transport along with a redesign of the entire training stack.
- Score: 70.3084616806354
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Is it still worth doing computer networking research? What are relevant problems in this space given the supremacy of hyperscalers in deployed large networks? We take an unconventional approach to finding relevant research directions, by starting from Microsoft's plans to build a $100 billion datacenter for ML. Our goal is to understand what models could be trained in such a datacenter, as well as the high-level challenges one may encounter in doing so. We first examine the constraints imposed by cooling and power requirements for our target datacenter and find that it is infeasible to build in a single location. We use LLM scaling laws to determine that we could train models of 50T or 100T parameters. Finally, we examine how distributed training might work for these models, and what the networking requirements are. We conclude that building the datacenter and training such models is technically possible, but this requires a novel NIC-based multipath transport along with a redesign of the entire training stack, outlining a research agenda for our community in the near future.
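The abstract's model-size estimate comes from applying LLM scaling laws to the datacenter's aggregate compute budget. Below is a minimal back-of-envelope sketch of that kind of calculation in Python; the accelerator count, peak FLOP/s, MFU, training duration, and the Chinchilla-style C ≈ 6·N·D rule with D ≈ 20·N are all assumptions of this sketch, not figures taken from the paper.

```python
# Back-of-envelope sketch (not the paper's methodology): estimate the largest
# Chinchilla-optimal model a fixed compute budget could train.
# All numeric inputs below are hypothetical placeholders; the paper's own
# assumptions about accelerators, efficiency, and training time differ.

def total_training_flops(num_accelerators: float,
                         peak_flops_per_accel: float,
                         mfu: float,
                         training_days: float) -> float:
    """Aggregate training compute: accelerators x sustained FLOP/s x seconds."""
    seconds = training_days * 24 * 3600
    return num_accelerators * peak_flops_per_accel * mfu * seconds

def chinchilla_optimal_size(compute_flops: float,
                            tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Solve C = 6 * N * D with D = tokens_per_param * N for params N and tokens D."""
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

if __name__ == "__main__":
    # Hypothetical next-generation datacenter: 2M accelerators at 10 PFLOP/s
    # peak each, 50% model FLOP utilization, a 150-day training run.
    c = total_training_flops(2e6, 1e16, 0.5, 150)
    n, d = chinchilla_optimal_size(c)
    print(f"compute ~{c:.2e} FLOPs -> ~{n/1e12:.0f}T params, ~{d/1e12:.0f}T tokens")
```

With these placeholder inputs the estimate lands in the tens-of-trillions-of-parameters range, the same order of magnitude the abstract discusses; the paper's actual assumptions and conclusions should be taken from the paper itself.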
Related papers
- Pretraining Billion-scale Geospatial Foundational Models on Frontier [0.16492989697868893]
Foundation Models (FMs) are trained with internet-scale unlabeled data via self-supervised learning.
We investigate billion-scale FMs and HPC training profiles for geospatial applications by pretraining on publicly available data.
Our larger 3B-parameter model achieves up to a 30% improvement in top-1 scene classification accuracy.
arXiv Detail & Related papers (2024-04-17T19:16:32Z)
- Federated Fine-Tuning of LLMs on the Very Edge: The Good, the Bad, the Ugly [62.473245910234304]
This paper takes a hardware-centric approach to explore how Large Language Models can be brought to modern edge computing systems.
We provide a micro-level hardware benchmark, compare the model FLOP utilization to that of a state-of-the-art data center GPU, and study network utilization under realistic conditions (a rough MFU sketch follows this list).
arXiv Detail & Related papers (2023-10-04T20:27:20Z)
- How Can We Train Deep Learning Models Across Clouds and Continents? An Experimental Study [57.97785297481162]
We evaluate the cost and throughput implications of training in different zones, continents, and clouds for representative CV, NLP, and ASR models.
We show how leveraging spot pricing enables a new cost-efficient way to train models with multiple cheap instances, outperforming both more centralized, powerful hardware and even on-demand cloud offerings at competitive prices.
arXiv Detail & Related papers (2023-06-05T18:17:37Z)
- Decentralized Training of Foundation Models in Heterogeneous Environments [77.47261769795992]
Training foundation models, such as GPT-3 and PaLM, can be extremely expensive.
We present the first study of training large foundation models with model parallelism in a decentralized regime over a heterogeneous network.
arXiv Detail & Related papers (2022-06-02T20:19:51Z)
- The MIT Supercloud Workload Classification Challenge [10.458111248130944]
In this paper, we present a workload classification challenge based on the MIT Supercloud dataset.
The goal of this challenge is to foster algorithmic innovations in the analysis of compute workloads.
arXiv Detail & Related papers (2022-04-12T14:28:04Z)
- Decentralized Federated Learning Preserves Model and Data Privacy [77.454688257702]
We propose a fully decentralized approach that allows knowledge to be shared between trained models.
Students are trained on the output of their teachers via synthetically generated input data.
The results show that an untrained student model, trained on the teacher's output, reaches F1-scores comparable to those of the teacher.
arXiv Detail & Related papers (2021-02-01T14:38:54Z)
- Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers [6.56704851092678]
We analyze distributed training performance under diverse cluster configurations using CM-DARE.
Our empirical datasets include measurements from three GPU types, six geographic regions, twenty convolutional neural networks, and thousands of Google Cloud servers.
We also demonstrate the feasibility of predicting training speed and overhead using regression-based models.
arXiv Detail & Related papers (2020-04-07T01:49:58Z)
- Deep Learning for Ultra-Reliable and Low-Latency Communications in 6G Networks [84.2155885234293]
We first summarize how to apply data-driven supervised deep learning and deep reinforcement learning in URLLC.
To address open problems in this space, we develop a multi-level architecture that enables device intelligence, edge intelligence, and cloud intelligence for URLLC.
arXiv Detail & Related papers (2020-02-22T14:38:11Z)
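As a companion to the model-FLOP-utilization comparison mentioned in the federated fine-tuning entry above, here is a minimal sketch of how MFU is commonly computed for transformer training. The ~6·N FLOPs-per-token approximation and the example throughput, device-count, and peak-FLOP/s numbers are assumptions for illustration, not values reported by any of the papers listed.

```python
# Minimal sketch (illustrative assumptions, not figures from the papers above):
# model FLOP utilization (MFU) for transformer training, using the common
# ~6 * N FLOPs-per-token approximation for a model with N parameters.

def mfu(params: float,
        tokens_per_second: float,
        num_devices: int,
        peak_flops_per_device: float) -> float:
    """Achieved training FLOP/s divided by aggregate peak hardware FLOP/s."""
    achieved = 6.0 * params * tokens_per_second
    peak = num_devices * peak_flops_per_device
    return achieved / peak

if __name__ == "__main__":
    # Hypothetical 7B-parameter model on 8 datacenter GPUs (~1 PFLOP/s peak each)
    # versus the same model on a single edge device (~100 TFLOP/s peak).
    print(f"datacenter MFU: {mfu(7e9, 80_000, 8, 1e15):.2%}")
    print(f"edge MFU:       {mfu(7e9, 150, 1, 1e14):.2%}")
```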