Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
- URL: http://arxiv.org/abs/2504.20854v1
- Date: Tue, 29 Apr 2025 15:23:55 GMT
- Title: Towards Easy and Realistic Network Infrastructure Testing for Large-scale Machine Learning
- Authors: Jinsun Yoo, ChonLam Lao, Lianjie Cao, Bob Lantz, Minlan Yu, Tushar Krishna, Puneet Sharma
- Abstract summary: This paper lays the foundation for Genie, a testing framework that captures the impact of real hardware network behavior on ML workload performance. Genie uses CPU-initiated traffic over a hardware testbed to emulate GPU-to-GPU communication, and adapts the ASTRA-sim simulator to model interaction between the network and the ML workload.
- Score: 7.872558491633935
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper lays the foundation for Genie, a testing framework that captures the impact of real hardware network behavior on ML workload performance, without requiring expensive GPUs. Genie uses CPU-initiated traffic over a hardware testbed to emulate GPU-to-GPU communication, and adapts the ASTRA-sim simulator to model interaction between the network and the ML workload.
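The abstract's core idea, emulating GPU-to-GPU collective traffic with CPU-initiated transfers, can be illustrated with a minimal sketch. This is not Genie's implementation; it is an assumed toy model that computes the per-step message sizes of a ring allreduce and pushes equivalent bytes through a local socket pair, standing in for CPU-initiated NIC traffic. All names and sizes here are illustrative.

```python
import socket
import time

def ring_allreduce_steps(num_bytes: int, world_size: int):
    """Per-step message size (bytes) for a ring allreduce:
    2*(N-1) steps, each moving one 1/N-sized chunk."""
    chunk = num_bytes // world_size
    return [chunk] * (2 * (world_size - 1))

def emulate_transfer(payload_sizes, piece=32768):
    """Replay each step's payload over a local socket pair as a
    stand-in for CPU-initiated network traffic; returns elapsed seconds.
    Sends in small pieces so a single thread never fills the buffer."""
    a, b = socket.socketpair()
    start = time.perf_counter()
    for size in payload_sizes:
        sent = 0
        while sent < size:
            n = min(piece, size - sent)
            a.sendall(b"\x00" * n)
            got = 0
            while got < n:
                got += len(b.recv(piece))
            sent += n
    elapsed = time.perf_counter() - start
    a.close()
    b.close()
    return elapsed
```

On a real testbed the socket pair would be replaced by RDMA or TCP flows across actual switches, which is where hardware network behavior enters the measurement.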
Related papers
- RouteNet-Gauss: Hardware-Enhanced Network Modeling with Machine Learning [5.381741076460799]
This paper introduces RouteNet-Gauss, a novel integration of a testbed network with a Machine Learning (ML) model to address these challenges.
By using the testbed as a hardware accelerator, RouteNet-Gauss generates training datasets rapidly and simulates network scenarios with high fidelity to real-world conditions.
Experimental results show that RouteNet-Gauss significantly reduces prediction errors by up to 95% and achieves a 488x speedup in inference time compared to state-of-the-art DES-based methods.
arXiv Detail & Related papers (2025-01-15T15:00:11Z) - WDMoE: Wireless Distributed Mixture of Experts for Large Language Models [68.45482959423323]
Large Language Models (LLMs) have achieved significant success in various natural language processing tasks.
We propose a wireless distributed Mixture of Experts (WDMoE) architecture to enable collaborative deployment of LLMs across edge servers at the base station (BS) and mobile devices in wireless networks.
arXiv Detail & Related papers (2024-11-11T02:48:00Z) - Gaussian Splatting to Real World Flight Navigation Transfer with Liquid Networks [93.38375271826202]
We present a method to improve generalization and robustness to distribution shifts in sim-to-real visual quadrotor navigation tasks.
We first build a simulator by integrating Gaussian splatting with quadrotor flight dynamics, and then, train robust navigation policies using Liquid neural networks.
In this way, we obtain a full-stack imitation learning protocol that combines advances in 3D Gaussian splatting radiance field rendering, programming of expert demonstration training data, and the task understanding capabilities of Liquid networks.
arXiv Detail & Related papers (2024-06-21T13:48:37Z) - Towards Universal Performance Modeling for Machine Learning Training on Multi-GPU Platforms [4.959530958049395]
We develop a pipeline to characterize and predict the training performance of modern machine learning (ML) workloads on compute systems.
Our pipeline generalizes to other types of ML workloads, such as Transformer-based NLP models.
It is capable of generating insights such as quickly selecting the fastest embedding table sharding configuration.
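The sharding-selection insight mentioned above reduces to an argmin over predicted costs. The sketch below assumes a trained performance model (here replaced by a toy cost function with made-up numbers) and picks the embedding-table sharding configuration with the lowest predicted step time; every name and value is hypothetical.

```python
def pick_fastest_sharding(configs, predict_ms):
    """Return the sharding config with the lowest predicted step time.
    `predict_ms` stands in for a trained performance model."""
    return min(configs, key=predict_ms)

# Hypothetical sharding candidates for a set of embedding tables.
candidates = [
    {"strategy": "table_wise", "shards": 4},
    {"strategy": "row_wise", "shards": 8},
    {"strategy": "column_wise", "shards": 8},
]

def toy_predictor(cfg):
    # Stand-in cost model: more shards add all-to-all overhead;
    # the base costs are illustrative numbers only.
    base = {"table_wise": 12.0, "row_wise": 9.5, "column_wise": 10.2}
    return base[cfg["strategy"]] + 0.3 * cfg["shards"]
```

The value of such a pipeline is that `toy_predictor` is replaced by a model learned from profiled runs, so the argmin can be evaluated without launching each configuration on real GPUs.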
arXiv Detail & Related papers (2024-04-19T07:20:33Z) - Using the Abstract Computer Architecture Description Language to Model AI Hardware Accelerators [77.89070422157178]
Manufacturers of AI-integrated products face a critical challenge: selecting an accelerator that aligns with their product's performance requirements.
The Abstract Computer Architecture Description Language (ACADL) is a concise formalization of computer architecture block diagrams.
In this paper, we demonstrate how to use the ACADL to model AI hardware accelerators, use their ACADL description to map DNNs onto them, and explain the timing simulation semantics to gather performance results.
arXiv Detail & Related papers (2024-01-30T19:27:16Z) - In Situ Framework for Coupling Simulation and Machine Learning with Application to CFD [51.04126395480625]
Recent years have seen many successful applications of machine learning (ML) to facilitate fluid dynamic computations.
As simulations grow, generating new training datasets for traditional offline learning creates I/O and storage bottlenecks.
This work offers a solution by simplifying this coupling and enabling in situ training and inference on heterogeneous clusters.
arXiv Detail & Related papers (2023-06-22T14:07:54Z) - Sim-to-Real Transfer in Multi-agent Reinforcement Networking for Federated Edge Computing [11.3251009653699]
Federated Learning (FL) over wireless multi-hop edge computing networks is a cost-effective distributed on-device deep learning paradigm.
This paper presents the FedEdge simulator, a high-fidelity Linux-based simulator that enables fast prototyping and sim-to-real code and knowledge transfer for multi-hop FL systems.
arXiv Detail & Related papers (2021-10-18T00:21:07Z) - SimNet: Computer Architecture Simulation using Machine Learning [3.7019798164954336]
This work describes a concerted effort, where machine learning (ML) is used to accelerate discrete-event simulation.
A GPU-accelerated parallel simulator is implemented based on the proposed instruction latency predictor.
Its simulation accuracy and throughput are validated and evaluated against a state-of-the-art simulator.
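SimNet's approach, using a learned instruction-latency predictor inside a discrete-event simulator, can be sketched in miniature. The predictor below is an assumed lookup table, not SimNet's model, and the core model (in-order issue, one instruction per cycle) is a deliberately simplified stand-in.

```python
def predict_latency(instr):
    # Stand-in for a learned latency predictor: a fixed lookup table.
    table = {"add": 1, "mul": 3, "load": 4, "store": 3, "branch": 2}
    return table.get(instr["op"], 1)

def simulate(trace):
    """Toy core model: instructions issue in order, one per cycle,
    and complete predicted-latency cycles after issue. Returns the
    cycle at which the last instruction finishes."""
    finish = 0
    for i, instr in enumerate(trace):
        issue = i  # one instruction issued per cycle, in order
        finish = max(finish, issue + predict_latency(instr))
    return finish
```

In SimNet the lookup table is replaced by an ML model that conditions on instruction context, which is what allows the simulator to be batched and GPU-accelerated.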
arXiv Detail & Related papers (2021-05-12T17:31:52Z) - Using Machine Learning at Scale in HPC Simulations with SmartSim: An Application to Ocean Climate Modeling [52.77024349608834]
We demonstrate the first climate-scale, numerical ocean simulations improved through distributed, online inference of Deep Neural Networks (DNN) using SmartSim.
SmartSim is a library dedicated to enabling online analysis and Machine Learning (ML) for traditional HPC simulations.
arXiv Detail & Related papers (2021-04-13T19:27:28Z) - Usage of Network Simulators in Machine-Learning-Assisted 5G/6G Networks [9.390329421385415]
We examine the role of network simulators in bridging the gap between Machine Learning and communications systems.
We present an architectural integration of simulators in ML-aware networks for training, testing, and validating ML models before being applied to the operative network.
We illustrate the integration of network simulators into ML-assisted communications through a proof-of-concept testbed implementation of a residential Wi-Fi network.
arXiv Detail & Related papers (2020-05-17T15:32:59Z) - Taurus: A Data Plane Architecture for Per-Packet ML [59.1343317736213]
We present the design and implementation of Taurus, a data plane for line-rate inference.
Our evaluation of a Taurus switch ASIC shows that Taurus operates orders of magnitude faster than a server-based control plane.
arXiv Detail & Related papers (2020-02-12T09:18:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this content (including all information) and is not responsible for any consequences of its use.