SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems
- URL: http://arxiv.org/abs/2511.11111v1
- Date: Fri, 14 Nov 2025 09:39:43 GMT
- Title: SMART: A Surrogate Model for Predicting Application Runtime in Dragonfly Systems
- Authors: Xin Wang, Pietro Lodi Rizzini, Sourav Medya, Zhiling Lan
- Abstract summary: We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.
- Score: 13.688119091055244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The Dragonfly network, with its high-radix and low-diameter structure, is a leading interconnect in high-performance computing. A major challenge is workload interference on shared network links. Parallel discrete event simulation (PDES) is commonly used to analyze workload interference. However, high-fidelity PDES is computationally expensive, making it impractical for large-scale or real-time scenarios. Hybrid simulation that incorporates data-driven surrogate models offers a promising alternative, especially for forecasting application runtime, a task complicated by the dynamic behavior of network traffic. We present SMART, a surrogate model that combines graph neural networks (GNNs) and large language models (LLMs) to capture both spatial and temporal patterns from port-level router data. SMART outperforms existing statistical and machine learning baselines, enabling accurate runtime prediction and supporting efficient hybrid simulation of Dragonfly networks.
Related papers
- HN-MVTS: HyperNetwork-based Multivariate Time Series Forecasting [2.4409745336261457]
HN-MVTS is a novel architecture that integrates a hypernetwork-based generative prior with an arbitrary neural network forecasting model. To restrict the number of new parameters, the hypernetwork learns to generate the weights of the last layer of the target forecasting networks. Experiments on eight benchmark datasets demonstrate that application of HN-MVTS to the state-of-the-art models typically improves their performance.
arXiv Detail & Related papers (2025-11-11T15:17:15Z) - A Wireless Foundation Model for Multi-Task Prediction [50.21098141769079]
We propose a unified foundation model for multi-task prediction in wireless networks that supports diverse prediction intervals. After being trained on large-scale datasets, the proposed foundation model demonstrates strong generalization to unseen scenarios and zero-shot performance on new tasks.
arXiv Detail & Related papers (2025-07-08T12:37:55Z) - Decoupling Spatio-Temporal Prediction: When Lightweight Large Models Meet Adaptive Hypergraphs [12.867023510751787]
STH-SepNet is a novel framework that decouples temporal and spatial expressiveness to improve both efficiency and precision. STH-SepNet offers a pragmatic and scalable solution for spatio-temporal prediction in real-world applications. This work may provide a promising lightweight framework for spatio-temporal prediction, aiming to reduce computational demands while enhancing predictive performance.
arXiv Detail & Related papers (2025-05-26T07:37:39Z) - World Model-Based Learning for Long-Term Age of Information Minimization in Vehicular Networks [53.98633183204453]
In this paper, a novel world model-based learning framework is proposed to minimize packet-completeness-aware age of information (CAoI) in a vehicular network. A world model framework is proposed to jointly learn a dynamic model of the mmWave V2X environment and use it to imagine trajectories for learning how to perform link scheduling. In particular, the long-term policy is learned in differentiable imagined trajectories instead of environment interactions.
arXiv Detail & Related papers (2025-05-03T06:23:18Z) - NetFlowGen: Leveraging Generative Pre-training for Network Traffic Dynamics [72.95483148058378]
We propose to pre-train a general-purpose machine learning model to capture traffic dynamics with only traffic data from NetFlow records. We address challenges such as unifying network feature representations, learning from large unlabeled traffic data volume, and testing on real downstream tasks in DDoS attack detection.
arXiv Detail & Related papers (2024-12-30T00:47:49Z) - RACH Traffic Prediction in Massive Machine Type Communications [5.416701003120508]
This paper presents a machine learning-based framework tailored for forecasting bursty traffic in ALOHA networks. We develop a new low-complexity online prediction algorithm that updates the states of the LSTM network by leveraging frequently collected data from the mMTC network. We evaluate the performance of the proposed framework in a network with a single base station and thousands of devices organized into groups with distinct traffic-generating characteristics.
arXiv Detail & Related papers (2024-05-08T17:28:07Z) - Runtime Construction of Large-Scale Spiking Neuronal Network Models on GPU Devices [0.0]
We propose a new method for creating network connections interactively, dynamically, and directly in GPU memory.
We validate the simulation performance with both consumer and data center GPUs on two neuroscientifically relevant models.
Both network construction and simulation times are comparable or shorter than those obtained with other state-of-the-art simulation technologies.
arXiv Detail & Related papers (2023-06-16T14:08:27Z) - Online Evolutionary Neural Architecture Search for Multivariate Non-Stationary Time Series Forecasting [72.89994745876086]
This work presents the Online Neuro-Evolution-based Neural Architecture Search (ONE-NAS) algorithm.
ONE-NAS is a novel neural architecture search method capable of automatically designing and dynamically training recurrent neural networks (RNNs) for online forecasting tasks.
Results demonstrate that ONE-NAS outperforms traditional statistical time series forecasting methods.
arXiv Detail & Related papers (2023-02-20T22:25:47Z) - An advanced spatio-temporal convolutional recurrent neural network for
storm surge predictions [73.4962254843935]
We study the capability of artificial neural network models to emulate storm surge based on the storm track/size/intensity history.
This study presents a neural network model that can predict storm surge, informed by a database of synthetic storm simulations.
arXiv Detail & Related papers (2022-04-18T23:42:18Z) - Action-Conditional Recurrent Kalman Networks For Forward and Inverse
Dynamics Learning [17.80270555749689]
Estimating accurate forward and inverse dynamics models is a crucial component of model-based control for robots.
We present two architectures for forward model learning and one for inverse model learning.
Both architectures significantly outperform existing model learning frameworks as well as analytical models in terms of prediction performance.
arXiv Detail & Related papers (2020-10-20T11:28:25Z) - Model Fusion via Optimal Transport [64.13185244219353]
We present a layer-wise model fusion algorithm for neural networks.
We show that this can successfully yield "one-shot" knowledge transfer between neural networks trained on heterogeneous non-i.i.d. data.
arXiv Detail & Related papers (2019-10-12T22:07:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.