GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes
- URL: http://arxiv.org/abs/2512.12091v1
- Date: Fri, 12 Dec 2025 23:46:05 GMT
- Title: GraphPerf-RT: A Graph-Driven Performance Model for Hardware-Aware Scheduling of OpenMP Codes
- Authors: Mohammad Pivezhandi, Mahdi Banisharif, Saeed Bakhshan, Abusayeed Saifullah, Ali Jannesari
- Abstract summary: We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context. We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi). Experiments show that MAMBRL-D3QN with GraphPerf-RT as the world model achieves 66% makespan reduction (0.97 +/- 0.35s) and 82% energy reduction (0.006 +/- 0.005J) compared to model-free baselines.
- Score: 7.204137900029132
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Performance prediction for OpenMP workloads on heterogeneous embedded SoCs is challenging due to complex interactions between task DAG structure, control-flow irregularity, cache and branch behavior, and thermal dynamics; classical heuristics struggle under workload irregularity, tabular regressors discard structural information, and model-free RL risks overheating resource-constrained devices. We introduce GraphPerf-RT, the first surrogate that unifies task DAG topology, CFG-derived code semantics, and runtime context (per-core DVFS, thermal state, utilization) in a heterogeneous graph representation with typed edges encoding precedence, placement, and contention. Multi-task evidential heads predict makespan, energy, cache and branch misses, and utilization with calibrated uncertainty (Normal-Inverse-Gamma), enabling risk-aware scheduling that filters low-confidence rollouts. We validate GraphPerf-RT on three embedded ARM platforms (Jetson TX2, Jetson Orin NX, RUBIK Pi), achieving R^2 > 0.95 with well-calibrated uncertainty (ECE < 0.05). To demonstrate end-to-end scheduling utility, we integrate the surrogate with four RL methods on Jetson TX2: single-agent model-free (SAMFRL), single-agent model-based (SAMBRL), multi-agent model-free (MAMFRL-D3QN), and multi-agent model-based (MAMBRL-D3QN). Experiments across 5 seeds (200 episodes each) show that MAMBRL-D3QN with GraphPerf-RT as the world model achieves 66% makespan reduction (0.97 +/- 0.35s) and 82% energy reduction (0.006 +/- 0.005J) compared to model-free baselines, demonstrating that accurate, uncertainty-aware surrogates enable effective model-based planning on thermally constrained embedded systems.
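The abstract describes evidential heads that output Normal-Inverse-Gamma (NIG) parameters and a scheduler that filters low-confidence rollouts. The sketch below illustrates that idea using the standard deep evidential regression formulas for aleatoric and epistemic variance; the paper's exact parameterization may differ. The class name, the `filter_rollouts` helper, the candidate names, the numeric values, and the threshold are all hypothetical and are not taken from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class NIGPrediction:
    """NIG parameters for one predicted target, e.g. makespan (illustrative only)."""
    gamma: float  # predicted mean
    nu: float     # virtual evidence for the mean, must be > 0
    alpha: float  # inverse-gamma shape, must be > 1 for finite variance
    beta: float   # inverse-gamma scale, must be > 0

    def mean(self) -> float:
        return self.gamma

    def aleatoric_var(self) -> float:
        # Expected data noise: E[sigma^2] = beta / (alpha - 1)
        return self.beta / (self.alpha - 1.0)

    def epistemic_var(self) -> float:
        # Uncertainty about the mean: Var[mu] = beta / (nu * (alpha - 1))
        return self.beta / (self.nu * (self.alpha - 1.0))


def filter_rollouts(rollouts, max_epistemic_std):
    """Keep (name, prediction) pairs whose epistemic std is within the threshold."""
    return [
        (name, pred)
        for name, pred in rollouts
        if math.sqrt(pred.epistemic_var()) <= max_epistemic_std
    ]


# Hypothetical candidate schedules with made-up NIG outputs for makespan (seconds).
candidates = [
    ("big_cores_high_freq", NIGPrediction(gamma=0.9, nu=50.0, alpha=3.0, beta=0.02)),
    ("little_cores_low_freq", NIGPrediction(gamma=0.7, nu=0.5, alpha=1.5, beta=0.2)),
]

# Drop rollouts the surrogate is unsure about, then pick the lowest predicted makespan.
trusted = filter_rollouts(candidates, max_epistemic_std=0.1)
best = min(trusted, key=lambda item: item[1].mean())
print(best[0])
```

In this toy input the second candidate has the lower predicted makespan but very little evidence behind it (small `nu`, `alpha` close to 1), so it is filtered out before the selection step. A planner that used only the predicted mean would have chosen it.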
Related papers
- Accurate Network Traffic Matrix Prediction via LEAD: a Large Language Model-Enhanced Adapter-Based Conditional Diffusion Model [41.23125529149133]
We present LEAD, a conditional diffusion model that transforms traffic matrices into RGB images.
We also propose a Dual-Conditioning Strategy to guide a diffusion model to generate complex, dynamic network traffic.
Experiments on the Abilene and GEANT datasets demonstrate that LEAD outperforms all baselines.
arXiv Detail & Related papers (2026-01-29T09:16:05Z)
- LiQSS: Post-Transformer Linear Quantum-Inspired State-Space Tensor Networks for Real-Time 6G [85.58816960936069]
Proactive and agentic control in Sixth-Generation (6G) Open Radio Access Networks (O-RAN) requires control-grade prediction under stringent Near-Real-Time (Near-RT) latency and computational constraints.
This paper investigates a post-Transformer paradigm for efficient radio telemetry forecasting.
We propose a quantum-inspired state-space tensor network that replaces self-attention with stable structured state-space dynamics kernels.
arXiv Detail & Related papers (2026-01-18T12:08:38Z)
- ZeroDVFS: Zero-Shot LLM-Guided Core and Frequency Allocation for Embedded Platforms [7.633618497843279]
We propose a model-based hierarchical multi-agent reinforcement learning (MARL) framework for thermal- and energy-aware scheduling on multi-core platforms.
First-decision latency is 8,300x faster than table-based profiling, enabling practical deployment in dynamic embedded systems.
arXiv Detail & Related papers (2026-01-13T02:56:06Z)
- Energy Scaling Laws for Diffusion Models: Quantifying Compute and Carbon Emissions in Image Generation [50.21021246855702]
We propose an adaptation of Kaplan scaling laws to predict GPU energy consumption for diffusion models based on computational complexity (FLOPs).
Our approach decomposes diffusion model inference into text encoding, iterative denoising, and decoding components, with the hypothesis that denoising operations dominate energy consumption due to their repeated execution across multiple inference steps.
Our results validate the compute-bound nature of diffusion inference and provide a foundation for sustainable AI deployment planning and carbon footprint estimation.
arXiv Detail & Related papers (2025-11-21T08:12:47Z)
- Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning [70.56067503630486]
We argue that sixth-generation (6G) intelligence is not fluent token prediction but the calibrated capacity to imagine and choose.
We show that WM-MS3M cuts mean absolute error (MAE) by 1.69% versus MS3M with 32% fewer parameters and similar latency, and achieves 35-80% lower root mean squared error (RMSE) than attention/hybrid baselines with 2.3-4.1x faster inference.
arXiv Detail & Related papers (2025-11-04T17:22:22Z)
- From Embeddings to Equations: Genetic-Programming Surrogates for Interpretable Transformer Classification [9.17282078449475]
We study symbolic surrogate modeling of frozen Transformer embeddings to obtain compact, auditable classifiers with calibrated probabilities.
For five benchmarks (SST2G, 20NG, MNIST, CIFAR10, MSC17), embeddings from ModernBERT, DINOv2, and SigLIP are partitioned on the training set into disjoint, information-preserving views.
A cooperative multi-population genetic program (MEGP) then learns additive, closed-form logit programs over these views.
arXiv Detail & Related papers (2025-09-16T02:17:04Z)
- PRISM: Distributed Inference for Foundation Models at Edge [73.54372283220444]
PRISM is a communication-efficient and compute-aware strategy for distributed Transformer inference on edge devices.
We evaluate PRISM on ViT, BERT, and GPT-2 across diverse datasets.
arXiv Detail & Related papers (2025-07-16T11:25:03Z)
- Constraints and Variables Reduction for Optimal Power Flow Using Hierarchical Graph Neural Networks with Virtual Node-Splitting [0.24554686192257422]
Power system networks are often modeled as homogeneous graphs, which limits the ability of graph neural networks (GNNs) to capture individual generator features at the same nodes.
By introducing the proposed virtual node-splitting strategy, generator-level attributes like costs, limits, and ramp rates can be fully captured by GNN models.
A two-stage adaptive hierarchical GNN is developed to (i) predict critical lines that would be congested, and then (ii) predict base generators that would operate at the maximum capacity.
arXiv Detail & Related papers (2024-11-09T19:46:28Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Application Performance Modeling via Tensor Completion [6.399089940376445]
We show that low-rank canonical-polyadic (CP) tensor decomposition is effective in approximating these tensors.
We then employ tensor completion to optimize a CP decomposition given a sparse set of observed execution times.
arXiv Detail & Related papers (2022-10-18T22:12:29Z)
This list is automatically generated from the titles and abstracts of the papers in this site.