HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling
- URL: http://arxiv.org/abs/2508.20016v2
- Date: Thu, 28 Aug 2025 01:16:49 GMT
- Title: HPC Digital Twins for Evaluating Scheduling Policies, Incentive Structures and their Impact on Power and Cooling
- Authors: Matthias Maiterth, Wesley H. Brewer, Jaya S. Kuruvella, Arunavo Dey, Tanzima Z. Islam, Kevin Menear, Dmitry Duplyakin, Rashadul Kabir, Tapasya Patki, Terry Jones, Feiyi Wang
- Abstract summary: We present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets.
- Score: 0.9681568030660136
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Schedulers are critical for optimal resource utilization in high-performance computing. Traditional methods to evaluate schedulers are limited to post-deployment analysis or to simulators, which do not model the associated infrastructure. In this work, we present the first-of-its-kind integration of scheduling and digital twins in HPC. This enables what-if studies to understand the impact of parameter configurations and scheduling decisions on the physical assets, even before deployment, or regarding changes not easily realizable in production. We (1) provide the first digital twin framework extended with scheduling capabilities, (2) integrate various top-tier HPC systems given their publicly available datasets, and (3) implement extensions to integrate external scheduling simulators. Finally, we show how to (4) implement and evaluate incentive structures, as well as (5) evaluate machine-learning-based scheduling, in this novel digital-twin-based meta-framework for prototyping scheduling. Our work enables what-if scenarios of HPC systems to evaluate sustainability and the impact on the simulated system.
Related papers
- What Artificial Intelligence can do for High-Performance Computing systems? [0.0]
This review assesses how artificial intelligence (AI), including machine learning (ML) and optimization, improves the efficiency of operational HPC systems. Approximately 1,800 publications from 2019 to 2025 were manually screened using predefined inclusion/exclusion criteria. 74 "AI for HPC" papers were retained and grouped into six application areas: performance estimation, performance optimization, scheduling, surrogate modeling, fault detection, and language-model-based automation.
arXiv Detail & Related papers (2026-01-03T19:25:23Z) - Optimizing Fairness in Production Planning: A Human-Centric Approach to Machine and Workforce Allocation [55.71151342699622]
The proposed system is validated through 16 test sessions with domain experts from the automotive industry. Results indicate that the CP-based scheduling approach produces compact, feasible production plans with low tardiness.
arXiv Detail & Related papers (2025-10-01T16:41:18Z) - K2-Think: A Parameter-Efficient Reasoning System [80.62468969966133]
K2-Think is a reasoning system that achieves state-of-the-art performance with a 32B parameter model. Our system shows that smaller models can compete at the highest levels by combining advanced post-training and test-time techniques. K2-Think is freely available at k2think.ai, offering best-in-class inference speeds of over 2,000 tokens per second per request.
arXiv Detail & Related papers (2025-09-09T11:25:55Z) - Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling [6.623504719591386]
The Large Language Model (LLM)-based scheduler uses a ReAct-style framework (Reason + Act). The system incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback. We evaluate our approach using OpenAI's O4-Mini and Anthropic's Claude 3.7 across seven real-world HPC workload scenarios.
arXiv Detail & Related papers (2025-05-29T14:25:29Z) - Data Scaling Laws for End-to-End Autonomous Driving [83.85463296830743]
We evaluate the performance of a simple end-to-end driving architecture on internal driving datasets ranging in size from 16 to 8192 hours. Specifically, we investigate how much additional training data is needed to achieve a target performance gain.
arXiv Detail & Related papers (2025-04-06T03:23:48Z) - Rethinking Resource Management in Edge Learning: A Joint Pre-training and Fine-tuning Design Paradigm [87.47506806135746]
In some applications, edge learning is experiencing a shift in focus from conventional learning from scratch to new two-stage learning.
This paper considers the problem of joint communication and computation resource management in a two-stage edge learning system.
It is shown that the proposed joint resource management over the pre-training and fine-tuning stages well balances the system performance trade-off.
arXiv Detail & Related papers (2024-04-01T00:21:11Z) - A digital twin framework for civil engineering structures [0.6249768559720122]
The digital twin concept represents an appealing opportunity to advance condition-based and predictive maintenance paradigms.
This work proposes a predictive digital twin approach to the health monitoring, maintenance, and management planning of civil engineering structures.
arXiv Detail & Related papers (2023-08-02T21:38:36Z) - A Dynamic Feedforward Control Strategy for Energy-efficient Building System Operation [59.56144813928478]
Most current control strategies and optimization algorithms rely on receiving information from real-time feedback.
We propose an engineer-friendly control strategy framework that embeds dynamic prior knowledge from building system characteristics simultaneously for system control.
We tested it in a heating system control case with typical control strategies, which shows that our framework has a further energy-saving potential of 15%.
arXiv Detail & Related papers (2023-01-23T09:07:07Z) - Federated Stochastic Gradient Descent Begets Self-Induced Momentum [151.4322255230084]
Federated learning (FL) is an emerging machine learning method that can be applied in mobile edge systems.
We show that running stochastic gradient descent (SGD) in such a setting can be viewed as adding a momentum-like term to the global aggregation process.
arXiv Detail & Related papers (2022-02-17T02:01:37Z) - SOLIS -- The MLOps journey from data acquisition to actionable insights [62.997667081978825]
In this paper we present a unified deployment pipeline and freedom-to-operate approach that supports all requirements while using basic cross-platform tensor framework and script language engines.
This approach however does not supply the needed procedures and pipelines for the actual deployment of machine learning capabilities in real production grade systems.
arXiv Detail & Related papers (2021-12-22T14:45:37Z) - An Extensible Benchmark Suite for Learning to Simulate Physical Systems [60.249111272844374]
We introduce a set of benchmark problems to take a step towards unified benchmarks and evaluation protocols.
We propose four representative physical systems, as well as a collection of both widely used classical time-based and representative data-driven methods.
arXiv Detail & Related papers (2021-08-09T17:39:09Z) - A Scalable and Reproducible System-on-Chip Simulation for Reinforcement Learning [0.0]
This paper proffers gym-ds3, a scalable and reproducible open environment tailored for a high-fidelity Domain-Specific System-on-Chip (DSSoC) application.
The simulation supports scheduling hierarchical jobs onto heterogeneous System-on-Chip (SoC) processors and bridges the system to reinforcement learning research.
arXiv Detail & Related papers (2021-04-27T13:46:57Z) - Deep Reinforcement Agent for Scheduling in HPC [1.6569798882223303]
A cluster scheduler determines when and which user jobs should be allocated to available system resources.
In this work, we present an automated HPC scheduling agent named DRAS (Deep Reinforcement Agent for Scheduling) by leveraging deep reinforcement learning.
arXiv Detail & Related papers (2021-02-11T20:08:38Z)