Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC
- URL: http://arxiv.org/abs/2601.15399v1
- Date: Wed, 21 Jan 2026 19:11:12 GMT
- Title: Attention-Informed Surrogates for Navigating Power-Performance Trade-offs in HPC
- Authors: Ashna Nawar Ahmed, Banooqa Banday, Terry Jones, Tanzima Z. Islam
- Abstract summary: We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem.
- Score: 0.5219568203653523
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: High-Performance Computing (HPC) schedulers must balance user performance with facility-wide resource constraints. The task boils down to selecting the optimal number of nodes for a given job. We present a surrogate-assisted multi-objective Bayesian optimization (MOBO) framework to automate this complex decision. Our core hypothesis is that surrogate models informed by attention-based embeddings of job telemetry can capture performance dynamics more effectively than standard regression techniques. We pair this with an intelligent sample acquisition strategy to ensure the approach is data-efficient. On two production HPC datasets, our embedding-informed method consistently identified higher-quality Pareto fronts of runtime-power trade-offs compared to baselines. Furthermore, our intelligent data sampling strategy drastically reduced training costs while improving the stability of the results. To our knowledge, this is the first work to successfully apply embedding-informed surrogates in a MOBO framework to the HPC scheduling problem, jointly optimizing for performance and power on production workloads.
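The runtime-power Pareto front mentioned in the abstract can be illustrated with a minimal sketch. The candidate node counts and their runtime and power values below are made-up numbers, not results from the paper; in the described framework such values would presumably come from surrogate-model predictions, which this sketch does not attempt to reproduce. The helper name `pareto_front` is hypothetical.

```python
from typing import List, Tuple

# Hypothetical candidates: (node_count, runtime_seconds, power_watts).
# These numbers are illustrative assumptions, not data from the paper.
Candidate = Tuple[int, float, float]


def pareto_front(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that are not dominated in (runtime, power).

    Both objectives are minimized. A candidate is dominated if another one
    is no worse in both objectives and strictly better in at least one.
    """
    front = []
    for c in candidates:
        dominated = any(
            o[1] <= c[1] and o[2] <= c[2] and (o[1] < c[1] or o[2] < c[2])
            for o in candidates
        )
        if not dominated:
            front.append(c)
    # Sort by runtime so the trade-off curve reads from fastest to slowest.
    return sorted(front, key=lambda c: c[1])


candidates = [
    (1, 400.0, 300.0),
    (2, 220.0, 580.0),
    (3, 230.0, 870.0),    # slower and more power than 2 nodes: dominated
    (4, 130.0, 1150.0),
    (8, 135.0, 2300.0),   # slower and more power than 4 nodes: dominated
]
front = pareto_front(candidates)
for nodes, runtime, power in front:
    print(f"{nodes} nodes: {runtime:.0f} s, {power:.0f} W")
```

Each point on the resulting front is a node count for which no other option is both faster and less power-hungry; a scheduler would pick among these according to facility constraints.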
Related papers
- AceGRPO: Adaptive Curriculum Enhanced Group Relative Policy Optimization for Autonomous Machine Learning Engineering [52.67783579040657]
AceGRPO is a machine learning system that prioritizes tasks at the agent's learning frontier to maximize learning efficiency. Our trained Ace-30B model achieves a 100% valid submission rate on MLE-Bench-Lite, approaches the performance of proprietary frontier models, and outperforms larger open-source baselines.
arXiv Detail & Related papers (2026-02-08T10:55:03Z)
- LeJOT: An Intelligent Job Cost Orchestration Solution for Databricks Platform [28.16213013287002]
We introduce LeJOT, an intelligent job cost orchestration framework for Databricks jobs. LeJOT proactively predicts workload demands, dynamically allocates computing resources, and minimizes costs. We show that LeJOT achieves an average 20% reduction in cloud computing costs within a minute-level scheduling timeframe.
arXiv Detail & Related papers (2025-12-20T08:09:58Z)
- A Foundation Model for Massive MIMO Precoding with an Adaptive per-User Rate-Power Tradeoff [4.8310710966636545]
We propose a transformer-based foundation model for mMIMO precoding that seeks to minimize the energy consumption of the transmitter while dynamically adapting to per-user rate requirements. At equal energy consumption, zero-shot deployment of the proposed foundation model significantly outperforms zero forcing, and approaches weighted minimum mean squared error performance with 8x less complexity. Our work enables the implementation of DL-based solutions in practice by addressing challenges of data availability and training complexity.
arXiv Detail & Related papers (2025-07-24T17:10:06Z)
- Evaluating the Efficacy of LLM-Based Reasoning for Multiobjective HPC Job Scheduling [6.375075345747834]
This work evaluates a Large Language Model (LLM)-based scheduler using a ReAct-style (Reason + Act) framework. The system incorporates a scratchpad memory to track scheduling history and refine decisions via natural language feedback. We evaluate our approach using OpenAI's O4-Mini and Anthropic's Claude 3.7 across seven real-world HPC workload scenarios.
arXiv Detail & Related papers (2025-05-29T14:25:29Z)
- Preference Optimization for Combinatorial Optimization Problems [54.87466279363487]
Reinforcement Learning (RL) has emerged as a powerful tool for neural optimization, enabling models to learn to solve complex problems without requiring expert knowledge. Despite significant progress, existing RL approaches face challenges such as diminishing reward signals and inefficient exploration in vast action spaces. We propose Preference Optimization, a novel method that transforms quantitative reward signals into qualitative preference signals via statistical comparison modeling.
arXiv Detail & Related papers (2025-05-13T16:47:00Z)
- DARS: Dynamic Action Re-Sampling to Enhance Coding Agent Performance by Adaptive Tree Traversal [55.13854171147104]
Large Language Models (LLMs) have revolutionized various domains, including natural language processing, data analysis, and software development. We present Dynamic Action Re-Sampling (DARS), a novel inference-time compute scaling approach for coding agents. We evaluate our approach on the SWE-Bench Lite benchmark, demonstrating that this scaling strategy achieves a pass@k score of 55% with Claude 3.5 Sonnet V2.
arXiv Detail & Related papers (2025-03-18T14:02:59Z)
- ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning [50.53705050673944]
We propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet.
arXiv Detail & Related papers (2025-03-08T07:03:43Z)
- Learning-enabled Flexible Job-shop Scheduling for Scalable Smart Manufacturing [11.509669981978874]
In smart manufacturing systems, flexible job-shop scheduling with transportation constraints is essential to optimize solutions for maximizing productivity.
Recent developments in deep reinforcement learning (DRL)-based methods for FJSPT have encountered a scale generalization challenge.
We introduce a novel graph-based DRL method, named the Heterogeneous Graph Scheduler (HGS).
arXiv Detail & Related papers (2024-02-14T06:49:23Z)
- Machine Learning Insides OptVerse AI Solver: Design Principles and Applications [74.67495900436728]
We present a comprehensive study on the integration of machine learning (ML) techniques into Huawei Cloud's OptVerse AI solver.
We showcase our methods for generating complex SAT and MILP instances utilizing generative models that mirror the multifaceted structures of real-world problems.
We detail the incorporation of state-of-the-art parameter tuning algorithms which markedly elevate solver performance.
arXiv Detail & Related papers (2024-01-11T15:02:15Z)
- Orchestration of Emulator Assisted Mobile Edge Tuning for AI Foundation Models: A Multi-Agent Deep Reinforcement Learning Approach [10.47302625959368]
We present a groundbreaking paradigm integrating Mobile Edge Computing with foundation models, specifically designed to enhance local task performance on user equipment (UE).
Central to our approach is the innovative Emulator-Adapter architecture, segmenting the foundation model into two cohesive modules.
We introduce an advanced resource allocation mechanism that is fine-tuned to the needs of the Emulator-Adapter structure in decentralized settings.
arXiv Detail & Related papers (2023-10-26T15:47:51Z)
- Mastering the Unsupervised Reinforcement Learning Benchmark from Pixels [112.63440666617494]
Reinforcement learning algorithms can succeed but require large amounts of interactions between the agent and the environment.
We propose a new method to solve it, using unsupervised model-based RL, for pre-training the agent.
We show robust performance on the Real-World RL benchmark, hinting at resiliency to environment perturbations during adaptation.
arXiv Detail & Related papers (2022-09-24T14:22:29Z)
- On Effective Scheduling of Model-based Reinforcement Learning [53.027698625496015]
We propose a framework named AutoMBPO to automatically schedule the real data ratio.
In this paper, we first theoretically analyze the role of real data in policy training, which suggests that gradually increasing the ratio of real data yields better performance.
arXiv Detail & Related papers (2021-11-16T15:24:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.