Intelligent colocation of HPC workloads
- URL: http://arxiv.org/abs/2103.09019v1
- Date: Tue, 16 Mar 2021 12:35:35 GMT
- Title: Intelligent colocation of HPC workloads
- Authors: Felippe V. Zacarias (1, 2 and 3), Vinicius Petrucci (1 and 5), Rajiv
Nishtala (4), Paul Carpenter (3) and Daniel Mossé (5) ((1) Universidade
Federal da Bahia, (2) Universitat Politècnica de Catalunya, (3) Barcelona
Supercomputing Center, (4) Coop, Norway/Norwegian University of Science and
Technology, Norway, (5) University of Pittsburgh)
- Abstract summary: Many HPC applications suffer from a bottleneck in the shared caches, instruction execution units, I/O or memory bandwidth, even though the remaining resources may be underutilized.
It is hard for developers and runtime systems to ensure that all critical resources are fully exploited by a single application, so an attractive technique is to colocate multiple applications on the same server.
We show that server efficiency can be improved by first modeling the expected performance degradation of colocated applications based on measured hardware performance counters.
- Score: 0.0
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Many HPC applications suffer from a bottleneck in the shared caches,
instruction execution units, I/O or memory bandwidth, even though the remaining
resources may be underutilized. It is hard for developers and runtime systems
to ensure that all critical resources are fully exploited by a single
application, so an attractive technique for increasing HPC system utilization
is to colocate multiple applications on the same server. When applications
share critical resources, however, contention on shared resources may lead to
reduced application performance.
In this paper, we show that server efficiency can be improved by first
modeling the expected performance degradation of colocated applications based
on measured hardware performance counters, and then exploiting the model to
determine an optimized mix of colocated applications. This paper presents a new
intelligent resource manager and makes the following contributions: (1) a new
machine learning model to predict the performance degradation of colocated
applications based on hardware counters and (2) an intelligent scheduling
scheme deployed on an existing resource manager to enable application
co-scheduling with minimum performance degradation. Our results show that our
approach achieves performance improvements of 7% (avg) and 12% (max) compared
to the standard policy commonly used by existing job managers.
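The core idea in the abstract — learn a model that maps hardware-counter profiles of candidate applications to an expected colocation slowdown, then let the scheduler pick the pairing with the least predicted degradation — can be sketched as follows. This is an illustrative toy, not the paper's implementation: the counter set, the synthetic training pairs, and the use of a plain least-squares model (in place of the paper's machine-learning model) are all assumptions.

```python
import itertools
import numpy as np

# Hypothetical per-application hardware-counter profiles.
# Columns (illustrative, not the paper's counter set):
# [LLC misses / 1k instr, memory bandwidth (GB/s), IPC]
profiles = {
    "app_A": np.array([12.0, 8.5, 1.1]),  # memory-hungry
    "app_B": np.array([2.0, 1.0, 2.3]),   # compute-bound
    "app_C": np.array([9.0, 7.0, 1.3]),   # memory-hungry
}

# Toy training data: concatenated counter profiles of a colocated pair ->
# measured slowdown of that pair (1.0 = no degradation). Values are invented.
X_train = np.array([
    [12.0, 8.5, 1.1, 9.0, 7.0, 1.3],  # two memory-hungry apps: high contention
    [12.0, 8.5, 1.1, 2.0, 1.0, 2.3],  # memory-hungry + compute-bound: mild
    [2.0, 1.0, 2.3, 9.0, 7.0, 1.3],
    [2.0, 1.0, 2.3, 2.0, 1.0, 2.3],   # two compute-bound apps: near-ideal
])
y_train = np.array([1.45, 1.08, 1.10, 1.02])

# Fit an ordinary-least-squares linear model as a stand-in for the paper's
# ML model; a bias column is appended to the features.
X_aug = np.hstack([X_train, np.ones((len(X_train), 1))])
w, *_ = np.linalg.lstsq(X_aug, y_train, rcond=None)

def predict_slowdown(a, b):
    """Predict colocation slowdown for a pair of application profiles."""
    x = np.concatenate([profiles[a], profiles[b], [1.0]])
    return float(x @ w)

# Scheduler step: among candidate pairings, colocate the pair with the
# lowest predicted degradation.
pairs = list(itertools.combinations(profiles, 2))
best = min(pairs, key=lambda p: predict_slowdown(*p))
print(best, round(predict_slowdown(*best), 2))
```

Under these invented profiles, the model steers the scheduler away from pairing the two memory-hungry applications, which is exactly the contention-avoidance behavior the paper's resource manager aims for.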
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both the inference accuracy and mean square error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z)
- A Reinforcement Learning Approach for Performance-aware Reduction in Power Consumption of Data Center Compute Nodes [0.46040036610482665]
We use Reinforcement Learning to design a power capping policy on cloud compute nodes.
We show how a trained agent running on actual hardware can take actions by balancing power consumption and application performance.
arXiv Detail & Related papers (2023-08-15T23:25:52Z)
- PBScaler: A Bottleneck-aware Autoscaling Framework for Microservice-based Applications [6.453782169615384]
We propose PBScaler, a bottleneck-aware autoscaling framework for microservice-based applications.
We show that PBScaler outperforms existing approaches while conserving resources efficiently.
arXiv Detail & Related papers (2023-03-26T04:20:17Z)
- Dynamic Resource Allocation for Metaverse Applications with Deep Reinforcement Learning [64.75603723249837]
This work proposes a novel framework to dynamically manage and allocate different types of resources for Metaverse applications.
We first propose an effective solution to divide applications into groups, namely MetaInstances, where common functions can be shared among applications.
Then, to capture the real-time, dynamic, and uncertain characteristics of request arrival and application departure processes, we develop a semi-Markov decision process-based framework.
arXiv Detail & Related papers (2023-02-27T00:30:01Z)
- Heterogeneous Data-Centric Architectures for Modern Data-Intensive Applications: Case Studies in Machine Learning and Databases [9.927754948343326]
Processing-in-memory (PIM) is a promising execution paradigm that alleviates the data movement bottleneck in modern applications.
In this paper, we show how to take advantage of the PIM paradigm for two modern data-intensive applications.
arXiv Detail & Related papers (2022-05-29T13:43:17Z)
- U-Boost NAS: Utilization-Boosted Differentiable Neural Architecture Search [50.33956216274694]
Optimizing resource utilization on target platforms is key to achieving high performance during DNN inference.
We propose a novel hardware-aware NAS framework that does not only optimize for task accuracy and inference latency, but also for resource utilization.
We achieve 2.8 - 4x speedup for DNN inference compared to prior hardware-aware NAS methods.
arXiv Detail & Related papers (2022-03-23T13:44:15Z)
- Optimising Resource Management for Embedded Machine Learning [23.00896228073755]
Machine learning inference is increasingly being executed locally on mobile and embedded platforms.
We show approaches for online resource management in heterogeneous multi-core systems.
arXiv Detail & Related papers (2021-05-08T06:10:05Z)
- Optimizing Deep Learning Recommender Systems' Training On CPU Cluster Architectures [56.69373580921888]
We focus on Recommender Systems, which account for most of the AI cycles in cloud computing centers.
By enabling it to run on the latest CPU hardware and software tailored for HPC, we achieve more than two orders of magnitude improvement in performance.
arXiv Detail & Related papers (2020-05-10T14:40:16Z)
- The Case for Learning Application Behavior to Improve Hardware Energy Efficiency [2.4425948078034847]
We propose to use the harvested knowledge to tune hardware configurations.
Our proposed approach, called FORECASTER, uses a deep learning model to learn what configuration of hardware resources provides the optimal energy efficiency for a certain behavior of an application.
Our results show that FORECASTER can save as much as 18.4% system power over the baseline setup with all resources.
arXiv Detail & Related papers (2020-04-27T18:11:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.