Related papers: GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters

URL: http://arxiv.org/abs/2510.15652v1
Date: Fri, 17 Oct 2025 13:44:10 GMT
Title: GOGH: Correlation-Guided Orchestration of GPUs in Heterogeneous Clusters
Authors: Ahmad Raeisi, Mahdi Dolati, Sina Darabi, Sadegh Talebi, Patrick Eugster, Ahmad Khonsari
Abstract summary: We propose a learning-based architecture for managing machine learning workloads in heterogeneous clusters. The system operates online, allocating resources to incoming training or inference requests while minimizing energy consumption.
Score: 4.241410532880399
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The growing demand for computational resources in machine learning has made efficient resource allocation a critical challenge, especially in heterogeneous hardware clusters where devices vary in capability, age, and energy efficiency. Upgrading to the latest hardware is often infeasible, making sustainable use of existing, mixed-generation resources essential. In this paper, we propose a learning-based architecture for managing machine learning workloads in heterogeneous clusters. The system operates online, allocating resources to incoming training or inference requests while minimizing energy consumption and meeting performance requirements. It uses two neural networks: the first provides initial estimates of how well a new model will utilize different hardware types and how it will affect co-located models. An optimizer then allocates resources based on these estimates. After deployment, the system monitors real performance and uses this data to refine its predictions via a second neural network. This updated model improves estimates not only for the current hardware but also for hardware not initially allocated and for co-location scenarios not yet observed. The result is an adaptive, iterative approach that learns over time to make more effective resource allocation decisions in heterogeneous deep learning clusters.
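The estimate, allocate, monitor, refine loop described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the two neural networks are replaced by a plain lookup table and a simple update rule, the correlation table stands in for whatever the paper's second network learns, and the GPU type names, throughput figures, and power figures are hypothetical. It is not the authors' implementation.

```python
# Illustrative sketch only: Estimator and allocate() are simple stand-ins for the
# two neural networks and the optimizer described in the abstract, not the paper's method.
from dataclasses import dataclass, field


@dataclass
class Estimator:
    # Estimated throughput of one job per GPU type (hypothetical units).
    throughput: dict
    # Assumed correlation between GPU types, keyed by (observed_type, other_type).
    correlation: dict = field(default_factory=dict)

    def refine(self, gpu_type, observed, rate=0.5):
        """Move the estimate for gpu_type toward the observation and shift the
        estimates for other GPU types in proportion to their assumed correlation."""
        old = self.throughput[gpu_type]
        ratio = observed / old
        self.throughput[gpu_type] = old + rate * (observed - old)
        for other in self.throughput:
            if other == gpu_type:
                continue
            corr = self.correlation.get((gpu_type, other), 0.0)
            self.throughput[other] *= 1 + rate * corr * (ratio - 1)


def allocate(estimator, power_watts, min_throughput):
    """Return the lowest-power GPU type whose estimated throughput meets the
    requirement, or None if no type is estimated to be fast enough."""
    feasible = [g for g, t in estimator.throughput.items() if t >= min_throughput]
    if not feasible:
        return None
    return min(feasible, key=lambda g: power_watts[g])


# Hypothetical cluster with two GPU generations.
est = Estimator(
    throughput={"old_gpu": 100.0, "new_gpu": 200.0},
    correlation={("old_gpu", "new_gpu"): 0.8},
)
power = {"old_gpu": 150.0, "new_gpu": 300.0}

first = allocate(est, power, min_throughput=90.0)   # picks "old_gpu" (cheapest feasible)
est.refine("old_gpu", observed=60.0)                # monitoring shows it is slower than estimated
second = allocate(est, power, min_throughput=90.0)  # now picks "new_gpu"
```

In this toy run, a single observation on the older GPU type also lowers the estimate for the newer type it was never deployed on, which mirrors the abstract's point that refined predictions extend to hardware not initially allocated.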

Related papers

REDS: Resource-Efficient Deep Subnetworks for Dynamic Resource Constraints [2.9209462960232235]
State-of-the-art machine learning pipelines generate resource-agnostic models that cannot adapt at runtime. We introduce Resource-Efficient Deep Subnetworks (REDS) to tackle model adaptation to variable resources.
arXiv Detail & Related papers (2023-11-22T12:34:51Z)
Tackling Computational Heterogeneity in FL: A Few Theoretical Insights [68.8204255655161]
We introduce and analyse a novel aggregation framework that allows for formalizing and tackling computational heterogeneous data. The proposed aggregation algorithms are extensively analyzed from both a theoretical and an experimental perspective.
arXiv Detail & Related papers (2023-07-12T16:28:21Z)
DISTREAL: Distributed Resource-Aware Learning in Heterogeneous Systems [2.1506382989223782]
We study the problem of distributed training of neural networks (NNs) on devices with heterogeneous, limited, and time-varying availability of computational resources. We present an adaptive, resource-aware, on-device learning mechanism, DISTREAL, which is able to fully and efficiently utilize the available resources.
arXiv Detail & Related papers (2021-12-16T10:15:31Z)
A New Clustering-Based Technique for the Acceleration of Deep Convolutional Networks [2.7393821783237184]
Model Compression and Acceleration (MCA) techniques are used to transform large pre-trained networks into smaller models. We propose a clustering-based approach that is able to increase the number of employed centroids/representatives. This is achieved by imposing a special structure to the employed representatives, which is enabled by the particularities of the problem at hand.
arXiv Detail & Related papers (2021-07-19T18:22:07Z)
Learning to Continuously Optimize Wireless Resource in a Dynamic Environment: A Bilevel Optimization Perspective [52.497514255040514]
This work develops a new approach that enables data-driven methods to continuously learn and optimize resource allocation strategies in a dynamic environment. We propose to build the notion of continual learning into wireless system design, so that the learning model can incrementally adapt to the new episodes. Our design is based on a novel bilevel optimization formulation which ensures certain "fairness" across different data samples.
arXiv Detail & Related papers (2021-05-03T07:23:39Z)
Phase Retrieval using Expectation Consistent Signal Recovery Algorithm based on Hypernetwork [73.94896986868146]
Phase retrieval is an important component in modern computational imaging systems. Recent advances in deep learning have opened up a new possibility for robust and fast PR. We develop a novel framework for deep unfolding to overcome the existing limitations.
arXiv Detail & Related papers (2021-01-12T08:36:23Z)
Solving Mixed Integer Programs Using Neural Networks [57.683491412480635]
This paper applies learning to the two key sub-tasks of a MIP solver, generating a high-quality joint variable assignment, and bounding the gap in objective value between that assignment and an optimal one. Our approach constructs two corresponding neural network-based components, Neural Diving and Neural Branching, to use in a base MIP solver such as SCIP. We evaluate our approach on six diverse real-world datasets, including two Google production datasets and MIPLIB, by training separate neural networks on each.
arXiv Detail & Related papers (2020-12-23T09:33:11Z)
Learning Centric Wireless Resource Allocation for Edge Computing: Algorithm and Experiment [15.577056429740951]
Edge intelligence is an emerging network architecture that integrates sensing, communication, computing components, and supports various machine learning applications. Existing methods ignore two important facts: 1) different models have heterogeneous demands on training data; 2) there is a mismatch between the simulated environment and the real-world environment. This paper proposes the learning centric wireless resource allocation scheme that maximizes the worst learning performance of multiple tasks.
arXiv Detail & Related papers (2020-10-29T06:20:40Z)
Spiking Neural Networks Hardware Implementations and Challenges: a Survey [53.429871539789445]
Spiking Neural Networks are cognitive algorithms mimicking neuron and synapse operational principles. We present the state of the art of hardware implementations of spiking neural networks. We discuss the strategies employed to leverage the characteristics of these event-driven algorithms at the hardware level.
arXiv Detail & Related papers (2020-05-04T13:24:00Z)
Large-Scale Gradient-Free Deep Learning with Recursive Local Representation Alignment [84.57874289554839]
Training deep neural networks on large-scale datasets requires significant hardware resources. Backpropagation, the workhorse for training these networks, is an inherently sequential process that is difficult to parallelize. We propose a neuro-biologically-plausible alternative to backprop that can be used to train deep networks.
arXiv Detail & Related papers (2020-02-10T16:20:02Z)
Resource-Efficient Neural Networks for Embedded Systems [23.532396005466627]
We provide an overview of the current state of the art of machine learning techniques. We focus on resource-efficient inference based on deep neural networks (DNNs), the predominant machine learning models of the past decade. We substantiate our discussion with experiments on well-known benchmark data sets using compression techniques.
arXiv Detail & Related papers (2020-01-07T14:17:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.