Online Job Failure Prediction in an HPC System
- URL: http://arxiv.org/abs/2308.15481v1
- Date: Fri, 30 Jun 2023 07:40:59 GMT
- Title: Online Job Failure Prediction in an HPC System
- Authors: Francesco Antici, Andrea Borghesi, and Zeynep Kiziltan
- Abstract summary: The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy.
Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption.
Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Modern High Performance Computing (HPC) systems are complex machines, with
major impacts on economy and society. Along with their computational
capability, their energy consumption is also steadily rising, representing a
critical issue given the ongoing environmental and energy crisis. Therefore,
developing strategies to optimize HPC system management is of paramount
importance, both to guarantee top-tier performance and to improve energy
efficiency. One strategy is to act at the workload level and highlight the jobs
that are most likely to fail, prior to their execution on the system. Jobs
failing during their execution unnecessarily occupy resources which could delay
other jobs, adversely affecting the system performance and energy consumption.
In this paper, we study job failure prediction at submit-time using classical
machine learning algorithms. Our novelty lies in (i) the combination of these
algorithms with Natural Language Processing (NLP) tools to represent jobs and
(ii) the design of the approach to work in an online fashion in a real system.
The study is based on a dataset extracted from a production machine hosted at
the HPC centre CINECA in Italy. Experimental results show that our approach is
promising.
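The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes that submit-time job fields can be flattened into a text string, uses a hashed bag-of-words vector as a stand-in for an NLP encoder, and uses a k-nearest-neighbour vote over a sliding window of finished jobs as a stand-in for a classical classifier that is refreshed online. All names here (`job_to_text`, `embed`, `OnlineFailurePredictor`, the field names `user`, `queue`, `cmd`) are hypothetical.

```python
import hashlib
import math
from collections import deque

DIM = 256  # size of the hashed feature vector (arbitrary choice for this sketch)


def job_to_text(job):
    """Flatten submit-time fields into one string, e.g. 'cmd=python queue=gpu'."""
    return " ".join(f"{key}={job[key]}" for key in sorted(job))


def embed(text, dim=DIM):
    """Hashed bag-of-words vector, L2-normalised (stand-in for an NLP encoder)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class OnlineFailurePredictor:
    """k-nearest-neighbour vote over a sliding window of finished jobs."""

    def __init__(self, k=3, window=1000):
        self.k = k
        # Only the most recent `window` finished jobs are kept, so the
        # predictor follows changes in the workload over time.
        self.history = deque(maxlen=window)  # (vector, failed) pairs

    def predict(self, job):
        """Return True if the job is predicted to fail (False with no history)."""
        if not self.history:
            return False
        query = embed(job_to_text(job))
        # Cosine similarity reduces to a dot product on normalised vectors.
        scored = sorted(
            ((sum(a * b for a, b in zip(query, vec)), failed)
             for vec, failed in self.history),
            key=lambda pair: pair[0],
            reverse=True,
        )
        top = scored[: self.k]
        votes = sum(1 for _, failed in top if failed)
        return votes * 2 > len(top)  # strict majority of the neighbours failed

    def update(self, job, failed):
        """Record the observed outcome once the job has finished."""
        self.history.append((embed(job_to_text(job)), bool(failed)))
```

In use, `predict` would be called when a job is submitted and `update` when the job ends, so that each prediction relies only on outcomes already observed at submit time.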
Related papers
- Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy.
This paper addresses the challenge of reducing energy consumption in data centers running LLMs.
We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
arXiv Detail & Related papers (2024-04-25T11:24:08Z)
- Solving Boltzmann Optimization Problems with Deep Learning [0.21485350418225244]
The Ising model shows particular promise as a future framework for highly energy efficient computation.
Ising systems are able to operate at energies approaching thermodynamic limits for energy consumption of computation.
The challenge in creating Ising-based hardware is in optimizing useful circuits that produce correct results on fundamentally nondeterministic hardware.
arXiv Detail & Related papers (2024-01-30T19:52:02Z)
- Computation-efficient Deep Learning for Computer Vision: A Survey [121.84121397440337]
Deep learning models have reached or even exceeded human-level performance in a range of visual perception tasks.
Deep learning models usually demand significant computational resources, leading to impractical power consumption, latency, or carbon emissions in real-world scenarios.
A new research focus is computationally efficient deep learning, which strives to achieve satisfactory performance while minimizing the computational cost during inference.
arXiv Detail & Related papers (2023-08-27T03:55:28Z)
- A Comparative Study of Machine Learning Algorithms for Anomaly Detection in Industrial Environments: Performance and Environmental Impact [62.997667081978825]
This study seeks to address the demands of high-performance machine learning models with environmental sustainability.
Traditional machine learning algorithms, such as Decision Trees and Random Forests, demonstrate robust efficiency and performance.
However, superior outcomes were obtained with optimised configurations, albeit with a commensurate increase in resource consumption.
arXiv Detail & Related papers (2023-07-01T15:18:00Z)
- Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption.
Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy.
We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
- Planning for Sample Efficient Imitation Learning [52.44953015011569]
Current imitation algorithms struggle to achieve high performance and high in-environment sample efficiency simultaneously.
We propose EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously.
Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency.
arXiv Detail & Related papers (2022-10-18T05:19:26Z)
- Optimization paper production through digitalization by developing an assistance system for machine operators including quality forecast: a concept [50.591267188664666]
The production of paper from waste paper is still a highly resource intensive task, especially in terms of energy consumption.
We have identified a lack of utilization of it and implement a concept using an operator assistance system and state-of-the-art machine learning techniques.
Our main objective is to provide situation-specific knowledge to machine operators utilizing available data.
arXiv Detail & Related papers (2022-06-23T09:54:35Z)
- Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network Accelerator [3.224364382976958]
This work takes a unique look at sparsity with an event (or activation-driven) approach to ANN acceleration.
Our analytical and experimental results show that this event-driven solution presents a new direction to enable highly efficient AI inference for both CNN and workloads.
arXiv Detail & Related papers (2022-04-20T21:56:50Z)
- AI Chiller: An Open IoT Cloud Based Machine Learning Framework for the Energy Saving of Building HVAC System via Big Data Analytics on the Fusion of BMS and Environmental Data [12.681421165031576]
Energy saving and carbon emission reduction in buildings is one of the key measures in combating climate change.
The optimization of chiller system power consumption has been extensively studied in the mechanical engineering and building service domains.
With the advance of big data and AI, the adoption of machine learning for these optimization problems has become popular.
arXiv Detail & Related papers (2020-10-09T09:51:03Z)
- Risk-Aware Energy Scheduling for Edge Computing with Microgrid: A Multi-Agent Deep Reinforcement Learning Approach [82.6692222294594]
We study a risk-aware energy scheduling problem for a microgrid-powered MEC network.
We derive the solution by applying a multi-agent deep reinforcement learning (MADRL)-based advantage actor-critic (A3C) algorithm with shared neural networks.
arXiv Detail & Related papers (2020-02-21T02:14:38Z)