Related papers: Online Job Failure Prediction in an HPC System

Online Job Failure Prediction in an HPC System

URL: http://arxiv.org/abs/2308.15481v1
Date: Fri, 30 Jun 2023 07:40:59 GMT
Title: Online Job Failure Prediction in an HPC System
Authors: Francesco Antici, Andrea Borghesi, and Zeynep Kiziltan
Abstract summary: The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system.
Score: 2.2284709230738544
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Modern High Performance Computing (HPC) systems are complex machines, with major impacts on economy and society. Along with their computational capability, their energy consumption is also steadily raising, representing a critical issue given the ongoing environmental and energetic crisis. Therefore, developing strategies to optimize HPC system management has paramount importance, both to guarantee top-tier performance and to improve energy efficiency. One strategy is to act at the workload level and highlight the jobs that are most likely to fail, prior to their execution on the system. Jobs failing during their execution unnecessarily occupy resources which could delay other jobs, adversely affecting the system performance and energy consumption. In this paper, we study job failure prediction at submit-time using classical machine learning algorithms. Our novelty lies in (i) the combination of these algorithms with Natural Language Processing (NLP) tools to represent jobs and (ii) the design of the approach to work in an online fashion in a real system. The study is based on a dataset extracted from a production machine hosted at the HPC centre CINECA in Italy. Experimental results show that our approach is promising.

Related papers

Energy Considerations of Large Language Model Inference and Efficiency Optimizations [28.55549828393871]
As large language models (LLMs) scale in size and adoption, their computational and environmental costs continue to rise. We systematically analyze the energy implications of common inference efficiency optimizations across diverse NLP and AI workloads. Our findings reveal that the proper application of relevant inference efficiency optimizations can reduce total energy use by up to 73% from unoptimized baselines.
arXiv Detail & Related papers (2025-04-24T15:45:05Z)
ULTHO: Ultra-Lightweight yet Efficient Hyperparameter Optimization in Deep Reinforcement Learning [50.53705050673944]
We propose ULTHO, an ultra-lightweight yet powerful framework for fast HPO in deep RL within single runs. Specifically, we formulate the HPO process as a multi-armed bandit with clustered arms (MABC) and link it directly to long-term return optimization. We test ULTHO on benchmarks including ALE, Procgen, MiniGrid, and PyBullet.
arXiv Detail & Related papers (2025-03-08T07:03:43Z)
Quantifying Energy and Cost Benefits of Hybrid Edge Cloud: Analysis of Traditional and Agentic Workloads [0.0]
This paper examines the workload distribution challenges in centralized cloud systems. It demonstrates how Hybrid Edge Cloud (HEC) mitigates these inefficiencies. Our findings reveal that HEC achieves energy savings of up to 75% and cost reductions exceeding 80%.
arXiv Detail & Related papers (2025-01-21T15:26:43Z)
Hybrid Heterogeneous Clusters Can Lower the Energy Consumption of LLM Inference Workloads [0.2389598109913753]
Training and using Large Language Models (LLMs) require large amounts of energy. This paper addresses the challenge of reducing energy consumption in data centers running LLMs. We propose a hybrid data center model that uses a cost-based scheduling framework to dynamically allocate tasks across hardware accelerators.
arXiv Detail & Related papers (2024-04-25T11:24:08Z)
Solving Boltzmann Optimization Problems with Deep Learning [0.21485350418225244]
The Ising model shows particular promise as a future framework for highly energy efficient computation. Ising systems are able to operate at energies approaching thermodynamic limits for energy consumption of computation. The challenge in creating Ising-based hardware is in optimizing useful circuits that produce correct results on fundamentally nondeterministic hardware.
arXiv Detail & Related papers (2024-01-30T19:52:02Z)
Computation-efficient Deep Learning for Computer Vision: A Survey [121.84121397440337]
Deep learning models have reached or even exceeded human-level performance in a range of visual perception tasks. Deep learning models usually demand significant computational resources, leading to impractical power consumption, latency, or carbon emissions in real-world scenarios. New research focus is computationally efficient deep learning, which strives to achieve satisfactory performance while minimizing the computational cost during inference.
arXiv Detail & Related papers (2023-08-27T03:55:28Z)
A Comparative Study of Machine Learning Algorithms for Anomaly Detection in Industrial Environments: Performance and Environmental Impact [62.997667081978825]
This study seeks to address the demands of high-performance machine learning models with environmental sustainability. Traditional machine learning algorithms, such as Decision Trees and Random Forests, demonstrate robust efficiency and performance. However, superior outcomes were obtained with optimised configurations, albeit with a commensurate increase in resource consumption.
arXiv Detail & Related papers (2023-07-01T15:18:00Z)
Sustainable AIGC Workload Scheduling of Geo-Distributed Data Centers: A Multi-Agent Reinforcement Learning Approach [48.18355658448509]
Recent breakthroughs in generative artificial intelligence have triggered a surge in demand for machine learning training, which poses significant cost burdens and environmental challenges due to its substantial energy consumption. Scheduling training jobs among geographically distributed cloud data centers unveils the opportunity to optimize the usage of computing capacity powered by inexpensive and low-carbon energy. We propose an algorithm based on multi-agent reinforcement learning and actor-critic methods to learn the optimal collaborative scheduling strategy through interacting with a cloud system built with real-life workload patterns, energy prices, and carbon intensities.
arXiv Detail & Related papers (2023-04-17T02:12:30Z)
Planning for Sample Efficient Imitation Learning [52.44953015011569]
Current imitation algorithms struggle to achieve high performance and high in-environment sample efficiency simultaneously. We propose EfficientImitate, a planning-based imitation learning method that can achieve high in-environment sample efficiency and performance simultaneously. Experimental results show that EI achieves state-of-the-art results in performance and sample efficiency.
arXiv Detail & Related papers (2022-10-18T05:19:26Z)
Optimization paper production through digitalization by developing an assistance system for machine operators including quality forecast: a concept [50.591267188664666]
The production of paper from waste paper is still a highly resource intensive task, especially in terms of energy consumption. We have identified a lack of utilization of it and implement a concept using an operator assistance system and state-of-the-art machine learning techniques. Our main objective is to provide situation-specific knowledge to machine operators utilizing available data.
arXiv Detail & Related papers (2022-06-23T09:54:35Z)
Multiply-and-Fire (MNF): An Event-driven Sparse Neural Network Accelerator [3.224364382976958]
This work takes a unique look at sparsity with an event (or activation-driven) approach to ANN acceleration. Our analytical and experimental results show that this event-driven solution presents a new direction to enable highly efficient AI inference for both CNN and workloads.
arXiv Detail & Related papers (2022-04-20T21:56:50Z)
AI Chiller: An Open IoT Cloud Based Machine Learning Framework for the Energy Saving of Building HVAC System via Big Data Analytics on the Fusion of BMS and Environmental Data [12.681421165031576]
Energy saving and carbon emission reduction in buildings is one of the key measures in combating climate change. The optimization of chiller system power consumption had been extensively studied in the mechanical engineering and building service domains. With the advance of big data and AI, the adoption of machine learning into the optimization problems becomes popular.
arXiv Detail & Related papers (2020-10-09T09:51:03Z)
Risk-Aware Energy Scheduling for Edge Computing with Microgrid: A Multi-Agent Deep Reinforcement Learning Approach [82.6692222294594]
We study a risk-aware energy scheduling problem for a microgrid-powered MEC network. We derive the solution by applying a multi-agent deep reinforcement learning (MADRL)-based advantage actor-critic (A3C) algorithm with shared neural networks.
arXiv Detail & Related papers (2020-02-21T02:14:38Z)

This list is automatically generated from the titles and abstracts of the papers in this site.