Acela: Predictable Datacenter-level Maintenance Job Scheduling
- URL: http://arxiv.org/abs/2212.05155v1
- Date: Sat, 10 Dec 2022 00:22:49 GMT
- Title: Acela: Predictable Datacenter-level Maintenance Job Scheduling
- Authors: Yi Ding, Aijia Gao, Thibaud Ryden, Kaushik Mitra, Sukumar Kalmanje,
Yanai Golany, Michael Carbin, Henry Hoffmann
- Abstract summary: We present Acela, a machine learning system for predicting maintenance job duration.
We show that Acela reduces the number of servers that are taken offline by 1.87-4.28X, and reduces the server offline time by 1.40-2.80X.
- Score: 27.990173338574138
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Datacenter operators ensure fair and regular server maintenance by using
automated processes to schedule maintenance jobs to complete within a strict
time budget. Automating this scheduling problem is challenging because
maintenance job duration varies based on both job type and hardware. While it
is tempting to use prior machine learning techniques for predicting job
duration, we find that the structure of the maintenance job scheduling problem
creates a unique challenge. In particular, we show that prior machine learning
methods that produce the lowest error predictions do not produce the best
scheduling outcomes due to asymmetric costs. Specifically, underpredicting
maintenance job duration results in more servers being taken offline and
longer server downtime than overpredicting maintenance job duration. The system
cost of underprediction is much larger than that of overprediction.
We present Acela, a machine learning system for predicting maintenance job
duration, which uses quantile regression to bias duration predictions toward
overprediction. We integrate Acela into a maintenance job scheduler and
evaluate it on datasets from large-scale, production datacenters. Compared to
machine learning based predictors from prior work, Acela reduces the number of
servers that are taken offline by 1.87-4.28X, and reduces the server offline
time by 1.40-2.80X.
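
The asymmetry described above can be illustrated with the quantile (pinball) loss that underlies quantile regression. The following is a minimal sketch, not Acela's implementation: the abstract does not give the model, features, or quantile level, so the `durations` data and the quantile levels 0.5 and 0.9 are illustrative assumptions. It shows that, for a quantile level above 0.5, the loss penalizes underprediction more heavily, so the best constant prediction shifts upward.

```python
def pinball_loss(y_true, y_pred, tau):
    """Average quantile (pinball) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # diff > 0: the job ran longer than predicted (underprediction), weight tau.
        # diff < 0: the job finished early (overprediction), weight (1 - tau).
        total += tau * diff if diff >= 0 else (tau - 1.0) * diff
    return total / len(y_true)


def best_constant(y_true, tau):
    """Return the observed value that minimizes pinball loss as a constant prediction."""
    candidates = sorted(set(y_true))
    return min(candidates, key=lambda c: pinball_loss(y_true, [c] * len(y_true), tau))


# Hypothetical job durations in minutes (not from the paper): mostly short, two long outliers.
durations = [10, 12, 11, 13, 12, 30, 11, 45, 12, 14, 15]

median_pred = best_constant(durations, 0.5)  # symmetric loss, lands on the median
high_pred = best_constant(durations, 0.9)    # asymmetric loss, biased toward overprediction

# Count the jobs whose actual duration exceeds each prediction.
under_median = sum(1 for d in durations if d > median_pred)
under_high = sum(1 for d in durations if d > high_pred)

print(median_pred, under_median)
print(high_pred, under_high)
```

In this toy data the higher quantile level leaves far fewer jobs underpredicted, at the price of reserving more time for the typical job. A scheduler working within a time budget would have to weigh that trade-off.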
Related papers
- SkipPredict: When to Invest in Predictions for Scheduling [10.895221249490984]
We introduce a novel approach to utilizing predictions, SkipPredict, designed to address their inherent cost.
To achieve this, we employ one-bit "cheap predictions" to classify jobs as either short or long.
We examine the effect of this cost for two distinct models.
arXiv Detail & Related papers (2024-02-05T22:24:19Z)
- TranDRL: A Transformer-Driven Deep Reinforcement Learning Enabled Prescriptive Maintenance Framework [58.474610046294856]
Industrial systems demand reliable predictive maintenance strategies to enhance operational efficiency and reduce downtime.
This paper introduces an integrated framework that leverages Transformer-based neural networks and deep reinforcement learning (DRL) algorithms to optimize system maintenance actions.
arXiv Detail & Related papers (2023-09-29T02:27:54Z)
- Learning While Scheduling in Multi-Server Systems with Unknown Statistics: MaxWeight with Discounted UCB [18.898514227870926]
This paper considers a system with multiple servers and multiple types of jobs, where different job types require different amounts of processing time at different servers.
The goal is to schedule jobs on servers without knowing the statistics of the processing times.
We propose a new algorithm, which combines the MaxWeight scheduling policy with discounted upper confidence bound (UCB) to simultaneously learn statistics and schedule jobs to servers.
arXiv Detail & Related papers (2022-09-02T15:37:02Z)
- Human-in-the-Loop Large-Scale Predictive Maintenance of Workstations [89.51621054382878]
Predictive maintenance (PdM) is the task of scheduling maintenance operations based on a statistical analysis of the system's condition.
We propose a human-in-the-loop PdM approach in which a machine learning system predicts future problems in sets of workstations.
arXiv Detail & Related papers (2022-06-23T09:40:46Z)
- Prescriptive maintenance with causal machine learning [4.169130102668252]
We learn the effect of maintenance conditional on a machine's characteristics from observational data on similar machines.
We validate our proposed approach using real-life data on more than 4,000 maintenance contracts from an industrial partner.
arXiv Detail & Related papers (2022-06-03T13:35:57Z)
- Predictive Maintenance using Machine Learning [0.0]
Predictive maintenance (PdM) is implemented to effectively manage maintenance plans of the assets.
Data is collected over a certain period of time to monitor the state of equipment.
arXiv Detail & Related papers (2022-05-19T09:05:37Z)
- Non-Clairvoyant Scheduling with Predictions Revisited [77.86290991564829]
In non-clairvoyant scheduling, the task is to find an online strategy for scheduling jobs with a priori unknown processing requirements.
We revisit this well-studied problem in a recently popular learning-augmented setting that integrates (untrusted) predictions in algorithm design.
We show that these predictions have desired properties, admit a natural error measure as well as algorithms with strong performance guarantees.
arXiv Detail & Related papers (2022-02-21T13:18:11Z)
- Automated Machine Learning Techniques for Data Streams [91.3755431537592]
This paper surveys the state-of-the-art open-source AutoML tools, applies them to data collected from streams, and measures how their performance changes over time.
The results show that off-the-shelf AutoML tools can provide satisfactory results but in the presence of concept drift, detection or adaptation techniques have to be applied to maintain the predictive accuracy over time.
arXiv Detail & Related papers (2021-06-14T11:42:46Z)
- A Predictive Autoscaler for Elastic Batch Jobs [8.354712625979776]
Large batch jobs such as Deep Learning, HPC and Spark require far more computational resources and incur higher costs than conventional online services.
We propose a predictive autoscaler to provide an elastic interface for the customers and overprovision instances.
arXiv Detail & Related papers (2020-10-10T17:35:55Z)
- Temporally Correlated Task Scheduling for Sequence Learning [143.70523777803723]
In many applications, a sequence learning task is usually associated with multiple temporally correlated auxiliary tasks.
We introduce a learnable scheduler to sequence learning, which can adaptively select auxiliary tasks for training.
Our method significantly improves the performance of simultaneous machine translation and stock trend forecasting.
arXiv Detail & Related papers (2020-07-10T10:28:54Z)
- Predictive Maintenance for Edge-Based Sensor Networks: A Deep Reinforcement Learning Approach [68.40429597811071]
The risk of unplanned equipment downtime can be minimized through Predictive Maintenance of revenue generating assets.
A model-free Deep Reinforcement Learning algorithm is proposed for predictive equipment maintenance from an equipment-based sensor network context.
Unlike traditional black-box regression models, the proposed algorithm self-learns an optimal maintenance policy and provides an actionable recommendation for each piece of equipment.
arXiv Detail & Related papers (2020-07-07T10:00:32Z)