Analytics of Longitudinal System Monitoring Data for Performance Prediction
- URL: http://arxiv.org/abs/2007.03451v2
- Date: Tue, 2 Jul 2024 17:02:59 GMT
- Title: Analytics of Longitudinal System Monitoring Data for Performance Prediction
- Authors: Ian J. Costello, Abhinav Bhatele,
- Abstract summary: We create data-driven models that can predict the performance of jobs waiting in the scheduler queue.
We analyze these prediction models in great detail to identify the features that are dominant predictors of performance.
We demonstrate that such models can be application-agnostic and can be used for predicting performance of applications that are not included in training.
- Score: 0.7004662712880577
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In recent years, several HPC facilities have started continuous monitoring of their systems and jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and the overall system by creating data-driven models that can predict the performance of jobs waiting in the scheduler queue. In this paper, we model the performance of representative control jobs using longitudinal system-wide monitoring data and machine learning to explore the causes of performance variability. We analyze these prediction models in great detail to identify the features that are dominant predictors of performance. We demonstrate that such models can be application-agnostic and can be used for predicting performance of applications that are not included in training.
Related papers
- Tracing Optimization for Performance Modeling and Regression Detection [15.99435412859094]
A performance model analytically describes the relationship between the performance of a system and its runtime activities.
We propose statistical approaches to reduce tracing overhead by identifying and excluding performance-insensitive code regions.
Our approach is fully automated, making it ready to be used in production environments with minimal human effort.
arXiv Detail & Related papers (2024-11-26T16:11:55Z) - Application Research On Real-Time Perception Of Device Performance Status [9.145804504353125]
The ability of performance features and profiles to describe the real-time performance status of devices was studied.
The accuracy of device performance status identification and prediction was compared with the performance of the profile features.
arXiv Detail & Related papers (2024-09-05T03:32:39Z) - Explanatory Model Monitoring to Understand the Effects of Feature Shifts on Performance [61.06245197347139]
We propose a novel approach to explain the behavior of a black-box model under feature shifts.
We refer to our method that combines concepts from Optimal Transport and Shapley Values as Explanatory Performance Estimation.
arXiv Detail & Related papers (2024-08-24T18:28:19Z) - Third-Party Language Model Performance Prediction from Instruction [59.574169249307054]
Language model-based instruction-following systems have lately shown increasing performance on many benchmark tasks.
A user may easily prompt a model with an instruction without any idea of whether the responses should be expected to be accurate.
We propose a third party performance prediction framework, where a separate model is trained to predict the metric resulting from evaluating an instruction-following system on a task.
arXiv Detail & Related papers (2024-03-19T03:53:47Z) - Assessing the Generalizability of a Performance Predictive Model [0.6070952062639761]
We propose a workflow to estimate the generalizability of a predictive model for algorithm performance.
The results show that generalizability patterns in the landscape feature space are reflected in the performance space.
arXiv Detail & Related papers (2023-05-31T12:50:44Z) - Improved Fine-tuning by Leveraging Pre-training Data: Theory and
Practice [52.11183787786718]
Fine-tuning a pre-trained model on the target data is widely used in many deep learning applications.
Recent studies have empirically shown that training from scratch has the final performance that is no worse than this pre-training strategy.
We propose a novel selection strategy to select a subset from pre-training data to help improve the generalization on the target task.
arXiv Detail & Related papers (2021-11-24T06:18:32Z) - AI Total: Analyzing Security ML Models with Imperfect Data in Production [2.629585075202626]
Development of new machine learning models is typically done on manually curated data sets.
We develop a web-based visualization system that allows the users to quickly gather headline performance numbers.
It also enables the users to immediately observe the root cause of an issue when something goes wrong.
arXiv Detail & Related papers (2021-10-13T20:56:05Z) - Towards More Fine-grained and Reliable NLP Performance Prediction [85.78131503006193]
We make two contributions to improving performance prediction for NLP tasks.
First, we examine performance predictors for holistic measures of accuracy like F1 or BLEU.
Second, we propose methods to understand the reliability of a performance prediction model from two angles: confidence intervals and calibration.
arXiv Detail & Related papers (2021-02-10T15:23:20Z) - Monitoring and explainability of models in production [58.720142291102135]
Monitoring deployed models is crucial for continued provision of high quality machine learning enabled services.
We discuss the challenges to successful implementation of solutions in each of these areas with some recent examples of production ready solutions using open source tools.
arXiv Detail & Related papers (2020-07-13T10:37:05Z) - How Training Data Impacts Performance in Learning-based Control [67.7875109298865]
This paper derives an analytical relationship between the density of the training data and the control performance.
We formulate a quality measure for the data set, which we refer to as $rho$-gap.
We show how the $rho$-gap can be applied to a feedback linearizing control law.
arXiv Detail & Related papers (2020-05-25T12:13:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.