A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of
Multifidelity HPC Systems
- URL: http://arxiv.org/abs/2306.09457v1
- Date: Thu, 15 Jun 2023 19:23:50 GMT
- Title: A Multi-Level, Multi-Scale Visual Analytics Approach to Assessment of
Multifidelity HPC Systems
- Authors: Shilpika, Bethany Lusch, Murali Emani, Filippo Simini, Venkatram
Vishwanath, Michael E. Papka, and Kwan-Liu Ma
- Abstract summary: Monitoring and interpreting hardware system events and behaviors is crucial to improving the robustness and reliability of supercomputing systems.
In this work, we aim to build a holistic analytical system that helps make sense of such massive data.
This end-to-end log analysis system, coupled with visual analytics support, allows users to glean and promptly extract supercomputer usage and error patterns.
- Score: 17.246865176910045
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The ability to monitor and interpret hardware system events and behaviors
is crucial to improving the robustness and reliability of these systems,
especially in a supercomputing facility. The growing complexity and scale of
these systems demand an increase in monitoring data collected at multiple
fidelity levels and varying temporal resolutions. In this work, we aim to build
a holistic analytical system that helps make sense of such massive data, mainly
the hardware logs, job logs, and environment logs collected from disparate
subsystems and components of a supercomputer system. This end-to-end log
analysis system, coupled with visual analytics support, allows users to glean
and promptly extract supercomputer usage and error patterns at varying temporal
and spatial resolutions. We use multiresolution dynamic mode decomposition
(mrDMD), a technique that depicts high-dimensional data as correlated
spatial-temporal variation patterns or modes, to extract variation patterns
isolated at specified frequencies. Our improvements to the mrDMD algorithm help
promptly reveal useful information in the massive environment log dataset,
which is then associated with the processed hardware and job log datasets using
our visual analytics system. Furthermore, our system can identify the usage and
error patterns filtered at user, project, and subcomponent levels. We exemplify
the effectiveness of our approach with two use scenarios with the Cray XC40
supercomputer.
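To make the mode-extraction idea concrete, the following is a minimal sketch of standard (single-level) dynamic mode decomposition, the building block that mrDMD applies recursively over time windows. It is not the authors' improved algorithm. The function names, the synthetic sensor data, and the chosen rank are all assumptions made for illustration.

```python
import numpy as np


def dmd(X, rank):
    """Standard exact DMD of a snapshot matrix X (sensors x time steps).

    Illustrative sketch only; not the improved mrDMD from the paper.
    """
    X1, X2 = X[:, :-1], X[:, 1:]                  # time-shifted snapshot pairs
    U, s, Vh = np.linalg.svd(X1, full_matrices=False)
    U, s, Vh = U[:, :rank], s[:rank], Vh[:rank]   # truncate to the chosen rank
    # Reduced linear operator: U^H X2 V S^{-1} (dividing by s scales columns)
    A_tilde = U.conj().T @ X2 @ Vh.conj().T / s
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X2 @ Vh.conj().T / s @ W              # spatial modes
    return eigvals, modes


def mode_frequencies(eigvals, dt):
    """Oscillation frequency (cycles per time unit) of each DMD mode."""
    return np.abs(np.angle(eigvals)) / (2 * np.pi * dt)


def make_demo_data(n_sensors=16, n_steps=1000, dt=0.01):
    """Hypothetical data: two travelling waves at 1 and 4 cycles per time unit."""
    t = np.arange(n_steps) * dt
    s = np.linspace(0.0, 1.0, n_sensors)[:, None]
    X = np.sin(2 * np.pi * (s - 1.0 * t)) + 0.5 * np.sin(2 * np.pi * (3 * s - 4.0 * t))
    return X, dt


if __name__ == "__main__":
    X, dt = make_demo_data()
    eigvals, modes = dmd(X, rank=4)   # each real oscillation needs a conjugate pair
    print(np.sort(mode_frequencies(eigvals, dt)))
```

Each eigenvalue encodes the frequency of one spatial mode, which is what allows variation patterns to be isolated at specified frequencies. A multiresolution variant would, under the usual formulation, keep the slowest modes at each level, subtract them, and repeat on shorter time windows.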
Related papers
- Enabling High Data Throughput Reinforcement Learning on GPUs: A Domain Agnostic Framework for Data-Driven Scientific Research [90.91438597133211]
We introduce WarpSci, a framework designed to overcome crucial system bottlenecks in the application of reinforcement learning.
We eliminate the need for data transfer between the CPU and GPU, enabling the concurrent execution of thousands of simulations.
arXiv Detail & Related papers (2024-08-01T21:38:09Z)
- Ensemble Method for System Failure Detection Using Large-Scale Telemetry Data [0.0]
This research paper presents an in-depth analysis of extensive system telemetry data, proposing an ensemble methodology for detecting system failures.
The proposed ensemble technique integrates a diverse set of algorithms, including Long Short-Term Memory (LSTM) networks, isolation forests, one-class support vector machines (OCSVM), and local outlier factors (LOF).
Experimental evaluations demonstrate the remarkable efficacy of our models, achieving a notable detection rate in identifying system failures.
arXiv Detail & Related papers (2024-06-07T06:35:17Z)
- Random resistive memory-based deep extreme point learning machine for unified visual processing [67.51600474104171]
We propose a novel hardware-software co-design, random resistive memory-based deep extreme point learning machine (DEPLM).
Our co-design system achieves huge energy efficiency improvements and training cost reduction when compared to conventional systems.
arXiv Detail & Related papers (2023-12-14T09:46:16Z)
- GLAD: Content-aware Dynamic Graphs For Log Anomaly Detection [49.9884374409624]
GLAD is a Graph-based Log Anomaly Detection framework designed to detect anomalies in system logs.
arXiv Detail & Related papers (2023-09-12T04:21:30Z)
- InVAErt networks: a data-driven framework for model synthesis and identifiability analysis [0.0]
inVAErt is a framework for data-driven analysis and synthesis of physical systems.
It uses a deterministic decoder to represent the forward and inverse maps, a normalizing flow to capture the probabilistic distribution of system outputs, and a variational encoder to learn a compact latent representation for the lack of bijectivity between inputs and outputs.
arXiv Detail & Related papers (2023-07-24T07:58:18Z)
- A Hierarchical Approach to Conditional Random Fields for System Anomaly Detection [0.8164433158925593]
Anomaly detection to recognize unusual events in large scale systems is critical in many industries.
A hierarchical approach takes advantage of the implicit relationships in complex systems and localized context.
arXiv Detail & Related papers (2022-10-26T21:02:47Z)
- Lightweight Automated Feature Monitoring for Data Streams [1.4658400971135652]
We propose a flexible system, Feature Monitoring (FM), that detects data drifts in such data sets.
It monitors all features that are used by the system, while providing an interpretable features ranking whenever an alarm occurs.
This illustrates how FM eliminates the need to add custom signals to detect specific types of problems and that monitoring the available space of features is often enough.
arXiv Detail & Related papers (2022-07-18T14:38:11Z)
- Dynamic Network-Assisted D2D-Aided Coded Distributed Learning [59.29409589861241]
We propose a novel device-to-device (D2D)-aided coded federated learning method (D2D-CFL) for load balancing across devices.
We derive an optimal compression rate for achieving minimum processing time and establish its connection with the convergence time.
Our proposed method is beneficial for real-time collaborative applications, where the users continuously generate training data.
arXiv Detail & Related papers (2021-11-26T18:44:59Z)
- DeepTimeAnomalyViz: A Tool for Visualizing and Post-processing Deep Learning Anomaly Detection Results for Industrial Time-Series [88.12892448747291]
We introduce the DeTAVIZ interface, which is a web browser based visualization tool for quick exploration and assessment of feasibility of DL based anomaly detection in a given problem.
DeTAVIZ allows the user to easily and quickly iterate through multiple post processing options and compare different models, and allows for manual optimisation towards a chosen metric.
arXiv Detail & Related papers (2021-09-21T10:38:26Z)
- RGB-D Railway Platform Monitoring and Scene Understanding for Enhanced Passenger Safety [3.4298729855744026]
This paper proposes a flexible analysis scheme to detect and track humans on a ground plane.
We consider multiple combinations within a set of RGB- and depth-based detection and tracking modalities.
Results indicate that the combined use of depth-based spatial information and learned representations yields substantially enhanced detection and tracking accuracies.
arXiv Detail & Related papers (2021-02-23T14:44:34Z)
- Deep Cellular Recurrent Network for Efficient Analysis of Time-Series Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information.
The proposed architecture achieves state-of-the-art performance while using substantially fewer trainable parameters than comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z)