Related papers: DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model

DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model

URL: http://arxiv.org/abs/2212.01302v1
Date: Fri, 2 Dec 2022 16:51:58 GMT
Title: DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep Surrogate Model
Authors: Shreshth Tuli and Giuliano Casale and Ludmila Cherkasova and Nicholas R. Jennings
Abstract summary: We propose DeepFT to proactively avoid system overloads and their adverse effects. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system. It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts.
Score: 12.335763358698564
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The emergence of latency-critical AI applications has been supported by the evolution of the edge computing paradigm. However, edge solutions are typically resource-constrained, posing reliability challenges due to heightened contention for compute and communication capacities and faulty application behavior in the presence of overload conditions. Although a large amount of generated log data can be mined for fault prediction, labeling this data for training is a manual process and thus a limiting factor for automation. Due to this, many companies resort to unsupervised fault-tolerance models. Yet, failure models of this kind can incur a loss of accuracy when they need to adapt to non-stationary workloads and diverse host characteristics. To cope with this, we propose a novel modeling approach, called DeepFT, to proactively avoid system overloads and their adverse effects by optimizing the task scheduling and migration decisions. DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system and co-simulation based self-supervised learning to dynamically adapt the model in volatile settings. It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts. Extensive experimentation on a Raspberry-Pi based edge cluster with DeFog benchmarks shows that DeepFT can outperform state-of-the-art baseline methods in fault-detection and QoS metrics. Specifically, DeepFT gives the highest F1 scores for fault-detection, reducing service deadline violations by up to 37\% while also improving response time by up to 9%.

Related papers

Adaptive Fault Tolerance Mechanisms of Large Language Models in Cloud Computing Environments [5.853391005435494]
This study proposes a novel adaptive fault tolerance mechanism to ensure the reliability and availability of large-scale language models in cloud computing scenarios. It builds upon known fault-tolerant mechanisms, such as checkpointing, redundancy, and state transposition, introducing dynamic resource allocation and prediction of failure based on real-time performance metrics.
arXiv Detail & Related papers (2025-03-15T18:45:33Z)
Model-Agnostic Knowledge Guided Correction for Improved Neural Surrogate Rollout [3.006104092368596]
We propose a model-agnostic, RL based, cost-aware model which combines a neural surrogate, RL decision model, and a physics simulator to reduce surrogate rollout error significantly. HyPER learns an intelligent policy that is adaptable to changing physical conditions and resistant to noise corruption.
arXiv Detail & Related papers (2025-03-13T05:00:23Z)
Towards Resource-Efficient Federated Learning in Industrial IoT for Multivariate Time Series Analysis [50.18156030818883]
Anomaly and missing data constitute a thorny problem in industrial applications. Deep learning enabled anomaly detection has emerged as a critical direction. The data collected in edge devices contain user privacy.
arXiv Detail & Related papers (2024-11-06T15:38:31Z)
Three-Stage Adjusted Regression Forecasting (TSARF) for Software Defect Prediction [5.826476252191368]
Nonhomogeneous Poisson process (NHPP) SRGM are the most commonly employed models. Increased model complexity presents a challenge in identifying robust and computationally efficient algorithms.
arXiv Detail & Related papers (2024-01-31T02:19:35Z)
EdgeFD: An Edge-Friendly Drift-Aware Fault Diagnosis System for Industrial IoT [0.0]
We propose the Drift-Aware Weight Consolidation (DAWC) to mitigate the challenges posed by frequent data drift in the industrial Internet of Things (IIoT) DAWC efficiently manages multiple data drift scenarios, minimizing the need for constant model fine-tuning on edge devices. We have also developed a comprehensive diagnosis and visualization platform.
arXiv Detail & Related papers (2023-10-07T06:48:07Z)
Learning Sample Difficulty from Pre-trained Models for Reliable Prediction [55.77136037458667]
We propose to utilize large-scale pre-trained models to guide downstream model training with sample difficulty-aware entropy regularization. We simultaneously improve accuracy and uncertainty calibration across challenging benchmarks.
arXiv Detail & Related papers (2023-04-20T07:29:23Z)
Deep Convolutional Architectures for Extrapolative Forecast in Time-dependent Flow Problems [0.0]
Deep learning techniques are employed to model the system dynamics for advection dominated problems. These models take as input a sequence of high-fidelity vector solutions for consecutive time-steps obtained from the PDEs. Non-intrusive reduced-order modelling techniques such as deep auto-encoder networks are utilized to compress the high-fidelity snapshots.
arXiv Detail & Related papers (2022-09-18T03:45:56Z)
Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Conal Neural Networks (CNNs) via an error simulation engine. These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults. We show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t.FI, that only implements a limited set of error models.
arXiv Detail & Related papers (2022-06-04T19:45:02Z)
On Efficient Uncertainty Estimation for Resource-Constrained Mobile Applications [0.0]
Predictive uncertainty supplements model predictions and enables improved functionality of downstream tasks. We tackle this problem by building upon Monte Carlo Dropout (MCDO) models using the Axolotl framework. We conduct experiments on (1) a multi-class classification task using the CIFAR10 dataset, and (2) a more complex human body segmentation task.
arXiv Detail & Related papers (2021-11-11T22:24:15Z)
Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC) We first construct multiple anomaly detection DNN models with increasing complexity, and associate each of them to a corresponding HEC layer. Then, we design an adaptive model selection scheme that is formulated as a contextual-bandit problem and solved by using a reinforcement learning policy network.
arXiv Detail & Related papers (2021-08-09T08:45:47Z)
Adaptive Subcarrier, Parameter, and Power Allocation for Partitioned Edge Learning Over Broadband Channels [69.18343801164741]
partitioned edge learning (PARTEL) implements parameter-server training, a well known distributed learning method, in wireless network. We consider the case of deep neural network (DNN) models which can be trained using PARTEL by introducing some auxiliary variables.
arXiv Detail & Related papers (2020-10-08T15:27:50Z)
Uncertainty Estimation Using a Single Deep Deterministic Neural Network [66.26231423824089]
We propose a method for training a deterministic deep model that can find and reject out of distribution data points at test time with a single forward pass. We scale training in these with a novel loss function and centroid updating scheme and match the accuracy of softmax models.
arXiv Detail & Related papers (2020-03-04T12:27:36Z)

This list is automatically generated from the titles and abstracts of the papers in this site.