ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Neural Network Processing
- URL: http://arxiv.org/abs/2302.10469v4
- Date: Tue, 11 Mar 2025 16:17:08 GMT
- Title: ApproxABFT: Approximate Algorithm-Based Fault Tolerance for Neural Network Processing
- Authors: Xinghua Xue, Cheng Liu, Feng Min, Tao Luo, Yinhe Han
- Abstract summary: We propose ApproxABFT, which initiates error recovery only when computational errors are significant. This approach avoids unnecessary recovery procedures, streamlines the error recovery process, and focuses on correcting impactful errors. Experimental results demonstrate that ApproxABFT reduces the computing overhead by 67.83% and improves the tolerable bit error rate by an order of magnitude on average.
- Score: 7.578258600530223
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the growing adoption of neural network models in safety-critical applications such as autonomous driving and robotics, reliability has become a critical metric alongside performance and energy efficiency. Algorithm-based fault tolerance (ABFT) strategies, designed atop standard chip architectures, are both cost-effective and adaptable to different architectures, making them particularly attractive. However, traditional ABFT relies on precise error metrics, triggering error recovery procedures even for minor computational deviations. These minor errors often do not impact the accuracy of the neural network model due to its inherent fault tolerance. To address this inefficiency, we propose an approximate ABFT approach, called ApproxABFT, which initiates error recovery only when computational errors are significant. This approach avoids unnecessary recovery procedures, streamlines the error recovery process, and focuses on correcting impactful errors, ultimately enhancing recovery quality. Additionally, ApproxABFT incorporates a fine-grained blocking strategy to smooth error sensitivity across layers within neural network models. Experimental results demonstrate that ApproxABFT reduces the computing overhead by 67.83% and improves the tolerable bit error rate by an order of magnitude on average compared to classical accurate ABFT.
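The contrast the abstract draws between accurate and approximate ABFT can be illustrated with a minimal sketch. It assumes a standard column-checksum test for a matrix product and a single scalar threshold; the function names (`column_checksum_gap`, `needs_recovery`), the threshold value, and the choice of a maximum absolute gap as the error metric are hypothetical and are not taken from the paper, whose actual metric and blocking strategy are not described on this page.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_checksum_gap(a, b, c):
    """Largest absolute gap between predicted and observed column sums of c.

    For a fault-free c = a @ b, the column sums of c equal
    (column sums of a) @ b, so the gap is zero.
    """
    a_colsum = [sum(col) for col in zip(*a)]
    expected = [sum(x * y for x, y in zip(a_colsum, col)) for col in zip(*b)]
    observed = [sum(col) for col in zip(*c)]
    return max(abs(e - o) for e, o in zip(expected, observed))


def needs_recovery(a, b, c, threshold):
    """Trigger recovery only when the checksum gap exceeds the threshold.

    threshold = 0 mimics an accurate check (any deviation triggers recovery);
    a positive threshold is the assumed 'approximate' variant.
    """
    return column_checksum_gap(a, b, c) > threshold


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
clean = matmul(a, b)

# Inject a small and a large fault into copies of the result.
small_fault = [row[:] for row in clean]
small_fault[0][0] += 1
large_fault = [row[:] for row in clean]
large_fault[1][1] += 64

THRESHOLD = 8  # hypothetical value chosen only for this illustration

for name, c in [("clean", clean), ("small", small_fault), ("large", large_fault)]:
    print(name, needs_recovery(a, b, c, 0), needs_recovery(a, b, c, THRESHOLD))
```

With a threshold of zero both injected faults trigger recovery, while the positive threshold skips the small deviation and reacts only to the large one, which is the behavior the abstract attributes to approximate ABFT.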
Related papers
- Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection [1.474723404975345]
Machine learning-based embedded systems for safety-critical applications must be robust to perturbations caused by soft errors.
We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems.
arXiv Detail & Related papers (2025-04-07T14:21:31Z) - FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention [5.044679241062448]
Transformer models leverage self-attention mechanisms to capture dependencies, demonstrating exceptional performance in various applications.
Existing fault tolerance methods protect each operation separately using decoupled kernels, incurring substantial computational and memory overhead.
We propose a novel error-resilient framework for Transformer models, integrating end-to-end fault tolerant attention.
arXiv Detail & Related papers (2025-04-03T02:05:08Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - A constrained optimization approach to improve robustness of neural networks [1.2338729811609355]
We present a novel nonlinear programming-based approach to fine-tune pre-trained neural networks to improve robustness against adversarial attacks while maintaining accuracy on clean data.
arXiv Detail & Related papers (2024-09-18T18:37:14Z) - Cost-Effective Fault Tolerance for CNNs Using Parameter Vulnerability Based Hardening and Pruning [0.4660328753262075]
This paper introduces a model-level hardening approach for CNNs by integrating error correction directly into the neural networks.
The proposed method demonstrates fault resilience nearly equivalent to TMR-based correction but with significantly reduced overhead.
Remarkably, the hardened pruned CNNs perform up to 24% faster than the hardened un-pruned ones.
arXiv Detail & Related papers (2024-05-17T09:42:44Z) - Achieving Constraints in Neural Networks: A Stochastic Augmented
Lagrangian Approach [49.1574468325115]
Regularizing Deep Neural Networks (DNNs) is essential for improving generalizability and preventing overfitting.
We propose a novel approach to DNN regularization by framing the training process as a constrained optimization problem.
We employ the Stochastic Augmented Lagrangian (SAL) method to achieve a more flexible and efficient regularization mechanism.
arXiv Detail & Related papers (2023-10-25T13:55:35Z) - Guaranteed Approximation Bounds for Mixed-Precision Neural Operators [83.64404557466528]
We build on intuition that neural operator learning inherently induces an approximation error.
We show that our approach reduces GPU memory usage by up to 50% and improves throughput by 58% with little or no reduction in accuracy.
arXiv Detail & Related papers (2023-07-27T17:42:06Z) - APPRAISER: DNN Fault Resilience Analysis Employing Approximation Errors [1.1091582432763736]
Deep Neural Networks (DNNs) in safety-critical applications raise new reliability concerns.
State-of-the-art methods for fault injection by emulation incur a spectrum of time-, design- and control-complexity problems.
APPRAISER is proposed that applies functional approximation for a non-conventional purpose and employs approximate computing errors.
arXiv Detail & Related papers (2023-05-31T10:53:46Z) - Intelligent Proactive Fault Tolerance at the Edge through Resource Usage
Prediction [0.7046417074932255]
We propose an Intelligent Proactive Fault Tolerance (IPFT) method that leverages edge resource usage predictions made by Recurrent Neural Networks (RNNs).
In this paper, we focus on process faults, which arise when the infrastructure cannot provide Quality of Service (QoS) within acceptable ranges due to a lack of processing power.
arXiv Detail & Related papers (2023-02-09T00:42:34Z) - Minimizing Worst-Case Violations of Neural Networks [0.0]
This paper introduces a neural network training procedure designed to achieve both a good average performance and minimum worst-case violations.
We demonstrate the proposed architecture on four different test systems ranging from 39 buses to 162 buses, for both AC-OPF and DC-OPF applications.
arXiv Detail & Related papers (2022-12-21T11:20:12Z) - DeepFT: Fault-Tolerant Edge Computing using a Self-Supervised Deep
Surrogate Model [12.335763358698564]
We propose DeepFT to proactively avoid system overloads and their adverse effects.
DeepFT uses a deep surrogate model to accurately predict and diagnose faults in the system.
It offers a highly scalable solution as the model size scales by only 3 and 1 percent per unit increase in the number of active tasks and hosts.
arXiv Detail & Related papers (2022-12-02T16:51:58Z) - Fast Exploration of the Impact of Precision Reduction on Spiking Neural
Networks [63.614519238823206]
Spiking Neural Networks (SNNs) are a practical choice when the target hardware reaches the edge of computing.
We employ an Interval Arithmetic (IA) model to develop an exploration methodology that takes advantage of the capability of such a model to propagate the approximation error.
arXiv Detail & Related papers (2022-11-22T15:08:05Z) - An Accelerated Doubly Stochastic Gradient Method with Faster Explicit
Model Identification [97.28167655721766]
We propose a novel accelerated doubly stochastic gradient descent (ADSGD) method for sparsity regularized loss minimization problems.
We first prove that ADSGD can achieve a linear convergence rate and lower overall computational complexity.
arXiv Detail & Related papers (2022-08-11T22:27:22Z) - Fast and Accurate Error Simulation for CNNs against Soft Errors [64.54260986994163]
We present a framework for the reliability analysis of Convolutional Neural Networks (CNNs) via an error simulation engine.
These error models are defined based on the corruption patterns of the output of the CNN operators induced by faults.
We show that our methodology achieves about 99% accuracy of the fault effects w.r.t. SASSIFI, and a speedup ranging from 44x up to 63x w.r.t. FI, that only implements a limited set of error models.
arXiv Detail & Related papers (2022-06-04T19:45:02Z) - Robust lEarned Shrinkage-Thresholding (REST): Robust unrolling for
sparse recover [87.28082715343896]
We consider deep neural networks for solving inverse problems that are robust to forward model mis-specifications.
We design a new robust deep neural network architecture by applying algorithm unfolding techniques to a robust version of the underlying recovery problem.
The proposed REST network is shown to outperform state-of-the-art model-based and data-driven algorithms in both compressive sensing and radar imaging problems.
arXiv Detail & Related papers (2021-10-20T06:15:45Z) - Adaptive Anomaly Detection for Internet of Things in Hierarchical Edge
Computing: A Contextual-Bandit Approach [81.5261621619557]
We propose an adaptive anomaly detection scheme with hierarchical edge computing (HEC).
We first construct multiple anomaly detection DNN models with increasing complexity, and associate each of them to a corresponding HEC layer.
Then, we design an adaptive model selection scheme that is formulated as a contextual-bandit problem and solved by using a reinforcement learning policy network.
arXiv Detail & Related papers (2021-08-09T08:45:47Z) - Reduced-Order Neural Network Synthesis with Robustness Guarantees [0.0]
Machine learning algorithms are being adapted to run locally on board potentially hardware-limited devices to improve user privacy, reduce latency, and be more energy efficient.
To address this issue, a method is introduced to automatically synthesize reduced-order neural networks (having fewer neurons) approximating the input/output mapping of a larger one.
Worst-case bounds for this approximation error are obtained and the approach can be applied to a wide variety of neural networks architectures.
arXiv Detail & Related papers (2021-02-18T12:03:57Z) - A Simple Fine-tuning Is All You Need: Towards Robust Deep Learning Via
Adversarial Fine-tuning [90.44219200633286]
We propose a simple yet very effective adversarial fine-tuning approach based on a "slow start, fast decay" learning rate scheduling strategy.
Experimental results show that the proposed adversarial fine-tuning approach outperforms the state-of-the-art methods on CIFAR-10, CIFAR-100 and ImageNet datasets.
arXiv Detail & Related papers (2020-12-25T20:50:15Z) - Amortized Conditional Normalized Maximum Likelihood: Reliable Out of
Distribution Uncertainty Estimation [99.92568326314667]
We propose the amortized conditional normalized maximum likelihood (ACNML) method as a scalable general-purpose approach for uncertainty estimation.
Our algorithm builds on the conditional normalized maximum likelihood (CNML) coding scheme, which has minimax optimal properties according to the minimum description length principle.
We demonstrate that ACNML compares favorably to a number of prior techniques for uncertainty estimation in terms of calibration on out-of-distribution inputs.
arXiv Detail & Related papers (2020-11-05T08:04:34Z) - Combining Deep Learning and Optimization for Security-Constrained
Optimal Power Flow [94.24763814458686]
Security-constrained optimal power flow (SCOPF) is fundamental in power systems.
Modeling of automatic primary response (APR) within the SCOPF problem results in complex large-scale mixed-integer programs.
This paper proposes a novel approach that combines deep learning and robust optimization techniques.
arXiv Detail & Related papers (2020-07-14T12:38:21Z) - FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural
Networks [13.100954947774163]
Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields.
CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage.
Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components.
arXiv Detail & Related papers (2020-03-27T02:01:54Z)