Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers
- URL: http://arxiv.org/abs/2507.16676v1
- Date: Tue, 22 Jul 2025 15:11:13 GMT
- Title: Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers
- Authors: Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos
- Abstract summary: Transformers and large language models (LLMs) have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of the query, key, and value matrices of an attention layer with a single check. Results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead.
- Score: 3.668018928502405
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of the query, key, and value matrices of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.
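For intuition, classic ABFT verifies a single product C = A·B through the checksum identity 1ᵀ·C = (1ᵀ·A)·B, which costs one extra check per multiplication. The sketch below is a minimal NumPy illustration of how one checksum column can ride through the softmax-weighted score-value product of an attention layer; it assumes a single head and is only a software analogue of the checksum idea, not the paper's fused Flash-ABFT hardware unit.

```python
import numpy as np

def checked_attention(Q, K, V, tol=1e-6):
    """Illustrative ABFT-style checksum for one attention head.

    Classic ABFT verifies C = A @ B via ones @ C == (ones @ A) @ B.
    Here we exploit that softmax rows sum to one: an all-ones column
    appended to V must emerge from the weighted sum as an all-ones
    column, giving a single check over the normalized scores and the
    score-value product.  This is a software sketch of the checksum
    idea, not the paper's fused Flash-ABFT unit.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    # Numerically stable softmax; every row of P sums to 1.
    P = np.exp(scores - scores.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)

    V_aug = np.hstack([V, np.ones((V.shape[0], 1))])  # checksum column
    O_aug = P @ V_aug
    O, check = O_aug[:, :-1], O_aug[:, -1]

    # Fault-free, the checksum column is all ones; a corrupted entry
    # of P or of the product perturbs it beyond the tolerance.
    if not np.allclose(check, 1.0, atol=tol):
        raise RuntimeError("ABFT check failed: possible hardware fault")
    return O
```

Because each softmax row sums to exactly one, this appended column detects corruption in the normalized scores or in the second matrix product; covering faults in the preceding Q·Kᵀ product as well, with one fused check instead of per-multiplication checks, is what the paper's Flash-ABFT design addresses.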
Related papers
- Spark Transformer: Reactivating Sparsity in FFN and Attention [63.20677098823873]
We introduce Spark Transformer, a novel architecture that achieves a high level of activation sparsity in both the FFN and the attention mechanism. This sparsity translates to a 2.5x reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79x on CPU and 1.40x on GPU.
arXiv Detail & Related papers (2025-06-07T03:51:13Z) - Periodic Online Testing for Sparse Systolic Tensor Arrays [0.0]
Modern Machine Learning (ML) applications often benefit from structured sparsity, a technique that efficiently reduces model complexity and simplifies handling of sparse data in hardware. This paper introduces an online error-checking technique capable of detecting and locating permanent faults within sparse systolic tensor arrays before computation begins.
arXiv Detail & Related papers (2025-04-25T18:10:45Z) - FT-Transformer: Resilient and Reliable Transformer with End-to-End Fault Tolerant Attention [5.044679241062448]
Transformer models leverage self-attention mechanisms to capture dependencies, demonstrating exceptional performance in various applications. Existing fault tolerance methods protect each operation separately using decoupled kernels, incurring substantial computational and memory overhead. We propose a novel error-resilient framework for Transformer models, integrating end-to-end fault tolerant attention.
arXiv Detail & Related papers (2025-04-03T02:05:08Z) - Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z) - Dynamic Range Reduction via Branch-and-Bound [1.533133219129073]
A key strategy for enhancing hardware accelerators is reducing the precision of arithmetic operations.
This paper introduces a fully principled Branch-and-Bound algorithm for reducing precision needs in QUBO problems.
Experiments validate our algorithm's effectiveness on an actual quantum annealer.
arXiv Detail & Related papers (2024-09-17T03:07:56Z) - Characterizing Coherent Errors using Matrix-Element Amplification [0.27907340310431333]
We introduce Matrix-Element Amplification using Dynamical Decoupling (MEADD). We experimentally demonstrate that MEADD surpasses the accuracy and precision of existing characterization protocols for estimating systematic errors in single- and two-qubit gates. We also use it to characterize coherent crosstalk in the processor that was previously too small to detect reliably.
arXiv Detail & Related papers (2024-04-19T00:05:10Z) - ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures [5.502117675161604]
Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability.
It is crucial to ensure the correctness of their execution despite potential faults such as transient hardware errors.
We propose an algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis.
arXiv Detail & Related papers (2023-10-05T18:55:30Z) - Fast Flux-Activated Leakage Reduction for Superconducting Quantum Circuits [84.60542868688235]
Leakage out of the computational subspace, arising from the multi-level structure of qubit implementations, is a persistent source of error in superconducting quantum circuits.
We present a resource-efficient universal leakage reduction unit for superconducting qubits using parametric flux modulation.
We demonstrate that using the leakage reduction unit in repeated weight-two stabilizer measurements reduces the total number of detected errors in a scalable fashion.
arXiv Detail & Related papers (2023-09-13T16:21:32Z) - A Robust and Explainable Data-Driven Anomaly Detection Approach for Power Electronics [56.86150790999639]
We present two anomaly detection and classification approaches, namely the Matrix Profile algorithm and the Anomaly Transformer.
The Matrix Profile algorithm is shown to be well suited as a generalizable approach for detecting real-time anomalies in streaming time-series data.
A series of custom filters is created and added to the detector to tune its sensitivity, recall, and detection accuracy.
arXiv Detail & Related papers (2022-09-23T06:09:35Z) - Softmax-free Linear Transformers [90.83157268265654]
Vision transformers (ViTs) have pushed the state-of-the-art for visual perception tasks.
Existing linear-attention methods are either theoretically flawed or empirically ineffective for visual recognition.
We propose a family of Softmax-Free Transformers (SOFT).
arXiv Detail & Related papers (2022-07-05T03:08:27Z) - Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z) - Towards Online Monitoring and Data-driven Control: A Study of Segmentation Algorithms for Laser Powder Bed Fusion Processes [83.97264034062673]
An increasing number of laser powder bed fusion machines use off-axis infrared cameras to improve online monitoring and data-driven control capabilities.
We study over 30 segmentation algorithms that segment each infrared image into a foreground and background.
The identified algorithms can be readily applied to laser powder bed fusion machines and can thus significantly improve process control.
arXiv Detail & Related papers (2020-11-18T03:30:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.