Zero-Space Cost Fault Tolerance for Transformer-based Language Models on
ReRAM
- URL: http://arxiv.org/abs/2401.11664v1
- Date: Mon, 22 Jan 2024 02:50:38 GMT
- Title: Zero-Space Cost Fault Tolerance for Transformer-based Language Models on
ReRAM
- Authors: Bingbing Li, Geng Yuan, Zigeng Wang, Shaoyi Huang, Hongwu Peng, Payman
Behnam, Wujie Wen, Hang Liu and Caiwen Ding
- Abstract summary: Resistive Random Access Memory (ReRAM) has emerged as a promising platform for deep neural networks (DNNs).
Hardware failures, such as stuck-at-fault defects, can result in significant prediction errors during model inference.
We propose a fault protection mechanism that incurs zero space cost.
- Score: 27.354689865791638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Resistive Random Access Memory (ReRAM) has emerged as a promising platform
for deep neural networks (DNNs) due to its support for parallel in-situ
matrix-vector multiplication. However, hardware failures, such as
stuck-at-fault defects, can result in significant prediction errors during
model inference. While additional crossbars can be used to address these
failures, they come with storage overhead and are not efficient in terms of
space, energy, and cost. In this paper, we propose a fault protection mechanism
that incurs zero space cost. Our approach includes: 1) differentiable structure
pruning of rows and columns to reduce model redundancy, 2) weight duplication
and voting for robust output, and 3) embedding duplicated most significant bits
(MSBs) into the model weight. We evaluate our method on nine tasks of the GLUE
benchmark with the BERT model, and experimental results prove its
effectiveness.
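
As a rough, hypothetical illustration of the duplication-and-voting component (point 2 above), the sketch below stores three copies of an 8-bit weight and recovers from a stuck-at fault by bitwise majority vote; the copy count, bit width, and NumPy setting are illustrative assumptions, not the paper's ReRAM implementation.

```python
# Hypothetical sketch of the "duplicate and vote" idea: three stored copies of
# an 8-bit weight, recovered by bitwise majority vote so that a stuck-at fault
# in any single copy is masked. Copy count and bit width are assumptions.
import numpy as np

def majority_vote_bits(copies):
    """Bitwise majority vote over an odd number of uint8 weight copies."""
    bits = np.stack([np.unpackbits(c) for c in copies])
    voted = (bits.sum(axis=0) > len(copies) // 2).astype(np.uint8)
    return np.packbits(voted)

w = np.array([0b01011010], dtype=np.uint8)   # original weight
faulty = w | np.uint8(0b10000000)            # MSB stuck at 1 in one copy
recovered = majority_vote_bits([w, w.copy(), faulty])
assert recovered[0] == w[0]                  # the fault is voted out
```

In the paper itself, the redundancy is meant to come for free: structured pruning frees capacity and the duplicated most significant bits are embedded into the model weights, which is how the scheme avoids any extra crossbar space.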
Related papers
- Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization [0.0] (arXiv, 2024-09-27)
We present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices.
Our method achieves state-of-the-art results, enabling unprecedented sparsification of neural networks.
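
A toy sketch of the double-sparse structure W ≈ A @ B that DSF targets; the SVD-plus-magnitude-pruning construction below is an assumed stand-in for illustration, not the paper's factorization algorithm.

```python
# Toy illustration of the double-sparse structure W ~= A @ B.
# Naive SVD + magnitude pruning, NOT the DSF optimization; it only shows
# the two-sparse-factor form and its storage cost.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U * np.sqrt(s)                    # scale columns of U
B = np.sqrt(s)[:, None] * Vt          # scale rows of V^T

def magnitude_prune(M, sparsity=0.8):
    """Zero out the smallest-magnitude entries until `sparsity` of M is zero."""
    thresh = np.quantile(np.abs(M), sparsity)
    return np.where(np.abs(M) >= thresh, M, 0.0)

A_s, B_s = magnitude_prune(A), magnitude_prune(B)
rel_err = np.linalg.norm(W - A_s @ B_s) / np.linalg.norm(W)
nnz = int((A_s != 0).sum() + (B_s != 0).sum())
print(f"relative error {rel_err:.3f}, nonzeros {nnz} vs dense {W.size}")
```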
- ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures [5.502117675161604] (arXiv, 2023-10-05)
Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability.
It is crucial to ensure the correctness of their execution in spite of potential errors such as transient hardware errors.
We propose an algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis.
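
Checksum-style algorithm-based fault tolerance (ABFT) is the classic building block behind this kind of resilience; the sketch below shows only the generic column-checksum invariant for a matrix multiply, not ALBERTA's actual end-to-end scheme for transformer layers.

```python
# Generic ABFT check for a matrix multiply: the column checksum of A times B
# must equal the column sums of C = A @ B. This is the classic invariant such
# resilience schemes build on; ALBERTA's actual protection is more involved.
import numpy as np

def abft_consistent(A, B, C, rtol=1e-5):
    """Return True if C is consistent with A @ B under a column-checksum test."""
    return np.allclose(A.sum(axis=0) @ B, C.sum(axis=0), rtol=rtol, atol=1e-8)

rng = np.random.default_rng(1)
A, B = rng.standard_normal((64, 32)), rng.standard_normal((32, 16))
C = A @ B
assert abft_consistent(A, B, C)        # clean result passes the check
C[3, 5] += 10.0                        # simulate a transient hardware error
assert not abft_consistent(A, B, C)    # the corrupted output is detected
```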
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138] (arXiv, 2023-06-13)
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
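
A minimal sketch of a dense-and-sparse split in this spirit: a small fraction of outlier weights is kept exactly in a sparse set while the remaining dense part is quantized. The 0.5% outlier fraction and the naive uniform 3-bit grid are assumptions; SqueezeLLM itself uses a sensitivity-based non-uniform codebook.

```python
# Dense-and-sparse split: exact sparse outliers + quantized dense remainder.
# Outlier fraction and the uniform 3-bit grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((512, 512))

outlier_mask = np.abs(W) >= np.quantile(np.abs(W), 0.995)
outlier_vals = W[outlier_mask]                 # stored sparsely at full precision

dense = np.where(outlier_mask, 0.0, W)         # everything else gets quantized
levels = 2 ** 3
scale = np.abs(dense).max() / (levels // 2 - 1)
q = np.clip(np.round(dense / scale), -(levels // 2), levels // 2 - 1)

W_hat = q * scale
W_hat[outlier_mask] = outlier_vals             # add the sparse outliers back
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```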
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222] (arXiv, 2023-05-24)
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
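
For context, a sketch of the classical column-row sampling (CRS) estimator of a matrix product, the unbiased baseline that WTA-CRS refines; the winner-take-all variance-reduction step is not reproduced here.

```python
# Classical column-row sampling (CRS) estimator of A @ B: sample k column-row
# pairs with probability proportional to their norm product and rescale,
# giving an unbiased estimate. WTA-CRS is a lower-variance refinement.
import numpy as np

def crs_matmul(A, B, k, rng):
    """Unbiased estimate of A @ B from k sampled column-row outer products."""
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs /= probs.sum()
    idx = rng.choice(A.shape[1], size=k, replace=True, p=probs)
    est = np.zeros((A.shape[0], B.shape[1]))
    for t in idx:
        est += np.outer(A[:, t], B[t, :]) / (k * probs[t])
    return est

rng = np.random.default_rng(3)
A, B = rng.standard_normal((64, 256)), rng.standard_normal((256, 32))
exact = A @ B
approx = crs_matmul(A, B, k=64, rng=rng)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```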
- GBSVM: Granular-ball Support Vector Machine [46.60182022640765] (arXiv, 2022-10-06)
GBSVM is a significant attempt to construct a classifier using the coarse-to-fine granularity of a granular-ball as input, rather than a single data point.
This paper fixes the errors in the original GBSVM model and derives its dual model.
The experimental results on the UCI benchmark datasets demonstrate that GBSVM has good robustness and efficiency.
- Confident Adaptive Language Modeling [95.45272377648773] (arXiv, 2022-07-14)
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $\times 3$ -- while provably maintaining high performance.
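
A toy sketch of confidence-based early exiting in this spirit: at each generation step, stop running further layers once an intermediate prediction is confident enough. The random layers, output head, and 0.9 threshold are stand-ins, not CALM's model or its calibrated exit rule.

```python
# Toy confidence-based early exit: skip remaining layers once an intermediate
# prediction clears a confidence threshold. Random stand-in model only.
import numpy as np

rng = np.random.default_rng(4)
n_layers, d, vocab = 12, 64, 100
layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
head = rng.standard_normal((d, vocab)) / np.sqrt(d)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def generate_step(h, threshold=0.9):
    for depth, W in enumerate(layers, start=1):
        h = np.tanh(h @ W)
        p = softmax(h @ head)
        if p.max() >= threshold:          # confident enough: skip the rest
            return int(p.argmax()), depth
    return int(p.argmax()), n_layers      # fall through: used every layer

token, used = generate_step(rng.standard_normal(d))
print(f"emitted token {token} after {used}/{n_layers} layers")
```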
- Discriminative-Generative Dual Memory Video Anomaly Detection [81.09977516403411] (arXiv, 2021-04-29)
Recent work uses a few anomalies, rather than only normal data, when training video anomaly detection (VAD) models.
We propose a DiscRiminative-gEnerative duAl Memory (DREAM) anomaly detection model to take advantage of a few anomalies and solve data imbalance.
- Efficient pre-training objectives for Transformers [84.64393460397471] (arXiv, 2021-04-20)
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
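
A small sketch of the design choice in question, contrasting a loss computed only at corrupted positions with one computed over the whole output sequence; the logits and labels are toy stand-ins rather than a real pre-training setup.

```python
# Masked-positions-only loss vs. loss over the whole output sequence.
# Toy logits and labels; no actual pre-training objective is implemented.
import numpy as np

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(5)
seq_len, vocab = 16, 50
logits = rng.standard_normal((seq_len, vocab))     # toy model outputs
labels = rng.integers(0, vocab, size=seq_len)      # original tokens

masked = np.zeros(seq_len, dtype=bool)
masked[rng.choice(seq_len, size=3, replace=False)] = True   # corrupted positions

loss_masked_only = cross_entropy(logits[masked], labels[masked])
loss_full_output = cross_entropy(logits, labels)   # supervise every position
print(f"masked-only: {loss_masked_only:.3f}  full-output: {loss_full_output:.3f}")
```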
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975] (arXiv, 2021-01-04)
Deep neural networks (DNNs) are heavily parameterized, which pushes their weights into external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
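
A hedged sketch of the storage-for-computation trade SD describes: rather than fetching a dense weight matrix from DRAM, store a small dense basis and a sparse coefficient matrix whose nonzeros are powers of two, and rebuild the weights on the fly. The decomposition below is illustrative only; the paper's actual algorithm and training procedure differ.

```python
# Illustrative storage-for-computation trade: keep a small dense basis B and a
# sparse power-of-two coefficient matrix C, then rematerialize W = C @ B
# (shift-and-add friendly) instead of storing dense W. Not SD's exact method.
import numpy as np

rng = np.random.default_rng(6)
out_dim, in_dim, n_basis = 128, 128, 8

B = rng.standard_normal((n_basis, in_dim))              # small dense basis
signs = np.sign(rng.standard_normal((out_dim, n_basis)))
exponents = rng.integers(-3, 2, size=(out_dim, n_basis))
keep = rng.random((out_dim, n_basis)) < 0.5             # ~50% sparse coefficients
C = np.where(keep, signs * 2.0 ** exponents, 0.0)

W_rebuilt = C @ B                                       # rematerialized weights
stored = B.size + int(np.count_nonzero(C))              # rough storage proxy
print(f"stored values: {stored} vs dense: {out_dim * in_dim}")
```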
- FTBNN: Rethinking Non-linearity for 1-bit CNNs and Going Beyond [23.5996182207431] (arXiv, 2020-10-19)
We show that the binarized convolution process becomes increasingly linear as it works to minimize the binarization error, which in turn hampers the BNN's discriminative ability.
We re-investigate and tune appropriate non-linear modules to resolve this contradiction, leading to a strong baseline that achieves state-of-the-art performance.
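
A deliberately simplified toy of the linearity issue: two scaled binary linear maps with nothing in between compose into a single linear map, while inserting an explicit non-linear module (a PReLU-style stand-in; FTBNN's actual module choice may differ, and real BNNs also binarize activations) breaks that additivity.

```python
# Two scaled binary linear maps compose into one linear map, so additivity
# holds exactly; a PReLU-style non-linearity between them breaks it.
# Simplified stand-in: activations are not binarized here.
import numpy as np

rng = np.random.default_rng(7)

def binarized_linear(x, W):
    alpha = np.abs(W).mean()              # scaling factor for the binary weights
    return (x @ np.sign(W).T) * alpha

def prelu(x, slope=0.25):
    return np.where(x >= 0, x, slope * x)

W1, W2 = rng.standard_normal((64, 32)), rng.standard_normal((16, 64))
x, y = rng.standard_normal((8, 32)), rng.standard_normal((8, 32))

f = lambda v: binarized_linear(binarized_linear(v, W1), W2)
print(np.allclose(f(x + y), f(x) + f(y)))   # True: the stack is purely linear

g = lambda v: binarized_linear(prelu(binarized_linear(v, W1)), W2)
print(np.allclose(g(x + y), g(x) + g(y)))   # False: non-linearity restored
```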
- Wide Boosting [0.0] (arXiv, 2020-07-20)
This paper presents a simple adjustment to Gradient Boosting (GB), motivated in part by artificial neural networks.
We call our method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional output tasks.