Zero-Space Cost Fault Tolerance for Transformer-based Language Models on
ReRAM
- URL: http://arxiv.org/abs/2401.11664v1
- Date: Mon, 22 Jan 2024 02:50:38 GMT
- Title: Zero-Space Cost Fault Tolerance for Transformer-based Language Models on
ReRAM
- Authors: Bingbing Li, Geng Yuan, Zigeng Wang, Shaoyi Huang, Hongwu Peng, Payman
Behnam, Wujie Wen, Hang Liu and Caiwen Ding
- Abstract summary: Resistive Random Access Memory (ReRAM) has emerged as a promising platform for deep neural networks (DNNs).
Hardware failures, such as stuck-at-fault defects, can result in significant prediction errors during model inference.
We propose a fault protection mechanism that incurs zero space cost.
- Score: 27.354689865791638
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Resistive Random Access Memory (ReRAM) has emerged as a promising platform
for deep neural networks (DNNs) due to its support for parallel in-situ
matrix-vector multiplication. However, hardware failures, such as
stuck-at-fault defects, can result in significant prediction errors during
model inference. While additional crossbars can be used to address these
failures, they come with storage overhead and are not efficient in terms of
space, energy, and cost. In this paper, we propose a fault protection mechanism
that incurs zero space cost. Our approach includes: 1) differentiable structure
pruning of rows and columns to reduce model redundancy, 2) weight duplication
and voting for robust output, and 3) embedding duplicated most significant bits
(MSBs) into the model weight. We evaluate our method on nine tasks of the GLUE
benchmark with the BERT model, and experimental results prove its
effectiveness.
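
As a rough, hypothetical illustration of the duplication-and-voting component (point 2 above), the sketch below stores three copies of an 8-bit weight and recovers from a stuck-at fault by bitwise majority vote; the copy count, bit width, and NumPy setting are illustrative assumptions, not the paper's ReRAM implementation.

```python
# Hypothetical sketch of the "duplicate and vote" idea: three stored copies of
# an 8-bit weight, recovered by bitwise majority vote so that a stuck-at fault
# in any single copy is masked. Copy count and bit width are assumptions.
import numpy as np

def majority_vote_bits(copies):
    """Bitwise majority vote over an odd number of uint8 weight copies."""
    bits = np.stack([np.unpackbits(c) for c in copies])
    voted = (bits.sum(axis=0) > len(copies) // 2).astype(np.uint8)
    return np.packbits(voted)

w = np.array([0b01011010], dtype=np.uint8)   # original weight
faulty = w | np.uint8(0b10000000)            # MSB stuck at 1 in one copy
recovered = majority_vote_bits([w, w.copy(), faulty])
assert recovered[0] == w[0]                  # the fault is voted out
```

In the paper itself, the redundancy is meant to come for free: structured pruning frees capacity and the duplicated most significant bits are embedded into the model weights, which is how the scheme avoids any extra crossbar space.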
Related papers
- Two Sparse Matrices are Better than One: Sparsifying Neural Networks with Double Sparse Factorization [0.0] (arXiv, 2024-09-27)
We present Double Sparse Factorization (DSF), where we factorize each weight matrix into two sparse matrices.
Our method achieves state-of-the-art results, enabling unprecedented sparsification of neural networks.
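
A toy sketch of the double-sparse structure W ≈ A @ B that DSF targets; the SVD-plus-magnitude-pruning construction below is an assumed stand-in for illustration, not the paper's factorization algorithm.

```python
# Toy illustration of the double-sparse structure W ~= A @ B.
# Naive SVD + magnitude pruning, NOT the DSF optimization; it only shows
# the two-sparse-factor form and its storage cost.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256))

U, s, Vt = np.linalg.svd(W, full_matrices=False)
A = U * np.sqrt(s)                    # scale columns of U
B = np.sqrt(s)[:, None] * Vt          # scale rows of V^T

def magnitude_prune(M, sparsity=0.8):
    """Zero out the smallest-magnitude entries until `sparsity` of M is zero."""
    thresh = np.quantile(np.abs(M), sparsity)
    return np.where(np.abs(M) >= thresh, M, 0.0)

A_s, B_s = magnitude_prune(A), magnitude_prune(B)
rel_err = np.linalg.norm(W - A_s @ B_s) / np.linalg.norm(W)
nnz = int((A_s != 0).sum() + (B_s != 0).sum())
print(f"relative error {rel_err:.3f}, nonzeros {nnz} vs dense {W.size}")
```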
- ALBERTA: ALgorithm-Based Error Resilience in Transformer Architectures [5.502117675161604] (arXiv, 2023-10-05)
Vision Transformers are being increasingly deployed in safety-critical applications that demand high reliability.
It is crucial to ensure the correctness of their execution in spite of potential errors such as transient hardware errors.
We propose an algorithm-based resilience framework called ALBERTA that allows us to perform end-to-end resilience analysis.
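
Checksum-style algorithm-based fault tolerance (ABFT) is the classic building block behind this kind of resilience; the sketch below shows only the generic column-checksum invariant for a matrix multiply, not ALBERTA's actual end-to-end scheme for transformer layers.

```python
# Generic ABFT check for a matrix multiply: the column checksum of A times B
# must equal the column sums of C = A @ B. This is the classic invariant such
# resilience schemes build on; ALBERTA's actual protection is more involved.
import numpy as np

def abft_consistent(A, B, C, rtol=1e-5):
    """Return True if C is consistent with A @ B under a column-checksum test."""
    return np.allclose(A.sum(axis=0) @ B, C.sum(axis=0), rtol=rtol, atol=1e-8)

rng = np.random.default_rng(1)
A, B = rng.standard_normal((64, 32)), rng.standard_normal((32, 16))
C = A @ B
assert abft_consistent(A, B, C)        # clean result passes the check
C[3, 5] += 10.0                        # simulate a transient hardware error
assert not abft_consistent(A, B, C)    # the corrupted output is detected
```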
- SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138] (arXiv, 2023-06-13)
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3-bit.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition that stores outliers and sensitive weight values in an efficient sparse format.
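
A minimal sketch of a dense-and-sparse split in this spirit: a small fraction of outlier weights is kept exactly in a sparse set while the remaining dense part is quantized. The 0.5% outlier fraction and the naive uniform 3-bit grid are assumptions; SqueezeLLM itself uses a sensitivity-based non-uniform codebook.

```python
# Dense-and-sparse split: exact sparse outliers + quantized dense remainder.
# Outlier fraction and the uniform 3-bit grid are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(2)
W = rng.standard_normal((512, 512))

outlier_mask = np.abs(W) >= np.quantile(np.abs(W), 0.995)
outlier_vals = W[outlier_mask]                 # stored sparsely at full precision

dense = np.where(outlier_mask, 0.0, W)         # everything else gets quantized
levels = 2 ** 3
scale = np.abs(dense).max() / (levels // 2 - 1)
q = np.clip(np.round(dense / scale), -(levels // 2), levels // 2 - 1)

W_hat = q * scale
W_hat[outlier_mask] = outlier_vals             # add the sparse outliers back
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```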
- Winner-Take-All Column Row Sampling for Memory Efficient Adaptation of Language Model [89.8764435351222] (arXiv, 2023-05-24)
We propose WTA-CRS, a new family of unbiased estimators for matrix multiplication with reduced variance.
Our work provides both theoretical and experimental evidence that, in the context of tuning transformers, our proposed estimators exhibit lower variance compared to existing ones.
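
For context, a sketch of the classical column-row sampling (CRS) estimator of a matrix product, the unbiased baseline that WTA-CRS refines; the winner-take-all variance-reduction step is not reproduced here.

```python
# Classical column-row sampling (CRS) estimator of A @ B: sample k column-row
# pairs with probability proportional to their norm product and rescale,
# giving an unbiased estimate. WTA-CRS is a lower-variance refinement.
import numpy as np

def crs_matmul(A, B, k, rng):
    """Unbiased estimate of A @ B from k sampled column-row outer products."""
    probs = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    probs /= probs.sum()
    idx = rng.choice(A.shape[1], size=k, replace=True, p=probs)
    est = np.zeros((A.shape[0], B.shape[1]))
    for t in idx:
        est += np.outer(A[:, t], B[t, :]) / (k * probs[t])
    return est

rng = np.random.default_rng(3)
A, B = rng.standard_normal((64, 256)), rng.standard_normal((256, 32))
exact = A @ B
approx = crs_matmul(A, B, k=64, rng=rng)
print("relative error:", np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```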
- GBSVM: Granular-ball Support Vector Machine [46.60182022640765] (arXiv, 2022-10-06)
GBSVM is a significant attempt to construct a classifier using the coarse-to-fine granularity of a granular-ball as input, rather than a single data point.
This paper fixes the errors in the original GBSVM model and derives its dual model.
The experimental results on the UCI benchmark datasets demonstrate that GBSVM has good robustness and efficiency.
- Confident Adaptive Language Modeling [95.45272377648773] (arXiv, 2022-07-14)
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- a potential speedup of up to $\times 3$ -- while provably maintaining high performance.
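
A toy sketch of confidence-based early exiting in this spirit: at each generation step, stop running further layers once an intermediate prediction is confident enough. The random layers, output head, and 0.9 threshold are stand-ins, not CALM's model or its calibrated exit rule.

```python
# Toy confidence-based early exit: skip remaining layers once an intermediate
# prediction clears a confidence threshold. Random stand-in model only.
import numpy as np

rng = np.random.default_rng(4)
n_layers, d, vocab = 12, 64, 100
layers = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]
head = rng.standard_normal((d, vocab)) / np.sqrt(d)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def generate_step(h, threshold=0.9):
    for depth, W in enumerate(layers, start=1):
        h = np.tanh(h @ W)
        p = softmax(h @ head)
        if p.max() >= threshold:          # confident enough: skip the rest
            return int(p.argmax()), depth
    return int(p.argmax()), n_layers      # fall through: used every layer

token, used = generate_step(rng.standard_normal(d))
print(f"emitted token {token} after {used}/{n_layers} layers")
```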
- Discriminative-Generative Dual Memory Video Anomaly Detection [81.09977516403411] (arXiv, 2021-04-29)
Recent work uses a few anomalies, rather than only normal data, when training video anomaly detection (VAD) models.
We propose a DiscRiminative-gEnerative duAl Memory (DREAM) anomaly detection model to take advantage of a few anomalies and solve data imbalance.
- Efficient pre-training objectives for Transformers [84.64393460397471] (arXiv, 2021-04-20)
We study several efficient pre-training objectives for Transformer-based models.
We prove that eliminating the MASK token and computing the loss over the whole output are essential choices for improving performance.
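
A small sketch of the design choice in question, contrasting a loss computed only at corrupted positions with one computed over the whole output sequence; the logits and labels are toy stand-ins rather than a real pre-training setup.

```python
# Masked-positions-only loss vs. loss over the whole output sequence.
# Toy logits and labels; no actual pre-training objective is implemented.
import numpy as np

def cross_entropy(logits, labels):
    logits = logits - logits.max(axis=-1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

rng = np.random.default_rng(5)
seq_len, vocab = 16, 50
logits = rng.standard_normal((seq_len, vocab))     # toy model outputs
labels = rng.integers(0, vocab, size=seq_len)      # original tokens

masked = np.zeros(seq_len, dtype=bool)
masked[rng.choice(seq_len, size=3, replace=False)] = True   # corrupted positions

loss_masked_only = cross_entropy(logits[masked], labels[masked])
loss_full_output = cross_entropy(logits, labels)   # supervise every position
print(f"masked-only: {loss_masked_only:.3f}  full-output: {loss_full_output:.3f}")
```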
- SmartDeal: Re-Modeling Deep Network Weights for Efficient Inference and Training [82.35376405568975] (arXiv, 2021-01-04)
Deep neural networks (DNNs) are heavily parameterized, which pushes their weights into external dynamic random-access memory (DRAM) for storage.
We present SmartDeal (SD), an algorithm framework to trade higher-cost memory storage/access for lower-cost computation.
We show that SD leads to 10.56x and 4.48x reduction in the storage and training energy, with negligible accuracy loss compared to state-of-the-art training baselines.
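
A hedged sketch of the storage-for-computation trade SD describes: rather than fetching a dense weight matrix from DRAM, store a small dense basis and a sparse coefficient matrix whose nonzeros are powers of two, and rebuild the weights on the fly. The decomposition below is illustrative only; the paper's actual algorithm and training procedure differ.

```python
# Illustrative storage-for-computation trade: keep a small dense basis B and a
# sparse power-of-two coefficient matrix C, then rematerialize W = C @ B
# (shift-and-add friendly) instead of storing dense W. Not SD's exact method.
import numpy as np

rng = np.random.default_rng(6)
out_dim, in_dim, n_basis = 128, 128, 8

B = rng.standard_normal((n_basis, in_dim))              # small dense basis
signs = np.sign(rng.standard_normal((out_dim, n_basis)))
exponents = rng.integers(-3, 2, size=(out_dim, n_basis))
keep = rng.random((out_dim, n_basis)) < 0.5             # ~50% sparse coefficients
C = np.where(keep, signs * 2.0 ** exponents, 0.0)

W_rebuilt = C @ B                                       # rematerialized weights
stored = B.size + int(np.count_nonzero(C))              # rough storage proxy
print(f"stored values: {stored} vs dense: {out_dim * in_dim}")
```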
- FTBNN: Rethinking Non-linearity for 1-bit CNNs and Going Beyond [23.5996182207431] (arXiv, 2020-10-19)
We show that the binarized convolution process becomes increasingly linear as it works to minimize the binarization error, which in turn hampers the BNN's discriminative ability.
We re-investigate and tune appropriate non-linear modules to resolve this contradiction, leading to a strong baseline that achieves state-of-the-art performance.
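
A deliberately simplified toy of the linearity issue: two scaled binary linear maps with nothing in between compose into a single linear map, while inserting an explicit non-linear module (a PReLU-style stand-in; FTBNN's actual module choice may differ, and real BNNs also binarize activations) breaks that additivity.

```python
# Two scaled binary linear maps compose into one linear map, so additivity
# holds exactly; a PReLU-style non-linearity between them breaks it.
# Simplified stand-in: activations are not binarized here.
import numpy as np

rng = np.random.default_rng(7)

def binarized_linear(x, W):
    alpha = np.abs(W).mean()              # scaling factor for the binary weights
    return (x @ np.sign(W).T) * alpha

def prelu(x, slope=0.25):
    return np.where(x >= 0, x, slope * x)

W1, W2 = rng.standard_normal((64, 32)), rng.standard_normal((16, 64))
x, y = rng.standard_normal((8, 32)), rng.standard_normal((8, 32))

f = lambda v: binarized_linear(binarized_linear(v, W1), W2)
print(np.allclose(f(x + y), f(x) + f(y)))   # True: the stack is purely linear

g = lambda v: binarized_linear(prelu(binarized_linear(v, W1)), W2)
print(np.allclose(g(x + y), g(x) + g(y)))   # False: non-linearity restored
```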
- Wide Boosting [0.0] (arXiv, 2020-07-20)
This paper presents a simple adjustment to Gradient Boosting (GB), motivated in part by artificial neural networks.
We call our method Wide Boosting (WB) and show that WB outperforms GB on multi-dimensional output tasks.