Related papers: Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM Neural Networks

Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM Neural Networks

URL: http://arxiv.org/abs/2002.10636v1
Date: Tue, 25 Feb 2020 02:59:45 GMT
Title: Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM Neural Networks
Authors: Wen Ma, Pi-Feng Chiu, Won Ho Choi, Minghai Qin, Daniel Bedau, Martin Lueker-Boden
Abstract summary: This paper focuses on the application of quantization-aware training algorithm to LSTM models. We have shown that only 4-bit NVM weights and 4-bit ADC/DACs are needed to produce equivalent LSTM network performance as floating-point baseline. Benchmark analysis of our proposed LSTM accelerator for inference has shown at least 2.4x better computing efficiency and 40x higher area efficiency than traditional digital approaches.
Score: 1.5332481598232224
License: http://creativecommons.org/licenses/by/4.0/
Abstract: In cloud and edge computing models, it is important that compute devices at the edge be as power efficient as possible. Long short-term memory (LSTM) neural networks have been widely used for natural language processing, time series prediction and many other sequential data tasks. Thus, for these applications there is increasing need for low-power accelerators for LSTM model inference at the edge. In order to reduce power dissipation due to data transfers within inference devices, there has been significant interest in accelerating vector-matrix multiplication (VMM) operations using non-volatile memory (NVM) weight arrays. In NVM array-based hardware, reduced bit-widths also significantly increases the power efficiency. In this paper, we focus on the application of quantization-aware training algorithm to LSTM models, and the benefits these models bring in terms of resilience against both quantization error and analog device noise. We have shown that only 4-bit NVM weights and 4-bit ADC/DACs are needed to produce equivalent LSTM network performance as floating-point baseline. Reasonable levels of ADC quantization noise and weight noise can be naturally tolerated within our NVMbased quantized LSTM network. Benchmark analysis of our proposed LSTM accelerator for inference has shown at least 2.4x better computing efficiency and 40x higher area efficiency than traditional digital approaches (GPU, FPGA, and ASIC). Some other novel approaches based on NVM promise to deliver higher computing efficiency (up to 4.7x) but require larger arrays with potential higher error rates.

Related papers

Efficient Deployment of CNN Models on Multiple In-Memory Computing Units [0.0]
In-Memory Computing (IMC) represents a paradigm shift in deep learning acceleration.<n>We introduce the Load-Balance-Longest-Path (LBLP) algorithm for maximizing the processing rate and minimizing latency due to efficient resources utilization.
arXiv Detail & Related papers (2025-10-09T14:03:32Z)
Neuromorphic Wireless Split Computing with Multi-Level Spikes [69.73249913506042]
In neuromorphic computing, spiking neural networks (SNNs) perform inference tasks, offering significant efficiency gains for workloads involving sequential data. Recent advances in hardware and software have demonstrated that embedding a few bits of payload in each spike exchanged between the spiking neurons can further enhance inference accuracy. This paper investigates a wireless neuromorphic split computing architecture employing multi-level SNNs.
arXiv Detail & Related papers (2024-11-07T14:08:35Z)
SpiDR: A Reconfigurable Digital Compute-in-Memory Spiking Neural Network Accelerator for Event-based Perception [8.968583287058959]
Spiking Neural Networks (SNNs) offer an efficient method for processing the asynchronous temporal data generated by Dynamic Vision Sensors (DVS) Existing SNN accelerators suffer from limitations in adaptability to diverse neuron models, bit precisions and network sizes. We propose a scalable and reconfigurable digital compute-in-memory (CIM) SNN accelerator chipname with a set of key features.
arXiv Detail & Related papers (2024-11-05T06:59:02Z)
Not All Attention is Needed: Parameter and Computation Efficient Transfer Learning for Multi-modal Large Language Models [73.48675708831328]
We propose a novel parameter and computation efficient tuning method for Multi-modal Large Language Models (MLLMs) The Efficient Attention Skipping (EAS) method evaluates the attention redundancy and skips the less important MHAs to speed up inference. The experiments show that EAS not only retains high performance and parameter efficiency, but also greatly speeds up inference speed.
arXiv Detail & Related papers (2024-03-22T14:20:34Z)
ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation [2.7488316163114823]
This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations. Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix. On a 16$times$16 crossbars, for 8-bit input processing, the proposed approach achieves the energy efficiency of 1602 tera operations per second per Watt.
arXiv Detail & Related papers (2023-09-04T19:19:39Z)
Energy Efficient Hardware Acceleration of Neural Networks with Power-of-Two Quantisation [0.0]
We show that a hardware neural network accelerator with PoT weights implemented on the Zynq UltraScale + MPSoC ZCU104 FPGA can be at least $1.4x$ more energy efficient than the uniform quantisation version.
arXiv Detail & Related papers (2022-09-30T06:33:40Z)
Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex. In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z)
RF-Photonic Deep Learning Processor with Shannon-Limited Data Movement [0.0]
Optical neural networks (ONNs) are promising accelerators with ultra-low latency and energy consumption. We introduce our multiplicative analog frequency transform ONN (MAFT-ONN) that encodes the data in the frequency domain. We experimentally demonstrate the first hardware accelerator that computes fully-analog deep learning on raw RF signals.
arXiv Detail & Related papers (2022-07-08T16:37:13Z)
MemSE: Fast MSE Prediction for Noisy Memristor-Based DNN Accelerators [5.553959304125023]
We theoretically analyze the mean squared error of DNNs that use memristors to compute matrix-vector multiplications (MVM) We take into account both the quantization noise, due to the necessity of reducing the DNN model size, and the programming noise, stemming from the variability during the programming of the memristance value. The proposed method is almost two order of magnitude faster than Monte-Carlo simulation, thus making it possible to optimize the implementation parameters to achieve minimal error for a given power constraint.
arXiv Detail & Related papers (2022-05-03T18:10:43Z)
Mixed Precision Low-bit Quantization of Neural Network Language Models for Speech Recognition [67.95996816744251]
State-of-the-art language models (LMs) represented by long-short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming increasingly complex and expensive for practical applications. Current quantization methods are based on uniform precision and fail to account for the varying performance sensitivity at different parts of LMs to quantization errors. Novel mixed precision neural network LM quantization methods are proposed in this paper.
arXiv Detail & Related papers (2021-11-29T12:24:02Z)
Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update [49.948082497688404]
Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts. One promising approach to reduce the energy costs is representing DNNs with low-precision numbers. We jointly design a lowprecision training framework involving a logarithmic number system (LNS) and a multiplicative weight update training method, termed LNS-Madam.
arXiv Detail & Related papers (2021-06-26T00:32:17Z)
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. We present EdgeBERT, an in-depth algorithm- hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.