BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio
Sparsification
- URL: http://arxiv.org/abs/2101.02667v1
- Date: Thu, 7 Jan 2021 18:23:48 GMT
- Title: BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio
Sparsification
- Authors: Seyed Abolfazl Ghasemzadeh, Erfan Bank Tavakoli, Mehdi Kamal, Ali
Afzali-Kusha, Massoud Pedram
- Abstract summary: A hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented.
Results show that the proposed accelerator provides up to 272% higher effective GOPS/W while reducing perplexity error by up to 1.4% on the PTB dataset.
- Score: 3.3711251611130337
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, first, a hardware-friendly pruning algorithm for reducing
energy consumption and improving the speed of Long Short-Term Memory (LSTM)
neural network accelerators is presented. Next, an FPGA-based platform for
efficient execution of the pruned networks based on the proposed algorithm is
introduced. Because the two weight matrices of the LSTM models differ in their
sensitivity to pruning, different sparsity ratios (i.e., dual-ratio sparsity)
are applied to them. To reduce memory accesses, a row-wise sparsity pattern is
adopted. The proposed hardware architecture uses computation overlapping and
pipelining to achieve low power consumption and high speed. The effectiveness
of the proposed pruning algorithm and accelerator is assessed on benchmarks for
natural language processing, binary sentiment classification, and speech
recognition. Results show that, for example, compared to a recently published
work in this field, the proposed accelerator provides up to 272% higher
effective GOPS/W while reducing perplexity error by up to 1.4% on the PTB
dataset.
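The following is a minimal NumPy sketch of the row-balanced, dual-ratio pruning idea described in the abstract: every row of a weight matrix keeps the same number of largest-magnitude entries, and the two LSTM weight matrices receive different sparsity ratios. The specific ratios, matrix shapes, and the per-row top-k criterion are illustrative assumptions, not the exact procedure from the paper.

```python
import numpy as np

def prune_row_balanced(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Keep the same number of largest-magnitude entries in every row.

    A uniform per-row nonzero count keeps the workload balanced across rows,
    which simplifies scheduling on the accelerator's processing elements.
    """
    rows, cols = weights.shape
    keep = max(1, int(round(cols * (1.0 - sparsity))))  # nonzeros kept per row
    pruned = np.zeros_like(weights)
    for r in range(rows):
        top = np.argsort(np.abs(weights[r]))[-keep:]    # indices of largest |w|
        pruned[r, top] = weights[r, top]
    return pruned

# Dual-ratio sparsity: the two LSTM weight matrices are pruned with different
# ratios because they differ in sensitivity to pruning. The 80% / 60% ratios
# below are placeholders for illustration only.
rng = np.random.default_rng(0)
W_x = rng.standard_normal((4 * 128, 64))    # input-to-hidden weights (4 gates stacked)
W_h = rng.standard_normal((4 * 128, 128))   # hidden-to-hidden (recurrent) weights

W_x_sparse = prune_row_balanced(W_x, sparsity=0.80)
W_h_sparse = prune_row_balanced(W_h, sparsity=0.60)

print("W_x nonzeros per row:", np.count_nonzero(W_x_sparse, axis=1)[:3])
print("W_h nonzeros per row:", np.count_nonzero(W_h_sparse, axis=1)[:3])
```

Because every row retains a fixed number of nonzeros, each row can be stored as a fixed-length list of index/value pairs, which is what makes the row-wise sparsity pattern friendly to regular memory accesses.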
Related papers
- Expanding Sparse Tuning for Low Memory Usage [103.43560327427647]
We propose a method named SNELL (Sparse tuning with kerNELized LoRA) for sparse tuning with low memory usage.
To achieve low memory usage, SNELL decomposes the tunable matrix for sparsification into two learnable low-rank matrices.
A competition-based sparsification mechanism is further proposed to avoid the storage of tunable weight indexes.
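Before the entry's arXiv link, here is a rough NumPy sketch of the two ideas the SNELL summary mentions: the tunable update is expressed as a product of two learnable low-rank matrices, and a competition-based (top-magnitude) rule zeroes out most entries at merge time so no index tensor has to be stored. The rank, keep ratio, and exact competition rule are assumptions for illustration, not SNELL's actual algorithm.

```python
import numpy as np

def snell_like_update(W: np.ndarray, A: np.ndarray, B: np.ndarray,
                      keep_ratio: float = 0.1) -> np.ndarray:
    """Form a sparse update from two learnable low-rank factors.

    Only the two small factors A and B are trained and stored; a magnitude
    "competition" at merge time keeps the strongest entries of A @ B, so no
    explicit index tensor is needed.
    """
    delta = A @ B                                    # dense low-rank update
    k = int(delta.size * keep_ratio)                 # entries that survive
    thresh = np.partition(np.abs(delta).ravel(), -k)[-k]
    mask = np.abs(delta) >= thresh                   # winners of the competition
    return W + delta * mask

rng = np.random.default_rng(0)
d_out, d_in, rank = 256, 256, 8                      # illustrative sizes
W = rng.standard_normal((d_out, d_in))               # frozen pretrained weight
A = 0.01 * rng.standard_normal((d_out, rank))        # learnable factor
B = 0.01 * rng.standard_normal((rank, d_in))         # learnable factor

W_tuned = snell_like_update(W, A, B, keep_ratio=0.1)
print("changed entries:", np.count_nonzero(W_tuned - W), "of", W.size)
```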
arXiv Detail & Related papers (2024-11-04T04:58:20Z)
- Search for Efficient Large Language Models [52.98684997131108]
Large Language Models (LLMs) have long held sway in the realms of artificial intelligence research.
Weight pruning, quantization, and distillation have been embraced to compress LLMs, targeting memory reduction and inference acceleration.
Most model compression techniques concentrate on weight optimization, overlooking the exploration of optimal architectures.
arXiv Detail & Related papers (2024-09-25T21:32:12Z)
- FireFly: A High-Throughput Hardware Accelerator for Spiking Neural
Networks with Efficient DSP and Memory Optimization [6.966706170499345]
Spiking neural networks (SNNs) have been widely used due to their strong biological interpretability and high energy efficiency.
Most SNN hardware implementations for field-programmable gate arrays (FPGAs) cannot meet arithmetic or memory efficiency requirements.
We propose an FPGA accelerator, FireFly, that processes spikes generated by firing neurons on the fly.
arXiv Detail & Related papers (2023-01-05T04:28:07Z)
- Signed Binary Weight Networks [17.07866119979333]
Two important algorithmic techniques have shown promise for enabling efficient inference - sparsity and binarization.
We propose a new method called signed-binary networks to improve efficiency further.
Our method achieves accuracy comparable to binary networks on the ImageNet and CIFAR10 datasets and can lead to 69% sparsity.
arXiv Detail & Related papers (2022-11-25T00:19:21Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and
Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Federated Learning for Energy-limited Wireless Networks: A Partial Model
Aggregation Approach [79.59560136273917]
Limited communication resources (bandwidth and energy) and data heterogeneity across devices are the main bottlenecks for federated learning (FL).
We first devise a novel FL framework with partial model aggregation (PMA).
The proposed PMA-FL improves accuracy by 2.72% and 11.6% on two typical heterogeneous datasets.
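The summary above does not spell out how the partial aggregation is performed, so the sketch below only illustrates the general idea: the server averages a designated subset of parameters across clients while the remaining parameters stay on-device, reducing uplink traffic and energy. Which parameters are shared, and how that split is chosen, are assumptions here rather than the paper's actual rule.

```python
import numpy as np

def partial_aggregate(client_models: list[dict], shared_keys: set[str]) -> dict:
    """Average only the 'shared' parameters across clients.

    Parameters outside `shared_keys` are left untouched on each device,
    which cuts communication cost relative to full-model aggregation.
    """
    global_update = {}
    for key in shared_keys:
        global_update[key] = np.mean([m[key] for m in client_models], axis=0)
    return global_update

# Illustrative client models: two parameter blocks, only one is aggregated.
rng = np.random.default_rng(0)
clients = [
    {"encoder.w": rng.standard_normal((4, 4)), "head.w": rng.standard_normal((4, 2))}
    for _ in range(3)
]
shared = {"encoder.w"}                      # assumed shared block (illustration only)

agg = partial_aggregate(clients, shared)
for client in clients:
    client.update(agg)                      # broadcast the aggregated shared block
print("aggregated keys:", list(agg.keys()))
```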
arXiv Detail & Related papers (2022-04-20T19:09:52Z)
- Design Space Exploration of Dense and Sparse Mapping Schemes for RRAM
Architectures [2.788414791586367]
We present an extended Design Space Exploration methodology to quantify the benefits and limitations of dense and sparse mapping schemes.
We also present a case study quantifying and formalizing the trade-offs of typical non-idealities introduced into 1-Transistor-1-Resistor (1T1R) tiled memristive architectures.
arXiv Detail & Related papers (2022-01-18T02:16:10Z)
- Unfolding Projection-free SDP Relaxation of Binary Graph Classifier via
GDPA Linearization [59.87663954467815]
Algorithm unfolding creates an interpretable and parsimonious neural network architecture by implementing each iteration of a model-based algorithm as a neural layer.
In this paper, leveraging a recent linear algebraic theorem called Gershgorin disc perfect alignment (GDPA), we unroll a projection-free algorithm for semi-definite programming relaxation (SDR) of a binary graph classifier.
Experimental results show that our unrolled network outperformed pure model-based graph classifiers, and achieved comparable performance to pure data-driven networks but using far fewer parameters.
arXiv Detail & Related papers (2021-09-10T07:01:15Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware
Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization in multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
- Non-Volatile Memory Array Based Quantization- and Noise-Resilient LSTM
Neural Networks [1.5332481598232224]
This paper focuses on the application of a quantization-aware training algorithm to LSTM models.
We have shown that only 4-bit NVM weights and 4-bit ADC/DACs are needed to achieve LSTM network performance equivalent to the floating-point baseline.
Benchmark analysis of our proposed LSTM accelerator for inference has shown at least 2.4x better computing efficiency and 40x higher area efficiency than traditional digital approaches.
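As a rough illustration of the 4-bit weights and 4-bit ADC/DAC mentioned above, the sketch below applies uniform symmetric fake quantization to an LSTM gate weight matrix and to the output it produces. The bit widths follow the summary, but the per-tensor scale choice and rounding scheme are assumptions, not the paper's exact quantization-aware training recipe.

```python
import numpy as np

def fake_quantize(x: np.ndarray, num_bits: int = 4) -> np.ndarray:
    """Uniform symmetric quantization: snap values to a 4-bit grid, keep float storage.

    In quantization-aware training, the forward pass sees these quantized
    values while gradients typically flow through as if unquantized
    (straight-through estimator).
    """
    qmax = 2 ** (num_bits - 1) - 1                  # e.g. 7 for signed 4-bit
    scale = np.max(np.abs(x)) / qmax + 1e-12        # per-tensor scale (assumed)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((128, 64)) * 0.1            # an LSTM gate weight matrix
x = rng.standard_normal(64)                         # an input vector

W_q = fake_quantize(W, num_bits=4)                  # 4-bit "NVM" weights
y = fake_quantize(W_q @ x, num_bits=4)              # 4-bit ADC on the output

print("max weight quantization error:", np.max(np.abs(W - W_q)))
```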
arXiv Detail & Related papers (2020-02-25T02:59:45Z)
- SPEC2: SPECtral SParsE CNN Accelerator on FPGAs [31.31419913907224]
We propose SPEC2 -- the first work to prune and accelerate spectral CNNs.
We design an optimized pipeline architecture on FPGA that has efficient random access into sparse kernels.
The resulting accelerators achieve up to 24x higher throughput, compared with the state-of-the-art FPGA implementations for VGG16.
arXiv Detail & Related papers (2019-10-16T23:30:22Z)
This list is automatically generated from the titles and abstracts of the papers on this site.