Vau da muntanialas: Energy-efficient multi-die scalable acceleration of
RNN inference
- URL: http://arxiv.org/abs/2202.07462v1
- Date: Mon, 14 Feb 2022 09:21:16 GMT
- Title: Vau da muntanialas: Energy-efficient multi-die scalable acceleration of
RNN inference
- Authors: Gianna Paulin, Francesco Conti, Lukas Cavigelli, Luca Benini
- Abstract summary: We present Muntaniala, an RNN accelerator architecture for LSTM inference with a silicon-measured energy-efficiency of 3.25$TOP/s/W$.
The scalable design of Muntaniala allows running large RNN models by combining multiple tiles in a systolic array.
We show a phoneme error rate (PER) drop of approximately 3% with respect to floating-point (FP) on a 3L-384NH-123NI LSTM network.
- Score: 18.50014427283814
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recurrent neural networks such as Long Short-Term Memories (LSTMs) learn
temporal dependencies by keeping an internal state, making them ideal for
time-series problems such as speech recognition. However, the output-to-input
feedback creates distinctive memory bandwidth and scalability challenges in
designing accelerators for RNNs. We present Muntaniala, an RNN accelerator
architecture for LSTM inference with a silicon-measured energy-efficiency of
3.25$TOP/s/W$ and performance of 30.53$GOP/s$ in UMC 65 $nm$ technology. The
scalable design of Muntaniala allows running large RNN models by combining
multiple tiles in a systolic array. We keep all parameters stationary on every
die in the array, drastically reducing the I/O communication to only loading
new features and sharing partial results with other dies. For quantifying the
overall system power, including I/O power, we built Vau da Muntanialas, to the
best of our knowledge, the first demonstration of a systolic multi-chip-on-PCB
array of RNN accelerator. Our multi-die prototype performs LSTM inference with
192 hidden states in 330$\mu s$ with a total system power of 9.0$mW$ at 10$MHz$
consuming 2.95$\mu J$. Targeting the 8/16-bit quantization implemented in
Muntaniala, we show a phoneme error rate (PER) drop of approximately 3% with
respect to floating-point (FP) on a 3L-384NH-123NI LSTM network on the TIMIT
dataset.
Related papers
- GhostRNN: Reducing State Redundancy in RNN with Cheap Operations [66.14054138609355]
We propose an efficient RNN architecture, GhostRNN, which reduces hidden state redundancy with cheap operations.
Experiments on KWS and SE tasks demonstrate that the proposed GhostRNN significantly reduces the memory usage (40%) and computation cost while keeping performance similar.
arXiv Detail & Related papers (2024-11-20T11:37:14Z) - Efficient k-Nearest-Neighbor Machine Translation with Dynamic Retrieval [49.825549809652436]
$k$NN-MT constructs an external datastore to store domain-specific translation knowledge.
adaptive retrieval ($k$NN-MT-AR) dynamically estimates $lambda$ and skips $k$NN retrieval if $lambda$ is less than a fixed threshold.
We propose dynamic retrieval ($k$NN-MT-DR) that significantly extends vanilla $k$NN-MT in two aspects.
arXiv Detail & Related papers (2024-06-10T07:36:55Z) - ApproxDARTS: Differentiable Neural Architecture Search with Approximate Multipliers [0.24578723416255746]
We present ApproxDARTS, a neural architecture search (NAS) method enabling the popular differentiable neural architecture search method called DARTS to exploit approximate multipliers.
We show that the ApproxDARTS is able to perform a complete architecture search within less than $10$ GPU hours and produce competitive convolutional neural networks (CNN) containing approximate multipliers in convolutional layers.
arXiv Detail & Related papers (2024-04-08T09:54:57Z) - Stochastic Spiking Attention: Accelerating Attention with Stochastic
Computing in Spiking Networks [33.51445486269896]
Spiking Neural Networks (SNNs) have been recently integrated into Transformer architectures due to their potential to reduce computational demands and to improve power efficiency.
We propose a novel framework leveraging computing (SC) to effectively execute the dot-product attention for SNN-based Transformers.
arXiv Detail & Related papers (2024-02-14T11:47:19Z) - Shared Memory-contention-aware Concurrent DNN Execution for Diversely
Heterogeneous System-on-Chips [0.32634122554914]
HaX-CoNN is a novel scheme that characterizes and maps layers in concurrently executing inference workloads.
We evaluate HaX-CoNN on NVIDIA Orin, NVIDIA Xavier, and Qualcomm Snapdragon 865 SOCs.
arXiv Detail & Related papers (2023-08-10T22:47:40Z) - Fully $1\times1$ Convolutional Network for Lightweight Image
Super-Resolution [79.04007257606862]
Deep models have significant process on single image super-resolution (SISR) tasks, in particular large models with large kernel ($3times3$ or more)
$1times1$ convolutions bring substantial computational efficiency, but struggle with aggregating local spatial representations.
We propose a simple yet effective fully $1times1$ convolutional network, named Shift-Conv-based Network (SCNet)
arXiv Detail & Related papers (2023-07-30T06:24:03Z) - Accurate, Low-latency, Efficient SAR Automatic Target Recognition on
FPGA [3.251765107970636]
Synthetic aperture radar (SAR) automatic target recognition (ATR) is the key technique for remote-sensing image recognition.
The state-of-the-art convolutional neural networks (CNNs) for SAR ATR suffer from emphhigh computation cost and emphlarge memory footprint.
We propose a comprehensive GNN-based model-architecture co-design on FPGA to address the above issues.
arXiv Detail & Related papers (2023-01-04T05:35:30Z) - Bayesian Neural Network Language Modeling for Speech Recognition [59.681758762712754]
State-of-the-art neural network language models (NNLMs) represented by long short term memory recurrent neural networks (LSTM-RNNs) and Transformers are becoming highly complex.
In this paper, an overarching full Bayesian learning framework is proposed to account for the underlying uncertainty in LSTM-RNN and Transformer LMs.
arXiv Detail & Related papers (2022-08-28T17:50:19Z) - Accelerating Inference and Language Model Fusion of Recurrent Neural
Network Transducers via End-to-End 4-bit Quantization [35.198615417316056]
We report on aggressive quantization strategies that greatly accelerate inference of Recurrent Neural Network Transducers (RNN-T)
We use a 4 bit integer representation for both weights and activations and apply Quantization Aware Training (QAT) to retrain the full model.
We show that customized quantization schemes that are tailored to the local properties of the network are essential to achieve good performance.
arXiv Detail & Related papers (2022-06-16T02:17:49Z) - ANNETTE: Accurate Neural Network Execution Time Estimation with Stacked
Models [56.21470608621633]
We propose a time estimation framework to decouple the architectural search from the target hardware.
The proposed methodology extracts a set of models from micro- kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation.
We compare estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model for evaluation.
arXiv Detail & Related papers (2021-05-07T11:39:05Z) - SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost
Computation [97.78417228445883]
We present SmartExchange, an algorithm- hardware co-design framework for energy-efficient inference of deep neural networks (DNNs)
We develop a novel algorithm to enforce a specially favorable DNN weight structure, where each layerwise weight matrix can be stored as the product of a small basis matrix and a large sparse coefficient matrix whose non-zero elements are all power-of-2.
We further design a dedicated accelerator to fully utilize the SmartExchange-enforced weights to improve both energy efficiency and latency performance.
arXiv Detail & Related papers (2020-05-07T12:12:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.