Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM
Cells for Embedded FPGAs
- URL: http://arxiv.org/abs/2310.16842v2
- Date: Sat, 25 Nov 2023 14:27:39 GMT
- Title: Enhancing Energy-efficiency by Solving the Throughput Bottleneck of LSTM
Cells for Embedded FPGAs
- Authors: Chao Qian, Tianheng Ling, Gregor Schiele
- Abstract summary: This work proposes a novel LSTM cell optimisation aimed at energy-efficient inference on end devices.
It achieves at least 5.4$\times$ higher throughput and is 1.37$\times$ more energy efficient than existing approaches.
- Score: 22.293462679874008
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To process sensor data in the Internet of Things (IoT), embedded deep
learning for 1-dimensional data is an important technique. In the past, CNNs
were frequently used because they are simple to optimise for special embedded
hardware such as FPGAs. This work proposes a novel LSTM cell optimisation aimed
at energy-efficient inference on end devices. Using the traffic speed
prediction as a case study, a vanilla LSTM model with the optimised LSTM cell
achieves 17534 inferences per second while consuming only 3.8 $\mu$J per
inference on the FPGA XC7S15 from the Spartan-7 family. It achieves at least
5.4$\times$ higher throughput and is 1.37$\times$ more energy efficient than
existing approaches.
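A quick back-of-the-envelope reading of those figures (an illustrative calculation, not taken from the paper): at 17534 inferences per second and 3.8 $\mu$J per inference, the inference workload draws roughly 67 mW on average.

```python
# Back-of-the-envelope check of the reported figures (illustrative only, not from the paper).
throughput = 17534             # inferences per second (reported)
energy_per_inference = 3.8e-6  # joules per inference, i.e. 3.8 uJ (reported)

avg_power_w = throughput * energy_per_inference  # J/s drawn by inference on average
time_budget_s = 1.0 / throughput                 # time available per inference at this rate

print(f"average power ~ {avg_power_w * 1e3:.1f} mW")   # ~66.6 mW
print(f"time budget   ~ {time_budget_s * 1e6:.0f} us") # ~57 us per inference
```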
Related papers
- Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology [2.968768532937366]
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNN) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z)
- rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA [0.0]
This paper introduces a novel method to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA.
We leverage HLS4ML, a tool-flow that helps translate NNs into high-level synthesis (HLS) code.
Our method uses trained regression models for immediate pre-synthesis predictions.
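The general recipe of pre-synthesis estimation with trained regressors can be sketched as follows. This is a generic illustration; the feature set, targets, and model choice are assumptions, not rule4ml's actual interface.

```python
# Generic sketch: estimate FPGA resource use for an NN before synthesis by fitting
# a regressor on previously synthesised designs. Features and targets are hypothetical.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# One row per already-synthesised network: [layers, parameters, widest layer, bitwidth]
X_train = np.array([
    [2,  1_000,  32,  8],
    [4, 20_000, 128, 16],
    [8, 90_000, 256,  8],
    [6, 45_000, 192, 16],
])
y_train = np.array([1_800, 14_500, 52_000, 30_000])  # observed LUT usage (made up)

reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

candidate = np.array([[3, 8_000, 64, 8]])  # a new design, described before synthesis
print("predicted LUTs:", int(reg.predict(candidate)[0]))
```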
arXiv Detail & Related papers (2024-08-09T19:35:10Z)
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- SWAT: Scalable and Efficient Window Attention-based Transformers Acceleration on FPGAs [3.302913401404089]
Sliding window-based static sparse attention mitigates the quadratic cost of attention over long inputs by limiting the attention scope of the input tokens.
We propose a dataflow-aware FPGA-based accelerator design, SWAT, that efficiently leverages the sparsity to achieve scalable performance for long input.
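For reference, the sliding-window sparsity pattern itself (the software-level idea the accelerator exploits, sketched here in plain NumPy; window size and shapes are arbitrary, not taken from SWAT):

```python
# Sliding-window (banded) attention: token i may only attend to tokens j with
# |i - j| <= window, turning dense attention into a static, structured-sparse pattern.
import numpy as np

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    idx = np.arange(seq_len)
    return np.abs(idx[:, None] - idx[None, :]) <= window  # True where attention is allowed

def banded_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, window: int) -> np.ndarray:
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = np.where(sliding_window_mask(len(q), window), scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
q = k = v = rng.normal(size=(16, 8))       # 16 tokens, head dimension 8
out = banded_attention(q, k, v, window=2)  # each token sees at most 5 neighbours
print(out.shape)                           # (16, 8)
```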
arXiv Detail & Related papers (2024-05-27T10:25:08Z)
- HiRE: High Recall Approximate Top-$k$ Estimation for Efficient LLM Inference [68.59839755875252]
HiRE comprises two novel components: (i) a compression scheme to cheaply predict top-$k$ rows/columns with high recall, followed by full computation restricted to the predicted subset, and (ii) DA-TOP-$k$: an efficient multi-device approximate top-$k$ operator.
We demonstrate that on a one billion parameter model, HiRE applied to both the softmax and feedforward layers achieves almost matching pretraining and downstream accuracy, and speeds up inference latency by $1.47\times$ on a single TPUv5e device.
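The first component, read literally, is "predict the likely top-$k$ cheaply, then compute exactly only on that subset". A generic sketch of this pattern follows; the low-rank factorisation and the 2$k$ over-selection stand in for HiRE's actual compression scheme and recall tuning, which are assumptions here.

```python
# Generic sketch of approximate top-k: cheap low-rank scores pick candidates,
# exact scores are computed only for those rows. Sizes and rank are arbitrary.
# (A random Gaussian W has no low-rank structure, so real layers predict far better.)
import numpy as np

rng = np.random.default_rng(0)
d, n, r, k = 256, 4096, 16, 32           # hidden dim, output rows, sketch rank, top-k

W = rng.normal(size=(n, d))              # full weight matrix (e.g. an output/softmax layer)
U, S, Vt = np.linalg.svd(W, full_matrices=False)
W_lo, V_lo = U[:, :r] * S[:r], Vt[:r]    # rank-r factors used as the cheap predictor

x = rng.normal(size=(d,))

approx = W_lo @ (V_lo @ x)               # O(nr + rd) approximate scores
cand = np.argpartition(approx, -2 * k)[-2 * k:]  # over-select 2k candidates for recall
exact = W[cand] @ x                      # full computation restricted to the subset
topk = cand[np.argsort(exact)[-k:]]      # final top-k chosen from exact scores

print(topk.shape)                        # (32,)
```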
arXiv Detail & Related papers (2024-02-14T18:04:36Z)
- Understanding the Potential of FPGA-Based Spatial Acceleration for Large Language Model Inference [11.614722231006695]
Large language models (LLMs) boasting billions of parameters have generated a significant demand for efficient deployment in inference workloads.
This paper investigates the feasibility and potential of model-specific spatial acceleration for LLM inference on FPGAs.
arXiv Detail & Related papers (2023-12-23T04:27:06Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, addresses these challenges and introduces the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- Effective Pre-Training Objectives for Transformer-based Autoencoders [97.99741848756302]
We study trade-offs between efficiency, cost and accuracy of Transformer encoders.
We combine features of common objectives and create new effective pre-training approaches.
arXiv Detail & Related papers (2022-10-24T18:39:44Z)
- Spartus: A 9.4 TOp/s FPGA-based LSTM Accelerator Exploiting Spatio-temporal Sparsity [16.33285645435743]
We present a new accelerator called "Spartus" that exploits spatio-temporal sparsity to achieve ultra-low latency inference.
Spartus achieved 9.4 TOp/s effective batch-1 throughput and 1.1 TOp/J energy efficiency, which are respectively 4X and 7X higher than the previous state-of-the-art.
arXiv Detail & Related papers (2021-08-04T22:02:14Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)