Fast convolutional neural networks on FPGAs with hls4ml
- URL: http://arxiv.org/abs/2101.05108v1
- Date: Wed, 13 Jan 2021 14:47:11 GMT
- Title: Fast convolutional neural networks on FPGAs with hls4ml
- Authors: Thea Aarrestad, Vladimir Loncar, Maurizio Pierini, Sioni Summers,
Jennifer Ngadiuba, Christoffer Petersson, Hampus Linander, Yutaro Iiyama,
Giuseppe Di Guglielmo, Javier Duarte, Philip Harris, Dylan Rankin, Sergo
Jindariani, Kevin Pedro, Nhan Tran, Mia Liu, Edward Kreinar, Zhenbin Wu, and
Duc Hoang
- Abstract summary: We introduce an automated tool for deploying ultra low-latency, low-power deep neural networks on FPGAs.
We demonstrate how to achieve inference latency of $5\,\mu$s using convolutional architectures, while preserving state-of-the-art model performance.
- Score: 0.22756183402372013
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We introduce an automated tool for deploying ultra low-latency, low-power
deep neural networks with large convolutional layers on FPGAs. By extending the
hls4ml library, we demonstrate how to achieve inference latency of $5\,\mu$s
using convolutional architectures, while preserving state-of-the-art model
performance. Considering benchmark models trained on the Street View House
Numbers Dataset, we demonstrate various methods for model compression in order
to fit the computational constraints of a typical FPGA device. In particular,
we discuss pruning and quantization-aware training, and demonstrate how
resource utilization can be reduced by over 90% while maintaining the original
model accuracy.
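As a rough illustration of the workflow the abstract describes, here is a minimal sketch combining QKeras quantization-aware training with an hls4ml conversion. The model, bit widths, and FPGA part are placeholders rather than the paper's benchmark configuration, and the exact API may differ across hls4ml versions.

```python
# Sketch: quantization-aware training with QKeras, then conversion to an
# FPGA firmware project with hls4ml. Settings are illustrative only.
from tensorflow.keras.layers import Input, Flatten
from tensorflow.keras.models import Model
from qkeras import QConv2D, QDense, QActivation, quantized_bits, quantized_relu
import hls4ml

inputs = Input(shape=(32, 32, 3))
x = QConv2D(16, (3, 3),
            kernel_quantizer=quantized_bits(6, 0, alpha=1),
            bias_quantizer=quantized_bits(6, 0, alpha=1))(inputs)
x = QActivation(quantized_relu(6))(x)
x = Flatten()(x)
outputs = QDense(10, kernel_quantizer=quantized_bits(6, 0, alpha=1),
                 bias_quantizer=quantized_bits(6, 0, alpha=1))(x)
model = Model(inputs, outputs)
# ... compile and train the quantized model here ...

# Convert to HLS; io_stream is the streaming mode used for large
# convolutional layers. The part number is an example device.
config = hls4ml.utils.config_from_keras_model(model, granularity='name')
hls_model = hls4ml.converters.convert_from_keras_model(
    model, hls_config=config, io_type='io_stream',
    output_dir='my_hls_prj', part='xcu250-figd2104-2L-e')
hls_model.compile()  # builds a C simulation for bit-accurate checks
```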
Related papers
- Compressing Recurrent Neural Networks for FPGA-accelerated Implementation in Fluorescence Lifetime Imaging [3.502427552446068]
Deep learning models enable real-time inference, but can be computationally demanding due to complex architectures and large matrix operations.
This makes DL models ill-suited for direct implementation on field-programmable gate array (FPGA)-based camera hardware.
In this work, we focus on compressing recurrent neural networks (RNNs), which are well-suited for FLI time-series data processing, to enable deployment on resource-constrained FPGA boards; a pruning sketch follows.
arXiv Detail & Related papers (2024-10-01T17:23:26Z)
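As a hedged illustration of one standard RNN compression step, the sketch below applies magnitude pruning to an LSTM in PyTorch; the sparsity level and layer sizes are illustrative, not taken from the paper.

```python
# Sketch: magnitude pruning of an LSTM's weight matrices in PyTorch,
# a common step when fitting an RNN onto a small FPGA.
import torch
import torch.nn.utils.prune as prune

rnn = torch.nn.LSTM(input_size=256, hidden_size=32, batch_first=True)

# Remove the 50% smallest-magnitude weights in each recurrent matrix.
for name in ("weight_ih_l0", "weight_hh_l0"):
    prune.l1_unstructured(rnn, name=name, amount=0.5)
    prune.remove(rnn, name)  # make the pruning permanent

x = torch.randn(1, 128, 256)   # (batch, time steps, features)
out, _ = rnn(x)
```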
- rule4ml: An Open-Source Tool for Resource Utilization and Latency Estimation for ML Models on FPGA [0.0]
This paper introduces a novel method to predict the resource utilization and inference latency of Neural Networks (NNs) before their synthesis and implementation on FPGA.
We leverage HLS4ML, a tool-flow that helps translate NNs into high-level synthesis (HLS) code.
Our method uses trained regression models for immediate pre-synthesis predictions; a sketch of the idea follows.
arXiv Detail & Related papers (2024-08-09T19:35:10Z)
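The snippet below illustrates the general idea with scikit-learn; the features, data, and regressor are hypothetical stand-ins, not rule4ml's actual inputs or models.

```python
# Sketch: fit a regression model mapping coarse NN descriptors to
# post-synthesis resource counts, enabling pre-synthesis estimates.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Hypothetical features: [n_layers, total_params, max_layer_width, bit_width]
X = np.array([[3, 2e3, 64, 8], [5, 1e4, 128, 8], [4, 6e3, 64, 16]])
y_luts = np.array([12e3, 48e3, 35e3])  # made-up synthesized LUT counts

reg = GradientBoostingRegressor().fit(X, y_luts)
print(reg.predict([[4, 5e3, 64, 8]]))  # pre-synthesis LUT estimate
```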
- Iterative Filter Pruning for Concatenation-based CNN Architectures [9.651318927588934]
Modern object detectors have highly interconnected convolutional layers with concatenations.
We propose a method to handle concatenation layers, based on the connectivity graph of convolutional layers.
We deploy the pruned models on an FPGA and an NVIDIA Jetson Xavier AGX; a sketch of the concatenation bookkeeping follows.
arXiv Detail & Related papers (2024-05-04T19:40:42Z)
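A minimal sketch of the bookkeeping such a method must automate: when pruned layers feed a concatenation, the surviving channel indices must be offset and propagated to downstream layers. Shapes and kept-filter choices are illustrative.

```python
# Sketch: propagating filter pruning through a concatenation.
import torch

convA = torch.nn.Conv2d(3, 8, 3, padding=1)
convB = torch.nn.Conv2d(3, 8, 3, padding=1)
head = torch.nn.Conv2d(16, 4, 1)  # consumes cat([A, B], dim=1)

keepA = [0, 2, 5]        # filters kept in convA after pruning
keepB = [1, 3, 4, 7]     # filters kept in convB

# In the original concatenated tensor, convB's channels start at
# index 8 (convA's original width), so offset its kept indices.
kept_cat = keepA + [8 + k for k in keepB]

# Slim the upstream filters, then the matching input channels of the
# layer that reads the concatenation.
convA.weight.data = convA.weight.data[keepA]
convA.bias.data = convA.bias.data[keepA]
convB.weight.data = convB.weight.data[keepB]
convB.bias.data = convB.bias.data[keepB]
head.weight.data = head.weight.data[:, kept_cat]

x = torch.randn(1, 3, 16, 16)
y = head(torch.cat([convA(x), convB(x)], dim=1))  # shapes still agree
```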
- Trainable Fixed-Point Quantization for Deep Learning Acceleration on FPGAs [30.325651150798915]
Quantization is a crucial technique for deploying deep learning models on resource-constrained devices, such as embedded FPGAs.
We present QFX, a trainable fixed-point quantization approach that automatically learns the binary-point position during model training.
QFX is implemented as a PyTorch-based library that efficiently emulates fixed-point arithmetic, supported by FPGA HLS; a sketch of the core idea follows.
arXiv Detail & Related papers (2024-01-31T02:18:27Z)
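The snippet below is not the QFX library's API; it only illustrates the underlying idea of a learnable binary-point position trained with a straight-through estimator, using hypothetical names.

```python
# Sketch: fixed-point quantization with a trainable binary point.
import torch

class LearnableFixedPoint(torch.nn.Module):
    def __init__(self, total_bits=8, init_frac_bits=4.0):
        super().__init__()
        self.total_bits = total_bits
        self.frac_bits = torch.nn.Parameter(torch.tensor(init_frac_bits))

    def forward(self, x):
        scale = 2.0 ** self.frac_bits            # differentiable in frac_bits
        qmax = 2.0 ** (self.total_bits - 1) - 1  # signed range
        y = torch.clamp(x * scale, -qmax - 1, qmax)
        y = y + (torch.round(y) - y).detach()    # straight-through rounding
        return y / scale

q = LearnableFixedPoint()
x = torch.randn(4, requires_grad=True)
q(x).sum().backward()     # gradients flow to x and to frac_bits
print(q.frac_bits.grad)
```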
- Symbolic Regression on FPGAs for Fast Machine Learning Inference [2.0920303420933273]
The high-energy physics community is investigating the potential of deploying machine-learning-based solutions on Field-Programmable Gate Arrays (FPGAs).
We introduce a novel end-to-end procedure that utilizes a machine learning technique called symbolic regression (SR).
We show that our approach can approximate a 3-layer neural network using an inference model that achieves up to a 13-fold decrease in execution time, down to 5 ns, while still preserving more than 90% approximation accuracy; a sketch follows.
arXiv Detail & Related papers (2023-05-06T17:04:02Z)
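A sketch of the distillation step using PySR, a common symbolic-regression package (whether it matches the paper's exact toolchain is an assumption); data and operator choices are illustrative.

```python
# Sketch: distill a network's input-output map into a closed-form
# expression, which can be synthesized as short FPGA arithmetic.
import numpy as np
from pysr import PySRRegressor

X = np.random.randn(500, 3)
y = np.tanh(X[:, 0]) + 0.5 * X[:, 1] * X[:, 2]  # stand-in for NN outputs

sr = PySRRegressor(
    niterations=40,
    binary_operators=["+", "*"],
    unary_operators=["tanh"],
    maxsize=15,  # cap expression size, which bounds FPGA latency/area
)
sr.fit(X, y)
print(sr.get_best())  # most accurate expression within the size budget
```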
- HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on FPGA Devices [71.45672882756001]
This study introduces a novel streaming architecture based toolflow for mapping 3D Convolutional Neural Networks onto FPGAs.
The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics.
The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs; a sketch of the ONNX front end follows.
arXiv Detail & Related papers (2023-03-30T08:25:27Z)
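A sketch of the input stage only: loading an ONNX graph and walking its nodes with the onnx package. The file name is a placeholder, and the real parsing and mapping logic is far more involved.

```python
# Sketch: read a 3D CNN saved in ONNX format and inspect its graph,
# the first step of any ONNX-driven toolflow.
import onnx

model = onnx.load("har_3dcnn.onnx")   # placeholder file name
onnx.checker.check_model(model)

for node in model.graph.node:
    if node.op_type == "Conv":
        # 3D convolutions appear as Conv nodes with 3 spatial dims.
        print(node.name, node.op_type, list(node.input))
```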
- Quantized Neural Networks for Low-Precision Accumulation with Guaranteed Overflow Avoidance [68.8204255655161]
We introduce a quantization-aware training algorithm that guarantees avoiding numerical overflow when reducing the precision of accumulators during inference.
We evaluate our algorithm across multiple quantized models that we train for different tasks, showing that our approach can reduce the precision of accumulators while maintaining model accuracy with respect to a floating-point baseline; a sketch of the underlying bound follows.
arXiv Detail & Related papers (2023-01-31T02:46:57Z)
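For intuition, the snippet below computes the standard conservative accumulator-width bound for a dot product; the paper's contribution is a training-time constraint that guarantees safety at widths below this bound.

```python
# Sketch: worst-case accumulator width for a quantized dot product.
import math

def accumulator_bits(n_terms, weight_bits, act_bits):
    # Each signed(weight_bits) x unsigned(act_bits) product fits in
    # weight_bits + act_bits bits; summing n_terms of them adds
    # ceil(log2(n_terms)) carry bits.
    return weight_bits + act_bits + math.ceil(math.log2(n_terms))

# A 128-input neuron with 8-bit weights and activations:
print(accumulator_bits(128, 8, 8))  # -> 23 bits to guarantee no overflow
```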
- Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires substantial computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised; a sketch follows.
arXiv Detail & Related papers (2021-10-16T02:41:35Z)
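A sketch of the general sampling pattern with PyTorch Geometric's NeighborLoader; the paper's performance-engineered sampler replaces this generic component.

```python
# Sketch: mini-batch GNN training with neighborhood sampling.
import torch
from torch_geometric.data import Data
from torch_geometric.loader import NeighborLoader

# Tiny random graph standing in for a large benchmark dataset.
data = Data(x=torch.randn(1000, 16),
            edge_index=torch.randint(0, 1000, (2, 5000)))

loader = NeighborLoader(
    data,
    num_neighbors=[15, 10],  # fan-out per hop for a 2-layer GNN
    batch_size=128,
)
for batch in loader:
    # batch is a sampled subgraph rooted at 128 seed nodes.
    print(batch.num_nodes, batch.edge_index.shape)
    break
```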
- LCS: Learning Compressible Subspaces for Adaptive Network Compression at Inference Time [57.52251547365967]
We propose a method for training a "compressible subspace" of neural networks that contains a fine-grained spectrum of models.
We present results for achieving arbitrarily fine-grained accuracy-efficiency trade-offs at inference time for structured and unstructured sparsity.
Our algorithm extends to quantization at variable bit widths, achieving accuracy on par with individually trained networks; a sketch of the subspace idea follows.
arXiv Detail & Related papers (2021-10-08T17:03:34Z)
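A toy sketch of a one-dimensional compressible subspace: two trained weight endpoints, with sparsity chosen at inference time. The layer and pruning rule are illustrative, not the paper's method.

```python
# Sketch: a line in weight space where any interpolation is a valid
# model and efficiency (sparsity) varies along the line.
import torch

w1 = torch.nn.Parameter(torch.randn(64, 64))  # accurate endpoint
w2 = torch.nn.Parameter(torch.randn(64, 64))  # compressible endpoint

def sample_weights(alpha, sparsity):
    # One point on the subspace, pruned to the requested sparsity.
    w = (1 - alpha) * w1 + alpha * w2
    k = int(sparsity * w.numel())
    if k == 0:
        return w
    thresh = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > thresh)

# At inference, pick the accuracy/efficiency trade-off on the fly:
w_fast = sample_weights(alpha=0.9, sparsity=0.8)   # small and sparse
w_accurate = sample_weights(alpha=0.1, sparsity=0.0)
```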
- Compact representations of convolutional neural networks via weight pruning and quantization [63.417651529192014]
We propose a novel storage format for convolutional neural networks (CNNs) based on source coding and leveraging both weight pruning and quantization.
We achieve a reduction of space occupancy up to 0.6% on fully connected layers and 5.44% on the whole network, while performing at least as competitively as the baseline; a sketch of the storage idea follows.
arXiv Detail & Related papers (2021-08-28T20:39:54Z)
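A sketch of the storage pipeline: prune, quantize to a small codebook, then apply source coding. Here zlib stands in for the paper's coding scheme, and all settings are illustrative.

```python
# Sketch: compact storage via pruning + quantization + entropy coding.
import zlib
import numpy as np

w = np.random.randn(256, 256).astype(np.float32)
w[np.abs(w) < 0.5] = 0.0                     # magnitude pruning

levels = np.linspace(w.min(), w.max(), 15)   # small codebook (+ zero)
idx = np.abs(w[..., None] - levels).argmin(-1).astype(np.uint8)
idx[w == 0] = 255                            # reserved symbol for zeros

packed = zlib.compress(idx.tobytes(), level=9)
print(len(packed) / w.nbytes)                # fraction of original size
```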
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations, and 41% in execution time, with minimal loss in accuracy; see the sketch after this entry.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
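A toy sketch of the filter-sharing idea: an autoencoder learns a low-dimensional code for a layer's convolutional filters so they can be reconstructed from a shared basis. All sizes and the training loop are illustrative, not the paper's setup.

```python
# Sketch: compress a filter bank with a linear autoencoder.
import torch

filters = torch.randn(64, 27)           # 64 filters, flattened 3x3x3

encoder = torch.nn.Linear(27, 8)        # 27 -> 8-dim filter codes
decoder = torch.nn.Linear(8, 27)        # shared low-rank basis
opt = torch.optim.Adam(list(encoder.parameters()) +
                       list(decoder.parameters()), lr=1e-3)

for _ in range(200):
    recon = decoder(encoder(filters))
    loss = torch.nn.functional.mse_loss(recon, filters)
    opt.zero_grad(); loss.backward(); opt.step()

# Store only the 8-dim codes and the shared decoder: fewer parameters
# than the original filter bank once layers are wide enough.
codes = encoder(filters).detach()
```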
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.