Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural
Network Inference
- URL: http://arxiv.org/abs/2112.02346v1
- Date: Sat, 4 Dec 2021 14:23:24 GMT
- Title: Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural
Network Inference
- Authors: Erwei Wang, James J. Davis, Georgios-Ilias Stavrou, Peter Y. K.
Cheung, George A. Constantinides, Mohamed Abdelfattah
- Abstract summary: We propose the learned optimization of LUT-based DNN topologies for FPGAs, resulting in higher-efficiency designs.
Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K.
We propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference.
- Score: 3.2296078260106174
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: FPGA-specific DNN architectures using the native LUTs as independently
trainable inference operators have been shown to achieve favorable
area-accuracy and energy-accuracy tradeoffs. The first work in this area,
LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In
this paper, we propose the learned optimization of such LUT-based topologies,
resulting in higher-efficiency designs than via the direct use of
off-the-shelf, hand-designed networks. Existing implementations of this class
of architecture require the manual specification of the number of inputs per
LUT, K. Choosing appropriate K a priori is challenging, and doing so at even
high granularity, e.g. per layer, is a time-consuming and error-prone process
that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works
see LUT inputs connected randomly, which does not guarantee a good choice of
network topology. To address these issues, we propose logic shrinkage, a
fine-grained netlist pruning methodology enabling K to be automatically learned
for every LUT in a neural network targeted for FPGA inference. By removing LUT
inputs determined to be of low importance, our method increases the efficiency
of the resultant accelerators. Our GPU-friendly solution to LUT input removal
is capable of processing large topologies during their training with negligible
slowdown. With logic shrinkage, we better the area and energy efficiency of the
best-performing LUTNet implementation of the CNV network classifying CIFAR-10
by 1.54x and 1.31x, respectively, while matching its accuracy. This
implementation also reaches 2.71x the area efficiency of an equally accurate,
heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment
of logic shrinkage results in a post-synthesis area reduction of 2.67x vs
LUTNet, allowing for implementation that was previously impossible on today's
largest FPGAs.
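The idea of removing low-importance LUT inputs can be illustrated with a small sketch. It is not the paper's method, which learns which inputs to remove during training. As a stand-in, it scores each input of a fixed truth table by how often flipping that input changes the output, then drops inputs whose score falls below a threshold. The function names, the importance measure, and the threshold are all assumptions made for illustration. A K-input LUT stores 2^K truth-table entries, so each removed input halves the table.

```python
def input_importance(table, k):
    """Score each of the k inputs of a LUT (hypothetical measure).

    table[a] is the output for input pattern a, where bit i of a is input i.
    The score of input i is the fraction of patterns whose output changes
    when input i alone is flipped.
    """
    size = 1 << k
    scores = []
    for i in range(k):
        flips = sum(table[a] != table[a ^ (1 << i)] for a in range(size))
        scores.append(flips / size)
    return scores


def remove_input(table, k, i, fixed=0):
    """Tie input i to the constant `fixed` and return the (k-1)-input table."""
    new_table = []
    for a in range(1 << (k - 1)):
        low = a & ((1 << i) - 1)      # bits below position i stay in place
        high = (a >> i) << (i + 1)    # bits at or above i shift up by one
        new_table.append(table[high | (fixed << i) | low])
    return new_table


def shrink(table, k, threshold):
    """Drop the least important input while its score is below threshold."""
    while k > 1:
        scores = input_importance(table, k)
        i = min(range(k), key=scores.__getitem__)
        if scores[i] >= threshold:
            break
        table = remove_input(table, k, i)
        k -= 1
    return table, k


# A 3-input LUT computing x0 XOR x1; input x2 never affects the output.
xor_table = [(a & 1) ^ ((a >> 1) & 1) for a in range(8)]
result = shrink(xor_table, 3, threshold=0.1)  # -> ([0, 1, 1, 0], 2)
```

In this example the unused third input is removed, so the LUT shrinks from K = 3 (8 entries) to K = 2 (4 entries) without changing the function it computes.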
Related papers
- NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions [2.7086888205833968]
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks.
We propose relaxing the boundaries of neurons and mapping entire sub-networks to a single LUT.
We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST.
arXiv Detail & Related papers (2024-02-29T16:10:21Z)
- OTOv3: Automatic Architecture-Agnostic Neural Network Training and Compression from Structured Pruning to Erasing Operators [57.145175475579315]
This topic spans various techniques, from structured pruning to neural architecture search, encompassing both pruning and erasing operators perspectives.
We introduce the third-generation Only-Train-Once (OTOv3), which first automatically trains and compresses a general DNN through pruning and erasing operations.
Our empirical results demonstrate the efficacy of OTOv3 across various benchmarks in structured pruning and neural architecture search.
arXiv Detail & Related papers (2023-12-15T00:22:55Z)
- FPGA-based AI Smart NICs for Scalable Distributed AI Training Systems [62.20308752994373]
We propose a new smart network interface card (NIC) for distributed AI training systems using field-programmable gate arrays (FPGAs).
Our proposed FPGA-based AI smart NIC enhances overall training performance by 1.6x at 6 nodes, with an estimated 2.5x performance improvement at 32 nodes, compared to the baseline system using conventional NICs.
arXiv Detail & Related papers (2022-04-22T21:57:00Z)
- Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic [4.119948826527649]
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
- A Meta-Learning Approach to the Optimal Power Flow Problem Under Topology Reconfigurations [69.73803123972297]
We propose a DNN-based OPF predictor that is trained using a meta-learning (MTL) approach.
The developed OPF-predictor is validated through simulations using benchmark IEEE bus systems.
arXiv Detail & Related papers (2020-12-21T17:39:51Z)
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks [63.91384986073851]
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
- Fully-parallel Convolutional Neural Network Hardware [0.7829352305480285]
We propose a new power-and-area-efficient architecture for implementing Artificial Neural Networks (ANNs) in hardware.
For the first time, a fully-parallel CNN such as LeNet-5 is embedded and tested in a single FPGA.
arXiv Detail & Related papers (2020-06-22T17:19:09Z)
- A Learning Framework for n-bit Quantized Neural Networks toward FPGAs [20.83904734716565]
This paper proposes a novel learning framework for n-bit QNNs, whose weights are constrained to powers of two.
We also propose a novel QNN structure named n-BQ-NN, which uses shift operations to replace multiply operations.
Experiments show that our n-BQ-NN with our SVPE can execute 2.9 times faster than with the vector processing element (VPE) in inference.
arXiv Detail & Related papers (2020-04-06T04:21:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.