Logic Shrinkage: Learned FPGA Netlist Sparsity for Efficient Neural
Network Inference
- URL: http://arxiv.org/abs/2112.02346v1
- Date: Sat, 4 Dec 2021 14:23:24 GMT
- Authors: Erwei Wang, James J. Davis, Georgios-Ilias Stavrou, Peter Y. K.
Cheung, George A. Constantinides, Mohamed Abdelfattah
- Abstract summary: We propose the learned optimization of LUT-based DNN topologies, in which an FPGA's native LUTs serve as independently trainable inference operators, resulting in higher-efficiency designs.
Existing implementations of this class of architecture require the manual specification of the number of inputs per LUT, K.
We propose logic shrinkage, a fine-grained netlist pruning methodology enabling K to be automatically learned for every LUT in a neural network targeted for FPGA inference.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: FPGA-specific DNN architectures using the native LUTs as independently
trainable inference operators have been shown to achieve favorable
area-accuracy and energy-accuracy tradeoffs. The first work in this area,
LUTNet, exhibited state-of-the-art performance for standard DNN benchmarks. In
this paper, we propose the learned optimization of such LUT-based topologies,
resulting in higher-efficiency designs than via the direct use of
off-the-shelf, hand-designed networks. Existing implementations of this class
of architecture require the manual specification of the number of inputs per
LUT, K. Choosing appropriate K a priori is challenging, and doing so at even
high granularity, e.g. per layer, is a time-consuming and error-prone process
that leaves FPGAs' spatial flexibility underexploited. Furthermore, prior works
see LUT inputs connected randomly, which does not guarantee a good choice of
network topology. To address these issues, we propose logic shrinkage, a
fine-grained netlist pruning methodology enabling K to be automatically learned
for every LUT in a neural network targeted for FPGA inference. By removing LUT
inputs determined to be of low importance, our method increases the efficiency
of the resultant accelerators. Our GPU-friendly solution to LUT input removal
is capable of processing large topologies during their training with negligible
slowdown. With logic shrinkage, we better the area and energy efficiency of the
best-performing LUTNet implementation of the CNV network classifying CIFAR-10
by 1.54x and 1.31x, respectively, while matching its accuracy. This
implementation also reaches 2.71x the area efficiency of an equally accurate,
heavily pruned BNN. On ImageNet with the Bi-Real Net architecture, employment
of logic shrinkage results in a post-synthesis area reduction of 2.67x vs
LUTNet, allowing for implementation that was previously impossible on today's
largest FPGAs.
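The idea of learning K per LUT by removing low-importance inputs can be illustrated with a minimal sketch. The abstract does not state how importance is measured or how removal is scheduled during training, so everything here is an assumption made for illustration: the per-input `importance` scores, the fixed `threshold`, and the helper names `shrink_lut_inputs` and `truth_table_entries` are hypothetical and are not the authors' method. The truth-table count is only a rough proxy for cost, not an FPGA area model.

```python
from typing import List, Sequence


def shrink_lut_inputs(
    importance: Sequence[Sequence[float]], threshold: float
) -> List[List[int]]:
    """For each LUT, keep the indices of inputs whose score meets the threshold.

    `importance[j][i]` is a hypothetical importance score for input i of LUT j.
    """
    return [
        [i for i, score in enumerate(scores) if score >= threshold]
        for scores in importance
    ]


def truth_table_entries(kept: Sequence[Sequence[int]]) -> int:
    """Total truth-table entries: a LUT with k inputs stores 2**k entries."""
    return sum(2 ** len(inputs) for inputs in kept)


# Hypothetical scores for three LUTs, each starting with K = 4 inputs.
importance = [
    [0.90, 0.05, 0.40, 0.02],
    [0.30, 0.80, 0.70, 0.60],
    [0.01, 0.03, 0.95, 0.02],
]

kept = shrink_lut_inputs(importance, threshold=0.1)
learned_k = [len(inputs) for inputs in kept]  # per-LUT K after shrinkage

entries_before = len(importance) * 2 ** 4
entries_after = truth_table_entries(kept)

print(learned_k)                      # [2, 4, 1]
print(entries_before, entries_after)  # 48 22
```

In this toy case each LUT ends up with its own K (2, 4 and 1) rather than a single hand-chosen value, which is the per-LUT granularity the abstract describes.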
Related papers
- LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference
This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications.
By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators.
arXiv Detail & Related papers (2024-11-01T02:54:11Z)
- Fast Graph Sharpness-Aware Minimization for Enhancing and Accelerating Few-Shot Node Classification
Graph Neural Networks (GNNs) have shown superior performance in node classification.
We present Fast Graph Sharpness-Aware Minimization (FGSAM) that integrates the rapid training of Multi-Layer Perceptrons with the superior performance of GNNs.
Our proposed algorithm outperforms the standard SAM with lower computational costs in FSNC tasks.
arXiv Detail & Related papers (2024-10-22T09:33:29Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Hardware-Software Co-optimised Fast and Accurate Deep Reconfigurable Spiking Inference Accelerator Architecture Design Methodology
Spiking Neural Networks (SNNs) have emerged as a promising approach to improve the energy efficiency of machine learning models.
We develop a hardware-software co-optimisation strategy to port software-trained deep neural networks (DNNs) to reduced-precision spiking models.
arXiv Detail & Related papers (2024-10-07T05:04:13Z)
- NeuraLUT: Hiding Neural Network Density in Boolean Synthesizable Functions
Field-Programmable Gate Array (FPGA) accelerators have proven successful in handling latency- and resource-critical deep neural network (DNN) inference tasks.
We propose relaxing the boundaries of neurons and mapping entire sub-networks to a single LUT.
We validate our proposed method on a known latency-critical task, jet substructure tagging, and on the classical computer vision task, digit classification using MNIST.
arXiv Detail & Related papers (2024-02-29T16:10:21Z)
- T-GAE: Transferable Graph Autoencoder for Network Alignment
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z)
- Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
- Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration
We propose a novel encoding scheme using -1, +1 to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
arXiv Detail & Related papers (2021-06-18T03:11:15Z)
- NullaNet Tiny: Ultra-low-latency DNN Inference Through Fixed-function Combinational Logic
Field-programmable gate array (FPGA)-based accelerators are gaining traction as a serious contender to replace graphics processing unit/central processing unit-based platforms.
This paper presents NullaNet Tiny, a framework for constructing resource and energy-efficient, ultra-low-latency FPGA-based neural network accelerators.
arXiv Detail & Related papers (2021-04-07T00:16:39Z)
- ALF: Autoencoder-based Low-rank Filter-sharing for Efficient Convolutional Neural Networks
We propose the autoencoder-based low-rank filter-sharing technique (ALF).
ALF shows a reduction of 70% in network parameters, 61% in operations and 41% in execution time, with minimal loss in accuracy.
arXiv Detail & Related papers (2020-07-27T09:01:22Z)
- Fully-parallel Convolutional Neural Network Hardware
We propose a new power-and-area-efficient architecture for implementing Artificial Neural Networks (ANNs) in hardware.
For the first time, a fully-parallel CNN such as LeNet-5 is embedded and tested in a single FPGA.
arXiv Detail & Related papers (2020-06-22T17:19:09Z)