SPEC2: SPECtral SParsE CNN Accelerator on FPGAs
- URL: http://arxiv.org/abs/1910.11103v2
- Date: Wed, 11 Oct 2023 00:11:45 GMT
- Title: SPEC2: SPECtral SParsE CNN Accelerator on FPGAs
- Authors: Yue Niu, Hanqing Zeng, Ajitesh Srivastava, Kartik Lakhotia, Rajgopal
Kannan, Yanzhi Wang, Viktor Prasanna
- Abstract summary: We propose SPEC2 -- the first work to prune and accelerate spectral CNNs.
We design an optimized pipeline architecture on FPGA that has efficient random access into sparse kernels.
The resulting accelerators achieve up to 24x higher throughput, compared with the state-of-the-art FPGA implementations for VGG16.
- Score: 31.31419913907224
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: To accelerate inference of Convolutional Neural Networks (CNNs), various
techniques have been proposed to reduce computation redundancy. Converting
convolutional layers into the frequency domain significantly reduces the
computation complexity of the sliding window operations in the spatial domain. On the
other hand, weight pruning techniques address the redundancy in model
parameters by converting dense convolutional kernels into sparse ones. To
obtain a high-throughput FPGA implementation, we propose SPEC2 -- the first work
to prune and accelerate spectral CNNs. First, we propose a systematic pruning
algorithm based on the Alternating Direction Method of Multipliers (ADMM). The
offline pruning iteratively sets the majority of spectral weights to zero,
without using any handcrafted heuristics. Then, we design an optimized pipeline
architecture on FPGA that has efficient random access into the sparse kernels
and exploits various dimensions of parallelism in convolutional layers.
Overall, SPEC2 achieves high inference throughput with extremely low
computation complexity and negligible accuracy degradation. We demonstrate
SPEC2 by pruning and implementing LeNet and VGG16 on the Xilinx Virtex
platform. After pruning 75% of the spectral weights, SPEC2 achieves 0% accuracy
loss for LeNet, and <1% accuracy loss for VGG16. The resulting accelerators
achieve up to 24x higher throughput, compared with the state-of-the-art FPGA
implementations for VGG16.
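
The two ideas at the core of the abstract can be sketched in a few lines of NumPy: once a kernel is moved to the frequency domain, the spatial sliding-window convolution becomes an element-wise product, and ADMM pruning repeatedly projects the spectral kernel onto a sparsity constraint by keeping only its largest-magnitude coefficients. The sketch below is illustrative only; the tile size, kernel size, and 75% pruning ratio are assumptions, and it omits the paper's tiling, overlap handling, and the FPGA pipeline itself.

# Minimal NumPy sketch (not the paper's implementation): frequency-domain
# convolution plus the sparsity projection used inside ADMM-style pruning.
import numpy as np

def spectral_conv2d(x_tile, w_spectral):
    """Circular convolution of an input tile with a kernel already stored in
    the frequency domain: the sliding window becomes an element-wise product."""
    X = np.fft.fft2(x_tile)
    return np.real(np.fft.ifft2(X * w_spectral))

def project_to_sparsity(w_spectral, keep_ratio):
    """Keep only the largest-magnitude spectral coefficients and zero the rest;
    each ADMM iteration pulls the weights toward this projection."""
    k = max(1, int(keep_ratio * w_spectral.size))
    threshold = np.sort(np.abs(w_spectral), axis=None)[-k]
    return np.where(np.abs(w_spectral) >= threshold, w_spectral, 0)

rng = np.random.default_rng(0)
tile = 8                                      # assumed FFT tile size
x_tile = rng.standard_normal((tile, tile))    # one input tile
w = rng.standard_normal((3, 3))               # one 3x3 spatial kernel
w_spec = np.fft.fft2(w, s=(tile, tile))       # kernel converted to the frequency domain

w_spec_pruned = project_to_sparsity(w_spec, keep_ratio=0.25)  # ~75% of coefficients zeroed
y = spectral_conv2d(x_tile, w_spec_pruned)
print("nonzero spectral weights:", np.count_nonzero(w_spec_pruned), "of", w_spec.size)
print("output tile shape:", y.shape)

In SPEC2 itself this projection runs offline during ADMM training; at inference time the FPGA pipeline only reads the surviving nonzero spectral weights through its sparse-access datapath.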
Related papers
- LUTMUL: Exceed Conventional FPGA Roofline Limit by LUT-based Efficient Multiplication for Neural Network Inference [25.342107763021147]
This paper introduces LUTMUL, which harnesses the potential of look-up tables (LUTs) for performing multiplications.
By exploiting this advantage of LUTs, our method demonstrates a potential boost in the performance of FPGA-based neural network accelerators.
arXiv Detail & Related papers (2024-11-01T02:54:11Z)
- Quasar-ViT: Hardware-Oriented Quantization-Aware Architecture Search for Vision Transformers [56.37495946212932]
Vision transformers (ViTs) have demonstrated their superior accuracy for computer vision tasks compared to convolutional neural networks (CNNs).
This work proposes Quasar-ViT, a hardware-oriented quantization-aware architecture search framework for ViTs.
arXiv Detail & Related papers (2024-07-25T16:35:46Z)
- T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z)
- Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge in interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
arXiv Detail & Related papers (2022-09-22T18:57:49Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- Automatic Mapping of the Best-Suited DNN Pruning Schemes for Real-Time Mobile Acceleration [71.80326738527734]
We propose a general, fine-grained structured pruning scheme and corresponding compiler optimizations.
We show that our pruning scheme mapping methods, together with the general fine-grained structured pruning scheme, outperform the state-of-the-art DNN optimization framework.
arXiv Detail & Related papers (2021-11-22T23:53:14Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices a part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- Learning N:M Fine-grained Structured Sparse Neural Networks From Scratch [75.69506249886622]
Sparsity in Deep Neural Networks (DNNs) has been widely studied to compress and accelerate the models on resource-constrained environments.
In this paper, we are the first to study training an N:M fine-grained structured sparse network from scratch.
arXiv Detail & Related papers (2021-02-08T05:55:47Z)
- BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification [3.3711251611130337]
A hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented.
Results show that the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.
arXiv Detail & Related papers (2021-01-07T18:23:48Z)
- A fully pipelined FPGA accelerator for scale invariant feature transform keypoint descriptor matching [0.0]
We design a novel fully pipelined hardware accelerator architecture for SIFT keypoint descriptor matching.
The proposed hardware architecture is able to properly handle the memory bandwidth necessary for a fully-pipelined implementation.
Our hardware implementation is 15.7 times faster than the comparable software approach.
arXiv Detail & Related papers (2020-12-17T15:29:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.