An FPGA-Based Accelerator Enabling Efficient Support for CNNs with
Arbitrary Kernel Sizes
- URL: http://arxiv.org/abs/2402.14307v1
- Date: Thu, 22 Feb 2024 05:52:55 GMT
- Title: An FPGA-Based Accelerator Enabling Efficient Support for CNNs with
Arbitrary Kernel Sizes
- Authors: Miaoxin Wang, Xiao Wu, Jun Lin, Zhongfeng Wang
- Abstract summary: Convolutional neural networks (CNNs) with large kernels have demonstrated impressive performance in various vision-based applications.
An FPGA-based inference accelerator is proposed for the efficient deployment of CNNs with arbitrary kernel sizes.
The proposed hardware accelerator, evaluated on Intel Arria 10 FPGA, achieves up to 3.91 times better DSP efficiency than prior art on the same network.
- Score: 11.681245043617848
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Convolutional neural networks (CNNs) with large kernels, drawing inspiration
from the key operations of vision transformers (ViTs), have demonstrated
impressive performance in various vision-based applications. To address the
issue of computational efficiency degradation in existing designs for
supporting large-kernel convolutions, an FPGA-based inference accelerator is
proposed for the efficient deployment of CNNs with arbitrary kernel sizes.
Firstly, a Z-flow method is presented to optimize the computing data flow by
maximizing data reuse opportunities. In addition, the proposed design, incorporating
the kernel-segmentation (Kseg) scheme, enables extended support for
large-kernel convolutions, significantly reducing the storage requirements for
overlapped data. Moreover, based on the analysis of typical block structures in
emerging CNNs, vertical-fused (VF) and horizontal-fused (HF) methods are
developed to optimize CNN deployments from both computation and transmission
perspectives. The proposed hardware accelerator, evaluated on Intel Arria 10
FPGA, achieves up to 3.91 times better DSP efficiency than prior art on the
same network. Particularly, it demonstrates efficient support for large-kernel
CNNs, achieving throughputs of 169.68 GOPS and 244.55 GOPS for RepLKNet-31 and
PyConvResNet-50, respectively, both of which are implemented on hardware for
the first time.
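As an illustration of the kernel-segmentation idea, the sketch below shows that a large-kernel "valid" correlation can be computed exactly as a sum of shifted partial correlations over smaller kernel segments. This is a plain NumPy sketch with our own function names; the paper's hardware Kseg scheme additionally manages the storage of overlapped data, which this does not model.

```python
import numpy as np

def corr2d_valid(x, k):
    """Reference 'valid' 2D cross-correlation."""
    H, W = x.shape
    kh, kw = k.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def corr2d_kseg(x, k, seg=3):
    """Same result, computed by splitting the kernel rows into segments of
    height <= seg and summing row-shifted partial correlations."""
    kh, kw = k.shape
    oh = x.shape[0] - kh + 1
    ow = x.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for r0 in range(0, kh, seg):
        sub = k[r0:r0 + seg]          # one kernel segment
        part = corr2d_valid(x, sub)   # partial correlation with that segment
        out += part[r0:r0 + oh, :]    # shift rows by the segment's offset
    return out

x = np.arange(100, dtype=float).reshape(10, 10)
k = np.arange(49, dtype=float).reshape(7, 7)
assert np.allclose(corr2d_valid(x, k), corr2d_kseg(x, k, seg=3))
```

Because each segment is small, a fixed-size compute array can process arbitrarily large kernels by iterating over segments, which is what makes the scheme attractive for hardware.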
Related papers
- DCP: Learning Accelerator Dataflow for Neural Network via Propagation [52.06154296196845]
This work proposes an efficient data-centric approach, named Dataflow Code Propagation (DCP), to automatically find the optimal dataflow for DNN layers in seconds without human effort.
DCP learns a neural predictor to efficiently update the dataflow codes towards the desired gradient directions to minimize various optimization objectives.
For example, without using additional training data, DCP surpasses the GAMMA method that performs a full search using thousands of samples.
arXiv Detail & Related papers (2024-10-09T05:16:44Z)
- Reconfigurable Distributed FPGA Cluster Design for Deep Learning Accelerators [59.11160990637615]
We propose a distributed system based on lowpower embedded FPGAs designed for edge computing applications.
The proposed system can simultaneously execute diverse Neural Network (NN) models, arrange the graph in a pipeline structure, and manually allocate greater resources to the most computationally intensive layers of the NN graph.
arXiv Detail & Related papers (2023-05-24T16:08:55Z)
- Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge of interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
arXiv Detail & Related papers (2022-09-22T18:57:49Z)
- LKD-Net: Large Kernel Convolution Network for Single Image Dehazing [70.46392287128307]
We propose a novel Large Kernel Convolution Dehaze Block (LKD Block) consisting of the Decomposition depth-wise Large Kernel Convolution Block (DLKCB) and the Channel Enhanced Feed-forward Network (CEFN).
The designed DLKCB can split the depth-wise large kernel convolution into a smaller depth-wise convolution and a depth-wise dilated convolution without introducing massive parameters and computational overhead.
Our LKD-Net dramatically outperforms the Transformer-based method Dehamer with only 1.79% of its parameters and 48.9% of its FLOPs.
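The savings from this kind of decomposition can be checked with simple parameter counting. The sketch below uses illustrative kernel sizes of our own choosing (not the ones from the LKD-Net paper): a 3x3 depth-wise convolution followed by a 5x5 depth-wise convolution with dilation 3 covers a 15x15 receptive field at a small fraction of the parameters of a single 15x15 depth-wise convolution.

```python
def dw_params(k, c):
    # Depth-wise k x k convolution: one k x k filter per channel.
    return c * k * k

def receptive_field(k_small, k_dil, dilation):
    # Small conv followed by a dilated conv: RF = k_small + (k_dil - 1) * dilation.
    return k_small + (k_dil - 1) * dilation

C = 64                                    # illustrative channel count
rf = receptive_field(3, 5, 3)             # 3 + 4*3 = 15
big = dw_params(15, C)                    # single 15x15 depth-wise conv: 14400
split = dw_params(3, C) + dw_params(5, C) # 3x3 + 5x5 depth-wise convs: 2176
print(rf, big, split, round(big / split, 1))
```

Note the decomposition matches the receptive field, not the exact function of the large kernel; it is an architectural substitution that is then trained end to end.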
arXiv Detail & Related papers (2022-09-05T06:56:48Z)
- FFCNN: Fast FPGA-based Acceleration for Convolutional Neural Network Inference [0.0]
We present Fast Inference on FPGAs for Convolutional Neural Networks (FFCNN).
FFCNN is based on a deeply pipelined OpenCL kernel architecture.
Data reuse and task mapping techniques are also presented to improve design efficiency.
arXiv Detail & Related papers (2022-08-28T16:55:25Z)
- Lightweight and Progressively-Scalable Networks for Semantic Segmentation [100.63114424262234]
Multi-scale learning frameworks have been regarded as a capable class of models to boost semantic segmentation.
In this paper, we thoroughly analyze the design of convolutional blocks and the ways of interactions across multiple scales.
We devise Lightweight and Progressively-Scalable Networks (LPS-Net), which expand network complexity in a greedy manner.
arXiv Detail & Related papers (2022-07-27T16:00:28Z)
- An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete actions (SAC-d), which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments such as dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
- A Design Flow for Mapping Spiking Neural Networks to Many-Core Neuromorphic Hardware [4.527975416669432]
Many-core neuromorphic hardware is expected to execute large machine learning models.
To deal with the design complexity, a predictable design flow is needed to guarantee real-time performance.
We propose an SDFG-based design flow for mapping spiking neural networks to many-core neuromorphic hardware.
arXiv Detail & Related papers (2021-08-27T18:08:08Z)
- WinoCNN: Kernel Sharing Winograd Systolic Array for Efficient Convolutional Neural Network Acceleration on FPGAs [8.73707548868892]
We are the first to propose an optimized Winograd processing element (WinoPE).
We construct a highly efficient systolic array accelerator, termed WinoCNN.
We implement our proposed accelerator on multiple FPGAs, which outperforms the state-of-the-art designs in terms of both throughput and DSP efficiency.
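The abstract does not spell out the WinoPE design, but accelerators in this family build on the standard Winograd minimal-filtering algorithm. The sketch below shows the textbook F(2,3) case: two outputs of a 3-tap correlation computed with 4 element-wise multiplies instead of the direct method's 6, which is where the DSP efficiency comes from.

```python
import numpy as np

# Standard Winograd F(2,3) transform matrices.
Bt = np.array([[1, 0, -1, 0],
               [0, 1,  1, 0],
               [0, -1, 1, 0],
               [0, 1,  0, -1]], dtype=float)
G = np.array([[1.0, 0.0, 0.0],
              [0.5, 0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0, 0.0, 1.0]])
At = np.array([[1, 1, 1, 0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    """d: 4 input samples, g: 3 filter taps -> 2 correlation outputs."""
    m = (G @ g) * (Bt @ d)   # only 4 element-wise multiplies
    return At @ m

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 0.5, -1.0])
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```

The "kernel sharing" contribution of WinoCNN concerns supporting multiple kernel sizes on the same array, which this minimal sketch does not attempt to reproduce.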
arXiv Detail & Related papers (2021-07-09T06:37:47Z)
- Multi-Exit Semantic Segmentation Networks [78.44441236864057]
We propose a framework for converting state-of-the-art segmentation models to MESS networks: specially trained CNNs that employ parametrised early exits along their depth to save computation during inference on easier samples.
We co-optimise the number, placement and architecture of the attached segmentation heads, along with the exit policy, to adapt to the device capabilities and application-specific requirements.
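The control flow behind early-exit inference can be sketched in a few lines. This is a generic confidence-thresholded sketch, not the paper's actual exit policy; the stages, heads, and threshold below are all illustrative stand-ins.

```python
def run_with_early_exits(x, stages, heads, threshold=0.9):
    """Run stages in order; after each attached head, stop early if the
    prediction confidence clears the threshold."""
    probs = None
    for stage, head in zip(stages, heads):
        x = stage(x)
        if head is not None:
            probs = head(x)
            if max(probs) >= threshold:
                break   # easy sample: skip the remaining stages
    return probs

# Toy pipeline: each "stage" doubles a scalar feature, each "head" maps it
# to a two-class distribution (purely illustrative).
stages = [lambda v: v * 2, lambda v: v * 2]
heads = [lambda v: [v / (v + 1), 1 / (v + 1)]] * 2
print(run_with_early_exits(1.0, stages, heads, threshold=0.9))
```

Co-optimising the number and placement of heads, as the paper does, amounts to searching over which entries of `heads` are `None` and how each head is parametrised.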
arXiv Detail & Related papers (2021-06-07T11:37:03Z)
- FPGA deep learning acceleration based on convolutional neural network [0.0]
This paper proposes a convolutional neural network hardware accelerator based on a field-programmable gate array (FPGA).
The energy efficiency ratio of the accelerator proposed in this article reaches 32.73 GOPS/W, which is 34% higher than the existing solution, and the performance reaches 317.86 GOPS.
arXiv Detail & Related papers (2020-11-17T16:20:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.