Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights
Generation
- URL: http://arxiv.org/abs/2307.13412v1
- Date: Tue, 25 Jul 2023 11:19:21 GMT
- Title: Mitigating Memory Wall Effects in CNN Engines with On-the-Fly Weights
Generation
- Authors: Stylianos I. Venieris, Javier Fernandez-Marques, Nicholas D. Lane
- Abstract summary: unzipFPGA is a novel CNN inference system that counteracts the limitations of existing CNN engines.
We introduce a weights generator module that enables the on-chip on-the-fly generation of weights.
We further enhance unzipFPGA with an automated hardware-aware methodology that tailors the weights generation mechanism to the target CNN-device pair.
- Score: 13.681095158525514
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The unprecedented accuracy of convolutional neural networks (CNNs) across a
broad range of AI tasks has led to their widespread deployment in mobile and
embedded settings. In a pursuit for high-performance and energy-efficient
inference, significant research effort has been invested in the design of
FPGA-based CNN accelerators. In this context, single computation engines
constitute a popular approach to support diverse CNN modes without the overhead
of fabric reconfiguration. Nevertheless, this flexibility often comes with
significantly degraded performance on memory-bound layers and resource
underutilisation due to the suboptimal mapping of certain layers on the
engine's fixed configuration. In this work, we investigate the implications in
terms of CNN engine design for a class of models that introduce a
pre-convolution stage to decompress the weights at run time. We refer to these
approaches as on-the-fly. This paper presents unzipFPGA, a novel CNN inference
system that counteracts the limitations of existing CNN engines. The proposed
framework comprises a novel CNN hardware architecture that introduces a weights
generator module that enables the on-chip on-the-fly generation of weights,
alleviating the negative impact of limited bandwidth on memory-bound layers. We
further enhance unzipFPGA with an automated hardware-aware methodology that
tailors the weights generation mechanism to the target CNN-device pair, leading
to an improved accuracy-performance balance. Finally, we introduce an input
selective processing element (PE) design that balances the load between PEs in
suboptimally mapped layers. The proposed framework yields hardware designs that
achieve an average of 2.57x performance efficiency gain over highly optimised
GPU designs for the same power constraints and up to 3.94x higher performance
density over a diverse range of state-of-the-art FPGA-based CNN accelerators.
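The core idea of the weights generator can be sketched minimally. The snippet below assumes a low-rank factorisation as the weight-compression scheme purely for illustration; the paper's actual on-the-fly generation mechanism may differ. The point is that only the compact representation crosses the memory boundary, and full weights are expanded just before the compute engine consumes them.

```python
import numpy as np

# Hypothetical sketch: instead of streaming full convolution weights from
# off-chip memory, store a compact representation (here, low-rank factors)
# and expand it on the fly, per layer, next to the PE array.

def compress_weights(w, rank):
    """Factor a (out_ch, in_ch*k*k) weight matrix into two low-rank factors."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    return u[:, :rank] * s[:rank], vt[:rank, :]

def generate_weights(a, b):
    """On-the-fly expansion: reconstruct weights just before they are used."""
    return a @ b

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 3 * 3 * 32))   # 64 output channels, 3x3x32 kernels
a, b = compress_weights(w, rank=8)
w_hat = generate_weights(a, b)

# Memory traffic per layer load drops from w.size values to a.size + b.size.
full = w.size
compact = a.size + b.size
print(compact < full)  # True
```

For memory-bound layers, the bandwidth saving is the ratio `full / compact`; the trade-off is the extra on-chip compute and any accuracy loss from the lossy compression, which is what the hardware-aware tuning methodology balances.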
Related papers
- Task-Oriented Real-time Visual Inference for IoVT Systems: A Co-design Framework of Neural Networks and Edge Deployment [61.20689382879937]
Task-oriented edge computing addresses these constraints by shifting data analysis to the edge.
Existing methods struggle to balance high model performance with low resource consumption.
We propose a novel co-design framework to optimize neural network architecture.
arXiv Detail & Related papers (2024-10-29T19:02:54Z) - Dynamic Semantic Compression for CNN Inference in Multi-access Edge
Computing: A Graph Reinforcement Learning-based Autoencoder [82.8833476520429]
We propose a novel semantic compression method, autoencoder-based CNN architecture (AECNN) for effective semantic extraction and compression in partial offloading.
In the semantic encoder, we introduce a feature compression module based on the channel attention mechanism in CNNs, to compress intermediate data by selecting the most informative features.
In the semantic decoder, we design a lightweight decoder to reconstruct the intermediate data through learning from the received compressed data to improve accuracy.
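The channel-attention-based compression described above can be illustrated with a small sketch. All names here are hypothetical stand-ins (the scoring uses a simple global-average-of-magnitudes proxy rather than AECNN's learned attention), but the flow is the same: score channels, keep the most informative ones, and reconstruct on the receiving side.

```python
import numpy as np

# Hedged sketch of channel-attention feature compression: score each channel
# of an intermediate feature map, transmit only the top-k channels plus their
# indices, and zero-fill the rest at the decoder.

def channel_scores(feat):
    """feat: (C, H, W) feature map -> per-channel importance proxy."""
    return np.abs(feat).mean(axis=(1, 2))

def compress(feat, k):
    keep = np.argsort(channel_scores(feat))[-k:]  # k most informative channels
    return feat[keep], keep

def decompress(compressed, keep, shape):
    """Lightweight decoder stand-in: zero-fill the pruned channels."""
    full = np.zeros(shape, dtype=compressed.dtype)
    full[keep] = compressed
    return full

feat = np.random.default_rng(1).standard_normal((32, 8, 8))
small, idx = compress(feat, k=8)
recon = decompress(small, idx, feat.shape)
print(small.shape)  # (8, 8, 8)
```

Here the transmitted payload shrinks from 32 channels to 8 plus the index list; a real semantic decoder would learn to reconstruct the missing channels rather than zero-fill them.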
arXiv Detail & Related papers (2024-01-19T15:19:47Z) - HARFLOW3D: A Latency-Oriented 3D-CNN Accelerator Toolflow for HAR on
FPGA Devices [71.45672882756001]
This study introduces a novel streaming architecture based toolflow for mapping 3D Convolutional Neural Networks onto FPGAs.
The HARFLOW3D toolflow takes as input a 3D CNN in ONNX format and a description of the FPGA characteristics.
The ability of the toolflow to support a broad range of models and devices is shown through a number of experiments on various 3D CNN and FPGA system pairs.
arXiv Detail & Related papers (2023-03-30T08:25:27Z) - Optimization of FPGA-based CNN Accelerators Using Metaheuristics [1.854931308524932]
Convolutional neural networks (CNNs) have demonstrated their ability to solve problems in many fields.
FPGAs have seen a surge in interest for accelerating CNN inference.
The current trend in FPGA-based CNN accelerators is to implement multiple convolutional layer processors (CLPs).
arXiv Detail & Related papers (2022-09-22T18:57:49Z) - Photonic Reconfigurable Accelerators for Efficient Inference of CNNs
with Mixed-Sized Tensors [0.22843885788439797]
Photonic Microring Resonator (MRR) based hardware accelerators have been shown to provide disruptive speedup and energy-efficiency improvements.
Previous MRR-based CNN accelerators fail to provide efficient adaptability for CNNs with mixed-sized tensors.
We present a novel way of introducing reconfigurability in the MRR-based CNN accelerators.
arXiv Detail & Related papers (2022-07-12T03:18:00Z) - Towards Enabling Dynamic Convolution Neural Network Inference for Edge
Intelligence [0.0]
Recent advances in edge intelligence require CNN inference on edge network to increase throughput and reduce latency.
To provide flexibility, dynamic parameter allocation to different mobile devices is required to implement either a predefined or an on-the-fly defined CNN architecture.
We propose a library-based approach to design scalable and dynamic distributed CNN inference on the fly.
arXiv Detail & Related papers (2022-02-18T22:33:42Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around the 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy, compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - An Adaptive Device-Edge Co-Inference Framework Based on Soft
Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculations restrict execution efficiency, especially for Internet of Things (IoT) devices.
We propose a new Deep Reinforcement Learning (DRL) Soft Actor-Critic for discrete (SAC-d) method, which generates the exit point and compressing bits by soft policy iterations.
Based on the latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z) - unzipFPGA: Enhancing FPGA-based CNN Engines with On-the-Fly Weights
Generation [17.142094527372993]
Single computation engines have become a popular design choice for FPGA-based convolutional neural networks (CNNs).
In this work, we investigate the implications in terms of CNN engine design for a class of models that introduce a pre-convolution stage to decompress the weights at run time.
To minimise the negative impact of limited bandwidth on memory-bound layers, we present a novel hardware component that enables the on-the-fly generation of weights.
arXiv Detail & Related papers (2021-03-09T18:19:41Z) - Evolutionary Bin Packing for Memory-Efficient Dataflow Inference
Acceleration on FPGA [2.3395728784538767]
Convolutional neural network (CNN) dataflow inference accelerators implemented in Field Programmable Gate Arrays (FPGAs) have demonstrated increased energy efficiency and lower latency.
However, the complex shapes of CNN parameter memories do not typically map well to FPGA on-chip memories (OCM).
We present a design methodology that improves the mapping efficiency of CNN parameters to FPGA OCM.
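Mapping mixed-size CNN parameter buffers onto fixed-size on-chip memory blocks is a bin-packing problem. The sketch below uses the simple first-fit-decreasing heuristic as a baseline illustration, not the paper's evolutionary algorithm; buffer sizes and block capacity are made-up example values.

```python
# Hedged illustration: pack CNN parameter buffers (sizes in KiB, hypothetical)
# into the fewest fixed-capacity OCM blocks using first-fit decreasing.

def first_fit_decreasing(buffer_sizes, block_capacity):
    """Return (buffer -> block assignment, number of blocks used)."""
    bins = []        # remaining free capacity per allocated block
    assignment = {}
    for idx, size in sorted(enumerate(buffer_sizes), key=lambda t: -t[1]):
        for b, free in enumerate(bins):
            if free >= size:          # first block with enough room
                bins[b] -= size
                assignment[idx] = b
                break
        else:                          # no block fits: open a new one
            bins.append(block_capacity - size)
            assignment[idx] = len(bins) - 1
    return assignment, len(bins)

# Six layer buffers of mixed sizes vs 36 KiB BRAM-style blocks.
assignment, n_blocks = first_fit_decreasing([20, 18, 15, 9, 7, 3], 36)
print(n_blocks)  # 3
```

A smarter packer (such as the evolutionary approach in the paper) can close the gap between heuristic results like this and the lower bound of `ceil(total_size / block_capacity)` blocks.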
arXiv Detail & Related papers (2020-03-24T09:55:08Z) - PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with
Pattern-based Weight Pruning [57.20262984116752]
We introduce a new dimension, fine-grained pruning patterns inside the coarse-grained structures, revealing a previously unknown point in design space.
With the higher accuracy enabled by fine-grained pruning patterns, the unique insight is to use the compiler to re-gain and guarantee high hardware efficiency.
arXiv Detail & Related papers (2020-01-01T04:52:07Z)
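The pattern-based pruning idea above can be sketched as follows. The three-mask pattern library here is hypothetical (PatDNN derives its own pattern set); the mechanism shown is the general one: each 3x3 kernel keeps only the entries of the fixed mask that preserves the most weight magnitude, producing regular sparsity a compiler can exploit.

```python
import numpy as np

# Illustrative pattern-based pruning: every kernel is assigned the fixed
# 5-entry mask (from a small library) that retains the most L1 magnitude.

PATTERNS = [
    np.array([[0, 1, 0], [1, 1, 1], [0, 1, 0]]),  # cross, 5 weights kept
    np.array([[1, 0, 1], [0, 1, 0], [1, 0, 1]]),  # X, 5 weights kept
    np.array([[0, 0, 0], [1, 1, 1], [0, 1, 1]]),  # hook, 5 weights kept
]

def best_pattern(kernel):
    """Choose the mask preserving the largest L1 magnitude of the kernel."""
    scores = [np.abs(kernel * p).sum() for p in PATTERNS]
    return PATTERNS[int(np.argmax(scores))]

def pattern_prune(weights):
    """weights: (out_ch, in_ch, 3, 3) -> copy with each kernel pattern-pruned."""
    pruned = weights.copy()
    for o in range(weights.shape[0]):
        for i in range(weights.shape[1]):
            pruned[o, i] *= best_pattern(weights[o, i])
    return pruned

w = np.random.default_rng(2).standard_normal((4, 3, 3, 3))
pw = pattern_prune(w)
```

Because every surviving kernel has the same number of weights arranged in one of a few known shapes, the compiler can generate dense, regular inner loops instead of handling arbitrary sparsity.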
This list is automatically generated from the titles and abstracts of the papers in this site.