ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution
- URL: http://arxiv.org/abs/2308.15807v1
- Date: Wed, 30 Aug 2023 07:23:32 GMT
- Title: ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution
- Authors: Tun-Hao Yang and Tian-Sheuan Chang
- Abstract summary: Deep learning-driven super-resolution (SR) outperforms traditional techniques but also faces the challenge of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU enhances image quality by 0.34 dB with a 27-layer model while requiring 36% less complexity than FSRCNN.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning-driven super-resolution (SR) outperforms traditional techniques
but also faces the challenge of high complexity and memory bandwidth. This
challenge leads many accelerators to opt for simpler, shallow models like
FSRCNN, compromising performance for real-time needs, especially on
resource-limited edge devices. This paper proposes an energy-efficient SR
accelerator, ACNPU, to tackle this challenge. With its decoupled asymmetric
convolution and split-bypass structure, the ACNPU enhances image quality by
0.34 dB with a 27-layer model while requiring 36% less complexity than FSRCNN
and maintaining a similar model size. The hardware-friendly 17K-parameter
model enables holistic model fusion, instead of localized layer fusion, to
remove external DRAM access for intermediate feature maps. The on-chip memory
bandwidth is further reduced with an input-stationary flow and parallel-layer
execution to cut power consumption. The hardware is regular and easy to
control, supporting different layers through processing element (PE) clusters
with reconfigurable inputs and a uniform data flow. The implementation in a
40 nm CMOS process requires 2333K gates and 198 KB of SRAM. The ACNPU achieves
31.7 FPS and 124.4 FPS for x2 and x4 scale Full-HD generation, respectively,
attaining 4.75 TOPS/W energy efficiency.
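The complexity saving behind asymmetric convolution can be illustrated with a simple parameter count: factoring a k x k convolution into a 1 x k followed by a k x 1 convolution covers the same receptive field with 2k instead of k^2 weights per channel pair. The sketch below is only an illustration of that factoring; ACNPU's actual decoupled variant with its split-bypass structure differs in detail.

```python
def conv_params(c_in, c_out, kh, kw):
    """Weight count of one convolution layer (biases ignored)."""
    return c_in * c_out * kh * kw

def standard_3x3(c_in, c_out):
    # Plain 3x3 convolution: 9 weights per (input, output) channel pair.
    return conv_params(c_in, c_out, 3, 3)

def asymmetric_3x3(c_in, c_mid, c_out):
    # Same receptive field factored into 1x3 then 3x1: 6 weights per
    # channel pair when the intermediate width matches the others.
    return conv_params(c_in, c_mid, 1, 3) + conv_params(c_mid, c_out, 3, 1)

# With 32 channels throughout: 9216 vs 6144 weights, a 33% reduction.
# (The paper's 36% complexity figure also reflects its split-bypass design.)
print(standard_3x3(32, 32), asymmetric_3x3(32, 32, 32))
```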
Related papers
- SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts [9.94373711477696]
Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications.
The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall.
Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving.
arXiv Detail & Related papers (2024-05-13T07:32:45Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [63.98380888730723]
We introduce the Convolutional Transformer layer (ConvFormer) and the ConvFormer-based Super-Resolution network (CFSR).
CFSR efficiently models long-range dependencies and extensive receptive fields with a slight computational cost.
It achieves 0.39 dB gains on Urban100 dataset for x2 SR task while containing 26% and 31% fewer parameters and FLOPs, respectively.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Dynamic Decision Tree Ensembles for Energy-Efficient Inference on IoT Edge Nodes [12.99136544903102]
Decision tree ensembles, such as Random Forests (RFs) and Gradient Boosted Trees (GBTs), are particularly suited for this task, given their relatively low complexity.
This paper proposes the use of dynamic ensembles that adjust the number of executed trees based both on a latency/energy target and on the complexity of the processed input.
We focus on deploying these algorithms on multi-core low-power IoT devices, designing a tool that automatically converts a Python ensemble into optimized C code.
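The dynamic-ensemble idea can be sketched as a majority-vote early exit: trees are evaluated one at a time, and evaluation stops as soon as the remaining trees can no longer change the winner. This is an illustrative sketch under that simple stopping rule, not the paper's actual tool, which also adapts to latency/energy targets.

```python
from collections import Counter

def dynamic_predict(trees, x):
    """Evaluate trees sequentially; stop once the current leader cannot
    be overtaken by the votes still outstanding (majority-vote early exit).

    `trees` is a list of callables returning a class label -- a
    hypothetical interface used only for illustration.
    Returns (prediction, number_of_trees_actually_run).
    """
    votes = Counter()
    n = len(trees)
    for i, tree in enumerate(trees, start=1):
        votes[tree(x)] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead > n - i:  # remaining trees cannot flip the result
            return ranked[0][0], i
    return votes.most_common(1)[0][0], n
```

With seven trees of which the first five vote for class 1, the loop stops after four trees, since a four-vote lead cannot be overturned by the three trees left.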
arXiv Detail & Related papers (2023-06-16T11:59:18Z)
- RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge [1.8293684411977293]
Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit the sparsity to reduce area (storage), power as well as latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- BSRA: Block-based Super Resolution Accelerator with Hardware Efficient Pixel Attention [0.10547353841674209]
This paper proposes a super resolution hardware accelerator with hardware efficient pixel attention.
The final implementation can support full HD image reconstruction at 30 frames per second with TSMC 40nm CMOS process.
arXiv Detail & Related papers (2022-05-02T09:56:29Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average 3712x speedup with 1301.25x energy reduction over CPU, and 35.4x speedup with 17.66x energy reduction over GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- GhostSR: Learning Ghost Features for Efficient Image Super-Resolution [49.393251361038025]
Single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) achieve impressive performance but require huge computational costs.
We propose to use shift operation to generate the redundant features (i.e., Ghost features) of SISR models.
We show that both the non-compact and lightweight SISR models embedded in our proposed module can achieve comparable performance to that of their baselines.
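The shift operation behind ghost features can be sketched in NumPy: a zero-padded spatial shift produces a displaced copy of an intrinsic feature map at essentially no compute cost, standing in for a convolution. The interface below is purely illustrative; GhostSR learns the shift directions during training rather than fixing them.

```python
import numpy as np

def ghost_features(intrinsic, shifts):
    """Generate 'ghost' feature maps as zero-padded shifts of intrinsic ones.

    intrinsic: array of shape (C, H, W); shifts: list of (dy, dx) offsets,
    one per ghost map (hypothetical interface, for illustration only).
    Returns the intrinsic maps stacked with their ghost copies.
    """
    ghosts = []
    for c, (dy, dx) in enumerate(shifts):
        fm = intrinsic[c % intrinsic.shape[0]]
        g = np.zeros_like(fm)
        h, w = fm.shape
        # Copy the region that remains in-bounds after shifting by (dy, dx);
        # everything shifted off the edge is zero-padded.
        g[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
            fm[max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)]
        ghosts.append(g)
    return np.stack([*intrinsic, *ghosts])
```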
arXiv Detail & Related papers (2021-01-21T10:09:47Z)
- MicroNet: Towards Image Recognition with Extremely Low FLOPs [117.96848315180407]
MicroNet is an efficient convolutional neural network with extremely low computational cost.
A family of MicroNets achieve a significant performance gain over the state-of-the-art in the low FLOP regime.
For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
arXiv Detail & Related papers (2020-11-24T18:59:39Z)
- PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices [35.90103072918056]
The deep neural network (DNN) has emerged as the most important and popular artificial intelligence (AI) technique.
The growth of model size poses a key energy efficiency challenge for the underlying computing platform.
This paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models.
arXiv Detail & Related papers (2020-04-23T02:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.