ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution
- URL: http://arxiv.org/abs/2308.15807v1
- Date: Wed, 30 Aug 2023 07:23:32 GMT
- Title: ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution
- Authors: Tun-Hao Yang, and Tian-Sheuan Chang
- Abstract summary: Deep learning-driven superresolution (SR) outperforms traditional techniques but also faces the challenge of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU improves image quality by 0.34dB with a 27-layer model while requiring 36% less complexity than FSRCNN.
- Score: 0.0502254944841629
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning-driven superresolution (SR) outperforms traditional techniques
but also faces the challenge of high complexity and memory bandwidth. This
challenge leads many accelerators to opt for simpler and shallow models like
FSRCNN, compromising performance for real-time needs, especially for
resource-limited edge devices. This paper proposes an energy-efficient SR
accelerator, ACNPU, to tackle this challenge. The ACNPU enhances image quality
by 0.34dB with a 27-layer model, but needs 36% less complexity than FSRCNN,
while maintaining a similar model size, with the decoupled asymmetric
convolution and split-bypass structure. The hardware-friendly 17K-parameter
model enables holistic model fusion instead of localized layer fusion
to remove external DRAM access of intermediate feature maps. The on-chip memory
bandwidth is further reduced with the input stationary flow and
parallel-layer execution to reduce power consumption. Hardware is
regular and easy to control to support different layers by processing
elements (PEs) clusters with reconfigurable input and uniform data flow. The
implementation in the 40 nm CMOS process consumes 2333 K gate counts and 198KB
SRAMs. The ACNPU achieves 31.7 FPS and 124.4 FPS for x2 and x4 scales Full-HD
generation, respectively, which attains 4.75 TOPS/W energy efficiency.
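The two frame rates quoted in the abstract can be compared on a common basis with a little arithmetic. The sketch below is not from the paper: it assumes that "Full-HD" means a 1920x1080 output frame and that the low-resolution input is the output size divided by the scale factor. The function name `input_pixel_rate` is hypothetical.

```python
# Assumption (not stated in the abstract): Full-HD output is 1920x1080 pixels,
# and the low-resolution input is the output size divided by the scale factor.
FULL_HD = (1920, 1080)


def input_pixel_rate(scale, fps, out_size=FULL_HD):
    """Return low-resolution input pixels consumed per second."""
    out_w, out_h = out_size
    lr_w, lr_h = out_w // scale, out_h // scale
    return lr_w * lr_h * fps


# Frame rates as reported in the abstract for x2 and x4 Full-HD generation.
reported_fps = {2: 31.7, 4: 124.4}
rates = {scale: input_pixel_rate(scale, fps) for scale, fps in reported_fps.items()}

for scale, rate in rates.items():
    print(f"x{scale}: {rate / 1e6:.2f} M input pixels/s")
```

Under these assumptions both scales work out to roughly 16 million input pixels per second, which is consistent with throughput being governed mainly by the number of low-resolution input pixels rather than by the output size.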
Related papers
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models [58.3342517278868]
This paper describes the design of Mixed-precision AutoRegressive LINear kernels.
It shows that batch sizes up to 16-32 can be supported with close to maximum ($4\times$) quantization speedup.
MARLIN accomplishes this via a combination of techniques, such as asynchronous memory access, complex task scheduling and pipelining.
arXiv Detail & Related papers (2024-08-21T16:10:41Z) - MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework.
MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions.
Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z) - ParFormer: A Vision Transformer with Parallel Mixer and Sparse Channel Attention Patch Embedding [9.144813021145039]
This paper introduces ParFormer, a vision transformer that incorporates a Parallel Mixer and a Sparse Channel Attention Patch Embedding (SCAPE).
ParFormer improves feature extraction by combining convolutional and attention mechanisms.
For edge device deployment, ParFormer-T excels with a throughput of 278.1 images/sec, which is 1.38$\times$ higher than EdgeNeXt-S.
The larger variant, ParFormer-L, reaches 83.5% Top-1 accuracy, offering a balanced trade-off between accuracy and efficiency.
arXiv Detail & Related papers (2024-03-22T07:32:21Z) - Dynamic Decision Tree Ensembles for Energy-Efficient Inference on IoT
Edge Nodes [12.99136544903102]
Decision tree ensembles, such as Random Forests (RFs) and Gradient Boosting (GBTs), are particularly suited for this task, given their relatively low complexity.
This paper proposes the use of dynamic ensembles that adjust the number of executed trees based both on a latency/energy target and on the complexity of the processed input.
We focus on deploying these algorithms on multi-core low-power IoT devices, designing a tool that automatically converts a Python ensemble into optimized C code.
arXiv Detail & Related papers (2023-06-16T11:59:18Z) - RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on
Edge [1.8293684411977293]
Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit the sparsity to reduce area (storage), power as well as latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z) - BSRA: Block-based Super Resolution Accelerator with Hardware Efficient
Pixel Attention [0.10547353841674209]
This paper proposes a super resolution hardware accelerator with hardware efficient pixel attention.
The final implementation can support full HD image reconstruction at 30 frames per second with TSMC 40nm CMOS process.
arXiv Detail & Related papers (2022-05-02T09:56:29Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average 3712$\times$ speedup with 1301.25$\times$ energy reduction on CPU, and 35.4$\times$ speedup with 17.66$\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - GhostSR: Learning Ghost Features for Efficient Image Super-Resolution [49.393251361038025]
Single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) achieve impressive performance but require huge computational costs.
We propose to use shift operation to generate the redundant features (i.e., Ghost features) of SISR models.
We show that both the non-compact and lightweight SISR models embedded in our proposed module can achieve comparable performance to that of their baselines.
arXiv Detail & Related papers (2021-01-21T10:09:47Z) - MicroNet: Towards Image Recognition with Extremely Low FLOPs [117.96848315180407]
MicroNet is an efficient convolutional neural network using extremely low computational cost.
A family of MicroNets achieve a significant performance gain over the state-of-the-art in the low FLOP regime.
For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
arXiv Detail & Related papers (2020-11-24T18:59:39Z) - PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal
Matrices [35.90103072918056]
Deep neural networks (DNNs) have emerged as the most important and popular artificial intelligence (AI) technique.
The growth of model size poses a key energy efficiency challenge for the underlying computing platform.
This paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models.
arXiv Detail & Related papers (2020-04-23T02:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.