ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution
- URL: http://arxiv.org/abs/2308.15807v1
- Date: Wed, 30 Aug 2023 07:23:32 GMT
- Title: ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution
- Authors: Tun-Hao Yang and Tian-Sheuan Chang
- Abstract summary: Deep learning-driven super-resolution (SR) outperforms traditional techniques but also faces the challenge of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU enhances image quality by 0.34 dB with a 27-layer model while requiring 36% less complexity than FSRCNN.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning-driven super-resolution (SR) outperforms traditional techniques
but also faces the challenge of high complexity and memory bandwidth. This
challenge leads many accelerators to opt for simpler, shallow models like
FSRCNN, compromising performance for real-time needs, especially on
resource-limited edge devices. This paper proposes an energy-efficient SR
accelerator, ACNPU, to tackle this challenge. With its decoupled asymmetric
convolution and split-bypass structure, the ACNPU enhances image quality by
0.34 dB with a 27-layer model while requiring 36% less complexity than FSRCNN
and maintaining a similar model size. The hardware-friendly 17K-parameter
model enables holistic model fusion, instead of localized layer fusion, to
remove external DRAM access for intermediate feature maps. The on-chip memory
bandwidth is further reduced with an input-stationary flow and parallel-layer
execution to cut power consumption. The hardware is regular and easy to
control, supporting different layers through processing element (PE) clusters
with reconfigurable inputs and a uniform data flow. The implementation in a
40 nm CMOS process requires 2333K gates and 198 KB of SRAM. The ACNPU achieves
31.7 FPS and 124.4 FPS for x2 and x4 scale Full-HD generation, respectively,
attaining 4.75 TOPS/W energy efficiency.
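The complexity saving behind asymmetric convolution can be illustrated with a simple parameter count: factoring a k x k convolution into a 1 x k followed by a k x 1 convolution covers the same receptive field with 2k instead of k^2 weights per channel pair. The sketch below is only an illustration of that factoring; ACNPU's actual decoupled variant with its split-bypass structure differs in detail.

```python
def conv_params(c_in, c_out, kh, kw):
    """Weight count of one convolution layer (biases ignored)."""
    return c_in * c_out * kh * kw

def standard_3x3(c_in, c_out):
    # Plain 3x3 convolution: 9 weights per (input, output) channel pair.
    return conv_params(c_in, c_out, 3, 3)

def asymmetric_3x3(c_in, c_mid, c_out):
    # Same receptive field factored into 1x3 then 3x1: 6 weights per
    # channel pair when the intermediate width matches the others.
    return conv_params(c_in, c_mid, 1, 3) + conv_params(c_mid, c_out, 3, 1)

# With 32 channels throughout: 9216 vs 6144 weights, a 33% reduction.
# (The paper's 36% complexity figure also reflects its split-bypass design.)
print(standard_3x3(32, 32), asymmetric_3x3(32, 32, 32))
```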
Related papers
- SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts [9.94373711477696]
Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications.
The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall.
Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving.
arXiv Detail & Related papers (2024-05-13T07:32:45Z)
- Transforming Image Super-Resolution: A ConvFormer-based Efficient Approach [63.98380888730723]
We introduce the Convolutional Transformer layer (ConvFormer) and the ConvFormer-based Super-Resolution network (CFSR).
CFSR efficiently models long-range dependencies and extensive receptive fields with a slight computational cost.
It achieves 0.39 dB gains on Urban100 dataset for x2 SR task while containing 26% and 31% fewer parameters and FLOPs, respectively.
arXiv Detail & Related papers (2024-01-11T03:08:00Z)
- Dynamic Decision Tree Ensembles for Energy-Efficient Inference on IoT Edge Nodes [12.99136544903102]
Decision tree ensembles, such as Random Forests (RFs) and Gradient Boosted Trees (GBTs), are particularly suited for this task, given their relatively low complexity.
This paper proposes the use of dynamic ensembles that adjust the number of executed trees based both on a latency/energy target and on the complexity of the processed input.
We focus on deploying these algorithms on multi-core low-power IoT devices, designing a tool that automatically converts a Python ensemble into optimized C code.
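The dynamic-ensemble idea can be sketched as a majority-vote early exit: trees are evaluated one at a time, and evaluation stops as soon as the remaining trees can no longer change the winner. This is an illustrative sketch under that simple stopping rule, not the paper's actual tool, which also adapts to latency/energy targets.

```python
from collections import Counter

def dynamic_predict(trees, x):
    """Evaluate trees sequentially; stop once the current leader cannot
    be overtaken by the votes still outstanding (majority-vote early exit).

    `trees` is a list of callables returning a class label -- a
    hypothetical interface used only for illustration.
    Returns (prediction, number_of_trees_actually_run).
    """
    votes = Counter()
    n = len(trees)
    for i, tree in enumerate(trees, start=1):
        votes[tree(x)] += 1
        ranked = votes.most_common(2)
        lead = ranked[0][1] - (ranked[1][1] if len(ranked) > 1 else 0)
        if lead > n - i:  # remaining trees cannot flip the result
            return ranked[0][0], i
    return votes.most_common(1)[0][0], n
```

With seven trees of which the first five vote for class 1, the loop stops after four trees, since a four-vote lead cannot be overturned by the three trees left.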
arXiv Detail & Related papers (2023-06-16T11:59:18Z)
- RAMAN: A Re-configurable and Sparse tinyML Accelerator for Inference on Edge [1.8293684411977293]
Deep Neural Network (DNN) based inference at the edge is challenging as these compute and data-intensive algorithms need to be implemented at low cost and low power.
We present RAMAN, a Re-configurable and spArse tinyML Accelerator for infereNce on edge, architected to exploit the sparsity to reduce area (storage), power as well as latency.
arXiv Detail & Related papers (2023-06-10T17:25:58Z)
- Adaptable Butterfly Accelerator for Attention-based NNs via Hardware and Algorithm Co-design [66.39546326221176]
Attention-based neural networks have become pervasive in many AI tasks.
The use of the attention mechanism and feed-forward network (FFN) demands excessive computational and memory resources.
This paper proposes a hardware-friendly variant that adopts a unified butterfly sparsity pattern to approximate both the attention mechanism and the FFNs.
arXiv Detail & Related papers (2022-09-20T09:28:26Z)
- BSRA: Block-based Super Resolution Accelerator with Hardware Efficient Pixel Attention [0.10547353841674209]
This paper proposes a super resolution hardware accelerator with hardware efficient pixel attention.
The final implementation can support full HD image reconstruction at 30 frames per second with TSMC 40nm CMOS process.
arXiv Detail & Related papers (2022-05-02T09:56:29Z)
- DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers [105.74546828182834]
We show a hardware-efficient dynamic inference regime, named dynamic weight slicing, which adaptively slices part of the network parameters for inputs with diverse difficulty levels.
We present dynamic slimmable network (DS-Net) and dynamic slice-able network (DS-Net++) by input-dependently adjusting filter numbers of CNNs and multiple dimensions in both CNNs and transformers.
arXiv Detail & Related papers (2021-09-21T09:57:21Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average 3712x speedup with 1301.25x energy reduction over CPU, and 35.4x speedup with 17.66x energy reduction over GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- GhostSR: Learning Ghost Features for Efficient Image Super-Resolution [49.393251361038025]
Single image super-resolution (SISR) systems based on convolutional neural networks (CNNs) achieve impressive performance but require huge computational costs.
We propose to use shift operation to generate the redundant features (i.e., Ghost features) of SISR models.
We show that both the non-compact and lightweight SISR models embedded in our proposed module can achieve comparable performance to that of their baselines.
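The shift operation behind ghost features can be sketched in NumPy: a zero-padded spatial shift produces a displaced copy of an intrinsic feature map at essentially no compute cost, standing in for a convolution. The interface below is purely illustrative; GhostSR learns the shift directions during training rather than fixing them.

```python
import numpy as np

def ghost_features(intrinsic, shifts):
    """Generate 'ghost' feature maps as zero-padded shifts of intrinsic ones.

    intrinsic: array of shape (C, H, W); shifts: list of (dy, dx) offsets,
    one per ghost map (hypothetical interface, for illustration only).
    Returns the intrinsic maps stacked with their ghost copies.
    """
    ghosts = []
    for c, (dy, dx) in enumerate(shifts):
        fm = intrinsic[c % intrinsic.shape[0]]
        g = np.zeros_like(fm)
        h, w = fm.shape
        # Copy the region that remains in-bounds after shifting by (dy, dx);
        # everything shifted off the edge is zero-padded.
        g[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
            fm[max(-dy, 0):h - max(dy, 0), max(-dx, 0):w - max(dx, 0)]
        ghosts.append(g)
    return np.stack([*intrinsic, *ghosts])
```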
arXiv Detail & Related papers (2021-01-21T10:09:47Z)
- MicroNet: Towards Image Recognition with Extremely Low FLOPs [117.96848315180407]
MicroNet is an efficient convolutional neural network with extremely low computational cost.
A family of MicroNets achieve a significant performance gain over the state-of-the-art in the low FLOP regime.
For instance, MicroNet-M1 achieves 61.1% top-1 accuracy on ImageNet classification with 12 MFLOPs, outperforming MobileNetV3 by 11.3%.
arXiv Detail & Related papers (2020-11-24T18:59:39Z)
- PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices [35.90103072918056]
The deep neural network (DNN) has emerged as the most important and popular artificial intelligence (AI) technique.
The growth of model size poses a key energy efficiency challenge for the underlying computing platform.
This paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models.
arXiv Detail & Related papers (2020-04-23T02:26:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.