A Real Time Super Resolution Accelerator with Tilted Layer Fusion
- URL: http://arxiv.org/abs/2205.03997v1
- Date: Mon, 9 May 2022 01:47:02 GMT
- Title: A Real Time Super Resolution Accelerator with Tilted Layer Fusion
- Authors: An-Jung Huang, Kai-Chieh Hsu and Tian-Sheuan Chang
- Abstract summary: This paper proposes a real-time hardware accelerator with the tilted layer fusion method that reduces the external DRAM bandwidth by 92% and just needs 102KB on-chip memory.
The design implemented with a 40nm CMOS process achieves 1920x1080@60fps throughput with 544.3K gate count when running at 600MHz; it has higher throughput and lower area cost than previous designs.
- Score: 0.10547353841674209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep-learning-based super-resolution achieves high-quality results, but its heavy computational workload, large buffer requirements, and high external memory bandwidth inhibit its use in mobile devices. To solve these issues, this paper proposes a real-time hardware accelerator with the tilted layer fusion method that reduces external DRAM bandwidth by 92% and needs only 102KB of on-chip memory. The design, implemented in a 40nm CMOS process, achieves 1920x1080@60fps throughput with a 544.3K gate count when running at 600MHz; it has higher throughput and lower area cost than previous designs.
Related papers
- Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference [47.043257902725294]
We propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights into non-zero values with a high compression ratio and low decompression overhead.
Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x.
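Endor's exact layout is not described in this summary; a generic bitmap-plus-values encoding in that spirit can be sketched as follows (1 presence bit per weight plus a dense array of non-zeros, with hypothetical function names):

```python
import numpy as np

def compress_bitmap(w):
    """Compress an unstructured-sparse matrix into a packed bitmap plus
    a dense array of its non-zero values (row-major order)."""
    mask = w != 0
    bitmap = np.packbits(mask)   # 1 bit of presence per element
    values = w[mask]
    return bitmap, values, w.shape

def decompress_bitmap(bitmap, values, shape):
    """Scatter the non-zero values back into place using the bitmap."""
    n = shape[0] * shape[1]
    mask = np.unpackbits(bitmap, count=n).astype(bool)
    out = np.zeros(n, dtype=values.dtype)
    out[mask] = values
    return out.reshape(shape)
```

At high sparsity the bitmap costs a fixed 1 bit per weight, and decompression is a single scatter, which is the kind of low-overhead trade such formats target.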
arXiv Detail & Related papers (2024-06-17T15:55:08Z) - ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution [0.0502254944841629]
Deep-learning-driven super-resolution (SR) outperforms traditional techniques but faces the challenge of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU enhances image quality by 0.34dB with a 27-layer model while requiring 36% less complexity than FSRCNN.
arXiv Detail & Related papers (2023-08-30T07:23:32Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense
Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - FlashAttention: Fast and Memory-Efficient Exact Attention with
IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
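The idea can be illustrated with a tiled, numerically stable ("online") softmax over key/value blocks, so the full n-by-n score matrix is never materialized; this NumPy sketch follows the structure of the published algorithm but simplifies away the GPU-specific details (block size and names here are illustrative):

```python
import numpy as np

def tiled_attention(q, k, v, block=4):
    """Exact softmax attention computed one key/value block at a time,
    keeping only running per-row statistics instead of all n*n scores."""
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    m = np.full(n, -np.inf)        # running row-max of the scores
    l = np.zeros(n)                # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, n, block):
        s = (q @ k[j:j + block].T) * scale      # scores vs. this block
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        corr = np.exp(m - m_new)                # rescale old partial sums
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ v[j:j + block]
        m = m_new
    return out / l[:, None]        # normalize once at the end
```

Because each block's contribution is rescaled as the running maximum updates, the result is bit-for-bit the same attention output, just computed with tile-sized working memory.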
arXiv Detail & Related papers (2022-05-27T17:53:09Z) - A Real Time 1280x720 Object Detection Chip With 585MB/s Memory Traffic [1.553339756999288]
This paper proposes a low memory traffic DLA chip with joint hardware and software optimization.
To maximize hardware utilization under memory bandwidth, we morph and fuse the object detection model into a group fusion-ready model.
This reduces the YOLOv2's feature memory traffic from 2.9 GB/s to 0.15 GB/s.
arXiv Detail & Related papers (2022-05-02T09:58:39Z) - BSRA: Block-based Super Resolution Accelerator with Hardware Efficient
Pixel Attention [0.10547353841674209]
This paper proposes a super resolution hardware accelerator with hardware efficient pixel attention.
The final implementation can support full HD image reconstruction at 30 frames per second with TSMC 40nm CMOS process.
arXiv Detail & Related papers (2022-05-02T09:56:29Z) - Revisiting Multi-Scale Feature Fusion for Semantic Segmentation [90.32746095413447]
In this paper, we demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate semantic segmentation.
We develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions.
Our simple method can achieve better accuracy with faster speed than prior art across multiple datasets.
arXiv Detail & Related papers (2022-03-23T19:14:11Z) - Projected GANs Converge Faster [50.23237734403834]
Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train.
We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space.
Our Projected GAN improves image quality, sample efficiency, and convergence speed.
arXiv Detail & Related papers (2021-11-01T15:11:01Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent
Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - ATTACC the Quadratic Bottleneck of Attention Layers [3.2741800634280245]
This paper introduces a new attention-tailored dataflow, termed FLAT, for deep neural network (DNN) accelerators.
It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% of energy reduction compared to state-of-the-art edge and cloud accelerators.
arXiv Detail & Related papers (2021-07-13T22:23:40Z) - Low Latency CMOS Hardware Acceleration for Fully Connected Layers in
Deep Neural Networks [1.9036571490366496]
The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements for matrix-vector multiplication.
The design can reduce latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 when compared to an alternative EIE solution.
arXiv Detail & Related papers (2020-11-25T15:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.