A Real Time Super Resolution Accelerator with Tilted Layer Fusion
- URL: http://arxiv.org/abs/2205.03997v1
- Date: Mon, 9 May 2022 01:47:02 GMT
- Title: A Real Time Super Resolution Accelerator with Tilted Layer Fusion
- Authors: An-Jung Huang, Kai-Chieh Hsu and Tian-Sheuan Chang
- Abstract summary: This paper proposes a real-time hardware accelerator with the tilted layer fusion method that reduces the external DRAM bandwidth by 92% and just needs 102KB on-chip memory.
The design implemented with a 40nm CMOS process achieves 1920x1080@60fps throughput with 544.3K gate count when running at 600MHz; it has higher throughput and lower area cost than previous designs.
- Score: 0.10547353841674209
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep-learning-based super-resolution achieves high-quality results, but its heavy computational workload, large buffer requirements, and high external memory bandwidth inhibit its use in mobile devices. To solve these issues, this paper proposes a real-time hardware accelerator with the tilted layer fusion method that reduces external DRAM bandwidth by 92% and needs only 102KB of on-chip memory. The design, implemented in a 40nm CMOS process, achieves 1920x1080@60fps throughput with a 544.3K gate count when running at 600MHz; it has higher throughput and lower area cost than previous designs.
Related papers
- Endor: Hardware-Friendly Sparse Format for Offloaded LLM Inference [47.043257902725294]
We propose a novel sparse format that compresses the unstructured sparse pattern of pruned LLM weights into non-zero values with a high compression ratio and low decompression overhead.
Compared to offloaded inference using the popular Huggingface Accelerate, applying Endor accelerates OPT-66B by 1.70x and Llama2-70B by 1.78x.
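Endor's exact layout is not described in this summary; a generic bitmap-plus-values encoding in that spirit can be sketched as follows (1 presence bit per weight plus a dense array of non-zeros, with hypothetical function names):

```python
import numpy as np

def compress_bitmap(w):
    """Compress an unstructured-sparse matrix into a packed bitmap plus
    a dense array of its non-zero values (row-major order)."""
    mask = w != 0
    bitmap = np.packbits(mask)   # 1 bit of presence per element
    values = w[mask]
    return bitmap, values, w.shape

def decompress_bitmap(bitmap, values, shape):
    """Scatter the non-zero values back into place using the bitmap."""
    n = shape[0] * shape[1]
    mask = np.unpackbits(bitmap, count=n).astype(bool)
    out = np.zeros(n, dtype=values.dtype)
    out[mask] = values
    return out.reshape(shape)
```

At high sparsity the bitmap costs a fixed 1 bit per weight, and decompression is a single scatter, which is the kind of low-overhead trade such formats target.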
arXiv Detail & Related papers (2024-06-17T15:55:08Z) - ACNPU: A 4.75TOPS/W 1080P@30FPS Super Resolution Accelerator with
Decoupled Asymmetric Convolution [0.0502254944841629]
Deep-learning-driven super-resolution (SR) outperforms traditional techniques but faces the challenge of high complexity and memory bandwidth.
This paper proposes an energy-efficient SR accelerator, ACNPU, to tackle this challenge.
The ACNPU enhances image quality by 0.34dB with a 27-layer model while requiring 36% less complexity than FSRCNN.
arXiv Detail & Related papers (2023-08-30T07:23:32Z) - EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense
Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention.
Our multi-scale linear attention achieves a global receptive field and multi-scale learning.
EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z) - FlashAttention: Fast and Memory-Efficient Exact Attention with
IO-Awareness [80.3586155104237]
FlashAttention is an IO-aware exact attention algorithm for Transformers.
It reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM.
FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
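The idea can be illustrated with a tiled, numerically stable ("online") softmax over key/value blocks, so the full n-by-n score matrix is never materialized; this NumPy sketch follows the structure of the published algorithm but simplifies away the GPU-specific details (block size and names here are illustrative):

```python
import numpy as np

def tiled_attention(q, k, v, block=4):
    """Exact softmax attention computed one key/value block at a time,
    keeping only running per-row statistics instead of all n*n scores."""
    n, d = q.shape
    out = np.zeros_like(v, dtype=np.float64)
    m = np.full(n, -np.inf)        # running row-max of the scores
    l = np.zeros(n)                # running softmax denominator
    scale = 1.0 / np.sqrt(d)
    for j in range(0, n, block):
        s = (q @ k[j:j + block].T) * scale      # scores vs. this block
        m_new = np.maximum(m, s.max(axis=1))
        p = np.exp(s - m_new[:, None])
        corr = np.exp(m - m_new)                # rescale old partial sums
        l = l * corr + p.sum(axis=1)
        out = out * corr[:, None] + p @ v[j:j + block]
        m = m_new
    return out / l[:, None]        # normalize once at the end
```

Because each block's contribution is rescaled as the running maximum updates, the result is bit-for-bit the same attention output, just computed with tile-sized working memory.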
arXiv Detail & Related papers (2022-05-27T17:53:09Z) - A Real Time 1280x720 Object Detection Chip With 585MB/s Memory Traffic [1.553339756999288]
This paper proposes a low memory traffic DLA chip with joint hardware and software optimization.
To maximize hardware utilization under memory bandwidth, we morph and fuse the object detection model into a group fusion-ready model.
This reduces the YOLOv2's feature memory traffic from 2.9 GB/s to 0.15 GB/s.
arXiv Detail & Related papers (2022-05-02T09:58:39Z) - BSRA: Block-based Super Resolution Accelerator with Hardware Efficient
Pixel Attention [0.10547353841674209]
This paper proposes a super resolution hardware accelerator with hardware efficient pixel attention.
The final implementation can support full HD image reconstruction at 30 frames per second with TSMC 40nm CMOS process.
arXiv Detail & Related papers (2022-05-02T09:56:29Z) - Revisiting Multi-Scale Feature Fusion for Semantic Segmentation [90.32746095413447]
In this paper, we demonstrate that neither high internal resolution nor atrous convolutions are necessary for accurate semantic segmentation.
We develop a simplified segmentation model, named ESeg, which has neither high internal resolution nor expensive atrous convolutions.
Our simple method can achieve better accuracy with faster speed than prior art across multiple datasets.
arXiv Detail & Related papers (2022-03-23T19:14:11Z) - Projected GANs Converge Faster [50.23237734403834]
Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train.
We make significant headway on these issues by projecting generated and real samples into a fixed, pretrained feature space.
Our Projected GAN improves image quality, sample efficiency, and convergence speed.
arXiv Detail & Related papers (2021-11-01T15:11:01Z) - A TinyML Platform for On-Device Continual Learning with Quantized Latent
Replays [66.62377866022221]
Latent Replay-based Continual Learning (CL) techniques enable online, serverless adaptation in principle.
We introduce a HW/SW platform for end-to-end CL based on a 10-core FP32-enabled parallel ultra-low-power processor.
Our results show that by combining these techniques, continual learning can be achieved in practice using less than 64MB of memory.
arXiv Detail & Related papers (2021-10-20T11:01:23Z) - ATTACC the Quadratic Bottleneck of Attention Layers [3.2741800634280245]
This paper introduces a new attention-tailored dataflow, termed FLAT, for deep neural network (DNN) accelerators.
It increases the effective memory bandwidth by efficiently utilizing the high-bandwidth, low-capacity on-chip buffer.
In our evaluation, ATTACC achieves 1.94x and 1.76x speedup and 49% and 42% of energy reduction compared to state-of-the-art edge and cloud accelerators.
arXiv Detail & Related papers (2021-07-13T22:23:40Z) - Low Latency CMOS Hardware Acceleration for Fully Connected Layers in
Deep Neural Networks [1.9036571490366496]
The FC accelerator, FC-ACCL, is based on 128 8x8 or 16x16 processing elements for matrix-vector multiplication.
The design can reduce latency for the large FC6 layer by 60% in AlexNet and by 3% in VGG16 when compared to an alternative EIE solution.
arXiv Detail & Related papers (2020-11-25T15:49:38Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.