Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution
- URL: http://arxiv.org/abs/2312.06101v2
- Date: Wed, 8 May 2024 12:36:49 GMT
- Title: Hundred-Kilobyte Lookup Tables for Efficient Single-Image Super-Resolution
- Authors: Binxiao Huang, Jason Chun Lok Li, Jie Ran, Boyu Li, Jiajun Zhou, Dahai Yu, Ngai Wong
- Abstract summary: Super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations.
This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources.
This work tackles this storage hurdle and innovates hundred-kilobyte LUT (HKLUT) models amenable to on-chip cache.
- Score: 7.403264755337134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Conventional super-resolution (SR) schemes make heavy use of convolutional neural networks (CNNs), which involve intensive multiply-accumulate (MAC) operations, and require specialized hardware such as graphics processing units. This contradicts the regime of edge AI that often runs on devices strained by power, computing, and storage resources. Such a challenge has motivated a series of lookup table (LUT)-based SR schemes that employ simple LUT readout and largely elude CNN computation. Nonetheless, the multi-megabyte LUTs in existing methods still prohibit on-chip storage and necessitate off-chip memory transport. This work tackles this storage hurdle and innovates hundred-kilobyte LUT (HKLUT) models amenable to on-chip cache. Utilizing an asymmetric two-branch multistage network coupled with a suite of specialized kernel patterns, HKLUT demonstrates an uncompromising performance and superior hardware efficiency over existing LUT schemes. Our implementation is publicly available at: https://github.com/jasonli0707/hklut.
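To make the readout idea concrete, here is a minimal NumPy sketch of the generic LUT-based upscaling scheme that SR-LUT-style methods build on. The 2x2 patch indexing, 4-bit index quantization, upscale factor of 2, and randomly filled table are illustrative assumptions, not HKLUT's actual two-branch multistage design or kernel patterns:

```python
import numpy as np

# Illustrative assumptions (not HKLUT's actual design): 2x2 patch index,
# 4-bit quantization, upscale factor 2, and a randomly filled table standing
# in for one exhaustively precomputed from a trained network.
BITS, R = 4, 2
LEVELS = 1 << BITS                               # 16 index levels per pixel
rng = np.random.default_rng(0)

# One entry per 4-pixel index combination; each entry is an RxR patch.
lut = rng.integers(0, 256, size=(LEVELS,) * 4 + (R, R), dtype=np.uint8)
print(f"LUT size: {lut.nbytes / 1024:.0f} KiB")  # 16^4 * 4 B = 256 KiB

def upscale(img):
    """Upscale a uint8 image by R using pure table readout (no MACs)."""
    h, w = img.shape
    pad = np.pad(img, ((0, 1), (0, 1)), mode="edge")  # bottom/right context
    q = pad >> (8 - BITS)                             # 4-bit pixel indices
    out = np.empty((h * R, w * R), dtype=np.uint8)
    for y in range(h):
        for x in range(w):
            patch = lut[q[y, x], q[y, x + 1], q[y + 1, x], q[y + 1, x + 1]]
            out[y * R:(y + 1) * R, x * R:(x + 1) * R] = patch
    return out

lr = rng.integers(0, 256, size=(8, 8), dtype=np.uint8)
print(upscale(lr).shape)                              # (16, 16)
```

Note how 4-bit indexing keeps a 4-pixel table at 256 KiB, the hundred-kilobyte regime the title alludes to; full 8-bit indexing would need 256^4 entries, roughly 17 GiB at 4 bytes each.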
Related papers
- Taming Lookup Tables for Efficient Image Retouching [30.48643578900116]
We propose ICELUT, which adopts LUTs for extremely efficient edge inference, without any convolutional neural network (CNN).
ICELUT achieves near-state-of-the-art performance and remarkably low power consumption.
These enable ICELUT, the first-ever purely LUT-based image enhancer, to reach an unprecedented speed of 0.4ms on GPU and 7ms on CPU, at least an order of magnitude faster than any CNN solution.
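For context, the elementary building block behind LUT-based retouching is a 3D color table indexed by RGB values. The grid size, warm-tone table contents, and nearest-entry readout in this sketch are illustrative assumptions, not ICELUT's actual architecture:

```python
import numpy as np

# Illustrative assumptions (not ICELUT's actual pipeline): a 17^3 grid,
# a hand-made warm-tone table, and nearest-entry readout.
N = 17                                   # common 3D-LUT grid resolution
axis = np.linspace(0, 1, N)
r, g, b = np.meshgrid(axis, axis, axis, indexing="ij")
lut = np.stack([np.clip(r * 1.1, 0, 1), g, b * 0.9], axis=-1)

def apply_lut(img):
    """Map each RGB pixel through the 3D LUT by nearest-entry readout
    (production code would interpolate among the 8 surrounding entries)."""
    idx = np.clip(np.rint(img * (N - 1)).astype(int), 0, N - 1)
    return lut[idx[..., 0], idx[..., 1], idx[..., 2]]

img = np.random.default_rng(0).random((4, 4, 3))      # float RGB in [0, 1]
print(apply_lut(img).shape)                           # (4, 4, 3)
```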
arXiv Detail & Related papers (2024-03-28T08:49:35Z) - DeepGEMM: Accelerated Ultra Low-Precision Inference on CPU Architectures using Lookup Tables [49.965024476651706]
DeepGEMM is a lookup table based approach for the execution of ultra low-precision convolutional neural networks on SIMD hardware.
Our implementation outperforms corresponding 8-bit integer kernels by up to 1.74x on x86 platforms.
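A sketch of the underlying trick, assuming 2-bit operands (the exact codebooks and SIMD shuffle kernels in DeepGEMM differ): with so few distinct operand values, every possible product fits in a tiny precomputed table, so a dot product reduces to lookups and adds:

```python
import numpy as np

# Illustrative 2-bit codebooks; DeepGEMM's real kernels pack indices and
# use SIMD shuffle instructions for the readout.
W_LEVELS = np.array([-2, -1, 0, 1])       # 2-bit signed weight values
A_LEVELS = np.array([0, 1, 2, 3])         # 2-bit activation values
PROD_LUT = np.outer(W_LEVELS, A_LEVELS)   # all 16 products, precomputed

def lut_dot(w_idx, a_idx):
    """Dot product of coded vectors via table readout, no multiplies."""
    return PROD_LUT[w_idx, a_idx].sum()

rng = np.random.default_rng(0)
w_idx = rng.integers(0, 4, size=64)       # 2-bit weight codes
a_idx = rng.integers(0, 4, size=64)       # 2-bit activation codes
assert lut_dot(w_idx, a_idx) == (W_LEVELS[w_idx] * A_LEVELS[a_idx]).sum()
```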
arXiv Detail & Related papers (2023-04-18T15:13:10Z) - Toward DNN of LUTs: Learning Efficient Image Restoration with Multiple Look-Up Tables [47.15181829317732]
High-definition screens on edge devices stimulate a strong demand for efficient image restoration algorithms.
The size of a single look-up table grows exponentially with its indexing capacity.
We propose a universal method to construct multiple LUTs like a neural network, termed MuLUT.
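A quick back-of-the-envelope computation shows the exponential blow-up that motivates this, assuming SR-LUT-style 4-bit indexing and 16 output bytes per entry (MuLUT's exact configurations differ):

```python
# Assumed figures: 4-bit index per pixel, 16 output bytes per entry
# (e.g. a 4x4 uint8 patch for 4x upscaling); MuLUT's configurations differ.
bits, out_bytes = 4, 16
for n_pixels in (2, 3, 4, 5, 6):
    entries = (1 << bits) ** n_pixels
    print(f"{n_pixels} pixels -> {entries * out_bytes / 1024:>9,.0f} KiB")
# 4 indexing pixels already cost 1 MiB and 6 cost 256 MiB. Cascading
# several small-fan-in LUTs widens the receptive field while total size
# grows only linearly in the number of tables.
```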
arXiv Detail & Related papers (2023-03-25T16:00:33Z) - Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution [90.16462805389943]
We develop a spatially-adaptive feature modulation (SAFM) mechanism upon a vision transformer (ViT)-like block.
The proposed method is $3\times$ smaller than state-of-the-art efficient SR methods.
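A generic sketch of spatially-adaptive modulation, assuming simple average pooling and a sigmoid gate; the actual SAFM block partitions channels and uses learned convolutions, so this only illustrates the multi-scale modulate-by-multiplication mechanism:

```python
import numpy as np

# Generic mechanism only: the actual SAFM block partitions channels and
# uses learned convolutions rather than plain pooling and a sigmoid.
def pool_upsample(x, s):
    """Average-pool by factor s, then nearest-neighbor upsample back."""
    h, w = x.shape
    p = x[:h - h % s, :w - w % s].reshape(h // s, s, w // s, s).mean((1, 3))
    return np.kron(p, np.ones((s, s)))[:h, :w]

def safm_like(x):
    """Modulate features with a map fused from multi-scale views."""
    views = [pool_upsample(x, s) for s in (1, 2, 4)]
    gate = 1.0 / (1.0 + np.exp(-np.mean(views, axis=0)))   # sigmoid
    return x * gate

feat = np.random.default_rng(0).standard_normal((8, 8))
print(safm_like(feat).shape)                               # (8, 8)
```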
arXiv Detail & Related papers (2023-02-27T14:19:31Z) - Exploiting Kernel Compression on BNNs [0.0]
In this work, we observe that the number of unique sequences representing a set of weights is typically low.
We propose a clustering scheme to identify the most common sequences of bits and replace the less common ones with some similar common sequences.
Our experimental results show that our technique can reduce memory requirement by 1.32x and improve performance by 1.35x.
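A minimal sketch of the clustering idea, with illustrative sizes and keep-count (the paper's exact clustering procedure may differ): keep the most frequent bit sequences and snap the rest to their nearest kept sequence by Hamming distance:

```python
import numpy as np
from collections import Counter

# Illustrative sizes and keep-count; the paper's clustering procedure
# may differ in how representatives are chosen.
rng = np.random.default_rng(0)
seqs = rng.integers(0, 2, size=(256, 8))       # 256 8-bit weight sequences
counts = Counter(map(tuple, seqs))
common = np.array(sorted(counts, key=counts.get, reverse=True)[:16])

def snap(row):
    """Replace a sequence with its nearest common one (Hamming distance)."""
    return common[np.argmin((common != row).sum(axis=1))]

compressed = np.array([snap(r) for r in seqs])
print("unique before/after:", len(counts), len(set(map(tuple, compressed))))
# Only 16 unique sequences remain, so each row stores a 4-bit dictionary
# index instead of the full bit pattern.
```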
arXiv Detail & Related papers (2022-12-01T16:05:10Z) - Instant Neural Graphics Primitives with a Multiresolution Hash Encoding [67.33850633281803]
We present a versatile new input encoding that permits the use of a smaller network without sacrificing quality.
A small neural network is augmented by a multiresolution hash table of trainable feature vectors whose values are optimized through gradient descent.
We achieve a combined speedup of several orders of magnitude, enabling training of high-quality neural graphics primitives in a matter of seconds.
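A stripped-down sketch of the encoding, using the spatial-hash primes from the paper but omitting the corner interpolation and training loop; table size, level count, and feature width are illustrative:

```python
import numpy as np

# Illustrative level count, table size, and feature width; the paper's
# encoding also interpolates among the cell's corner entries.
LEVELS, TABLE, FDIM = 4, 2**10, 2
rng = np.random.default_rng(0)
tables = rng.standard_normal((LEVELS, TABLE, FDIM)) * 1e-2  # trainable

PRIMES = np.array([1, 2_654_435_761])          # spatial-hash multipliers

def encode(xy):
    """Encode a 2D point in [0,1)^2 into LEVELS * FDIM features."""
    feats = []
    for lvl in range(LEVELS):
        res = 16 * 2**lvl                      # finer grid at each level
        cell = np.floor(xy * res).astype(np.int64)
        h = np.bitwise_xor.reduce(cell * PRIMES) % TABLE
        feats.append(tables[lvl, h])
    return np.concatenate(feats)

print(encode(np.array([0.3, 0.7])).shape)      # (8,)
```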
arXiv Detail & Related papers (2022-01-16T07:22:47Z) - CREW: Computation Reuse and Efficient Weight Storage for Hardware-accelerated MLPs and RNNs [1.0635248457021496]
We present CREW, a hardware accelerator that implements computation reuse and an efficient weight storage mechanism.
CREW greatly reduces the number of multiplications and provides significant savings in model memory footprint and memory bandwidth usage.
On average, CREW provides 2.61x speedup and 2.42x energy savings over a TPU-like accelerator.
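The reuse idea can be sketched in a few lines: once weights are quantized to a small codebook, inputs sharing a weight value can be summed first so each unique weight is multiplied only once (codebook width and sizes here are illustrative, not CREW's configuration):

```python
import numpy as np

# Illustrative codebook width and sizes, not CREW's configuration.
rng = np.random.default_rng(0)
x = rng.standard_normal(512)                   # inputs to one neuron
levels = np.linspace(-1, 1, 16)                # 4-bit weight codebook
w_idx = rng.integers(0, 16, size=512)          # quantized weight codes

# Naive: one multiply per input (512). Reuse: bucket-sum inputs by weight
# code, then one multiply per unique weight (16).
naive = (levels[w_idx] * x).sum()
bucket_sums = np.bincount(w_idx, weights=x, minlength=16)
reuse = (levels * bucket_sums).sum()
assert np.isclose(naive, reuse)
print(f"multiplies: naive=512, with reuse={len(levels)}")
```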
arXiv Detail & Related papers (2021-07-20T11:10:54Z) - Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration [83.84684675841167]
We propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks.
We validate the effectiveness of our method on large-scale image classification, object detection, and semantic segmentation tasks.
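The algebra behind the decomposition, sketched with illustrative sizes: an M-bit weight w = sum_i 2^i b_i with b_i in {0, 1} rewrites via b_i = (s_i + 1)/2 into binary {-1, +1} branches plus a constant term:

```python
import numpy as np

# Illustrative bit width and sizes.
M = 4
rng = np.random.default_rng(0)
W = rng.integers(0, 2**M, size=(3, 5))         # M-bit quantized weights
x = rng.standard_normal(5)

bits = (W[None] >> np.arange(M)[:, None, None]) & 1  # (M, 3, 5) in {0, 1}
signs = 2 * bits - 1                                 # {-1, +1} branches

# W @ x = sum_i 2^(i-1) * (S_i @ x) + (2^M - 1)/2 * sum(x)
branch = sum(2.0 ** (i - 1) * (signs[i] @ x) for i in range(M))
full = branch + (2**M - 1) / 2 * x.sum()
assert np.allclose(full, W @ x)
```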
arXiv Detail & Related papers (2021-06-18T03:11:15Z) - VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose VersaGNN, an ultra-efficient, systolic-array-based versatile hardware accelerator.
VersaGNN achieves on average 3712$\times$ speedup with 1301.25$\times$ energy reduction on CPU, and 35.4$\times$ speedup with 17.66$\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z) - PoET-BiN: Power Efficient Tiny Binary Neurons [1.7274221736253095]
We propose PoET-BiN, a Look-Up Table based power efficient implementation on resource constrained embedded devices.
A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain.
A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses.
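This trade is easy to sketch: a binary neuron over K binary inputs realizes a truth table with 2^K entries, so it can be precomputed once into a LUT and evaluated by a single indexed read (weights and sizes here are illustrative):

```python
import numpy as np

# Illustrative weights and fan-in; PoET-BiN derives its LUTs from a
# modified decision-tree procedure rather than a plain threshold neuron.
K = 6
rng = np.random.default_rng(0)
w = rng.choice([-1, 1], size=K)                # binary weights

# Precompute the neuron's output for all 2^K binary input patterns.
patterns = ((np.arange(2**K)[:, None] >> np.arange(K)) & 1) * 2 - 1
lut = ((patterns @ w) >= 0).astype(np.uint8)   # 64 one-bit entries

def fire(x_bits):
    """Evaluate the neuron by a single LUT read; x_bits in {0, 1}^K."""
    return lut[int(np.dot(x_bits, 1 << np.arange(K)))]

x = rng.integers(0, 2, size=K)
assert fire(x) == int((2 * x - 1) @ w >= 0)
```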
arXiv Detail & Related papers (2020-02-23T00:32:21Z)