Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference
of Deep Learning
- URL: http://arxiv.org/abs/2307.07631v1
- Date: Fri, 14 Jul 2023 21:01:59 GMT
- Title: Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference
of Deep Learning
- Authors: Davide Giacomini, Maeesha Binte Hashem, Jeremiah Suarez, Swarup
Bhunia, and Amit Ranjan Trivedi
- Abstract summary: This paper proposes a novel memorization-based inference (MBI) that is compute free and only requires lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low-dimensionality of the glimpse, our inference procedure stores key-value pairs comprising the glimpse location, patch vector, etc. in a table.
Computations are obviated during inference by using the table to read out matching key-value pairs, performing compute-free inference by memorization.
- Score: 5.41530201129053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The rapid advancement of deep neural networks has significantly improved
various tasks, such as image and speech recognition. However, as the complexity
of these models increases, so does the computational cost and the number of
parameters, making it difficult to deploy them on resource-constrained devices.
This paper proposes a novel memorization-based inference (MBI) that is compute
free and only requires lookups. Specifically, our work capitalizes on the
inference mechanism of the recurrent attention model (RAM), where only a small
window of the input domain (a glimpse) is processed in each time step, and the
outputs from multiple glimpses are combined through a hidden vector to
determine the overall classification output of the problem. By leveraging the
low-dimensionality of the glimpse, our inference procedure stores key-value pairs
comprising the glimpse location, patch vector, etc. in a table. Computations are
obviated during inference by using the table to read out matching key-value
pairs, thereby performing compute-free inference by memorization. By exploiting
Bayesian optimization and clustering, the necessary lookups are reduced, and
accuracy is improved. We also present in-memory computing circuits to quickly
look up the key vector matching an input query. Compared to competitive
compute-in-memory (CIM) approaches, MBI improves energy efficiency by almost
2.7 times relative to a multilayer perceptron (MLP)-CIM and by almost 83 times
relative to ResNet20-CIM for MNIST character recognition.
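To make the lookup-based inference flow concrete, below is a minimal Python sketch of how a memorization table might replace per-glimpse computation. All names (MBITable, encode_location, the bit-vector key layout, and the table sizes) are illustrative assumptions rather than the paper's implementation; the Bayesian-optimization and clustering steps that shrink the table, and the in-memory matching circuits, are only hinted at in comments.

```python
import numpy as np


class MBITable:
    """Table of memorized (key, value) pairs. Keys are low-dimensional bit
    vectors built from a glimpse location, patch, and hidden state; values are
    the outputs a trained RAM produced for those keys."""

    def __init__(self, keys: np.ndarray, values: np.ndarray):
        self.keys = keys        # shape (num_entries, key_bits), dtype uint8
        self.values = values    # shape (num_entries, value_dim)

    def lookup(self, query: np.ndarray) -> np.ndarray:
        # Nearest stored key under Hamming distance. In hardware this search is
        # the in-memory matching step; here it is a plain vectorized comparison.
        dists = np.count_nonzero(self.keys != query, axis=1)
        return self.values[np.argmin(dists)]


def encode_location(loc) -> np.ndarray:
    # Encode a (row, col) glimpse location as a 16-bit vector.
    return np.unpackbits(np.asarray(loc, dtype=np.uint8))


def binarize_patch(patch: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    # Threshold a normalized patch so it can be concatenated into the key.
    return (patch.ravel() > threshold).astype(np.uint8)


def mbi_classify(image, glimpse_locations, step_table, readout_table,
                 patch_size=8, hidden_bits=16):
    """RAM-style inference with the per-glimpse computation replaced by table
    read-outs: each lookup returns the next (binarized) hidden state, and a
    final lookup maps the accumulated hidden state to class scores."""
    hidden = np.zeros(hidden_bits, dtype=np.uint8)
    for loc in glimpse_locations:
        r, c = loc
        patch = image[r:r + patch_size, c:c + patch_size]
        query = np.concatenate([encode_location(loc), binarize_patch(patch), hidden])
        hidden = step_table.lookup(query).astype(np.uint8)  # compute-free step
    return int(np.argmax(readout_table.lookup(hidden)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    key_bits = 16 + 8 * 8 + 16  # location bits + patch bits + hidden bits
    # Randomly populated tables, standing in for entries harvested from a trained RAM.
    step_table = MBITable(rng.integers(0, 2, (512, key_bits), dtype=np.uint8),
                          rng.integers(0, 2, (512, 16), dtype=np.uint8))
    readout_table = MBITable(rng.integers(0, 2, (64, 16), dtype=np.uint8),
                             rng.random((64, 10)))
    image = rng.random((28, 28))  # MNIST-sized dummy input in [0, 1]
    print(mbi_classify(image, [(0, 0), (10, 10), (20, 20)], step_table, readout_table))
```

Hamming-distance matching over binarized keys stands in here for the in-memory matching circuit; in practice, clustering similar keys would reduce the number of stored entries that must be searched per lookup.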
Related papers
- RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [24.472784635757016]
RetrievalAttention is a training-free approach to both accelerate attention computation and reduce GPU memory consumption.
Our evaluation shows that RetrievalAttention only needs to access 1--3% of data while maintaining high model accuracy.
arXiv Detail & Related papers (2024-09-16T17:59:52Z) - Value-Driven Mixed-Precision Quantization for Patch-Based Inference on
Microcontrollers [35.666772630923234]
QuantMCU is a novel patch-based inference method that utilizes value-driven mixed-precision quantization to reduce redundant computation.
We show that QuantMCU can reduce computation by 2.2x on average while maintaining comparable model accuracy.
arXiv Detail & Related papers (2024-01-24T04:21:41Z) - Heterogenous Memory Augmented Neural Networks [84.29338268789684]
We introduce a novel heterogeneous memory augmentation approach for neural networks.
By incorporating learnable memory tokens with an attention mechanism, we can effectively boost performance without large computational overhead.
We demonstrate our approach on various image- and graph-based tasks under both in-distribution (ID) and out-of-distribution (OOD) conditions.
arXiv Detail & Related papers (2023-10-17T01:05:28Z) - An Efficient Algorithm for Clustered Multi-Task Compressive Sensing [60.70532293880842]
Clustered multi-task compressive sensing is a hierarchical model that solves multiple compressive sensing tasks.
The existing inference algorithm for this model is computationally expensive and does not scale well in high dimensions.
We propose a new algorithm that substantially accelerates model inference by avoiding the need to explicitly compute the model's covariance matrices.
arXiv Detail & Related papers (2023-09-30T15:57:14Z) - Incrementally-Computable Neural Networks: Efficient Inference for
Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformer architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z) - Fast and Private Inference of Deep Neural Networks by Co-designing Activation Functions [26.125340303868335]
Current approaches suffer from large inference times.
We propose a novel training algorithm that gives accuracy competitive with existing models.
Our evaluation shows between $3\times$ and $110\times$ speedups in inference time on large models with up to $23$ million parameters.
arXiv Detail & Related papers (2023-06-14T14:38:25Z) - A Theory of I/O-Efficient Sparse Neural Network Inference [17.862408781750126]
Machine learning models are improving in accuracy at a fast rate, so their demand for energy and compute resources is growing as well.
At a low level, most of these resources are consumed by data movement between different memory units.
We provide a rigorous theoretical analysis of the I/Os needed in sparse feedforward neural network (FFNN) inference.
arXiv Detail & Related papers (2023-01-03T11:23:46Z) - UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
We propose a 3D medical image segmentation approach, named UNETR++, that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed.
The core of our design is the introduction of a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features.
Our evaluations on five benchmarks, Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung, reveal the effectiveness of our contributions in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2022-12-08T18:59:57Z) - Sparse Attention Acceleration with Synergistic In-Memory Pruning and
On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite its favorable performance, calculating these pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z) - ClusTR: Exploring Efficient Self-attention via Clustering for Vision
Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost (a rough sketch of this idea appears after this list).
arXiv Detail & Related papers (2022-08-28T04:18:27Z) - Decoupled and Memory-Reinforced Networks: Towards Effective Feature
Learning for One-Step Person Search [65.51181219410763]
One-step methods have been developed to handle pedestrian detection and identification sub-tasks using a single network.
There are two major challenges in the current one-step approaches.
We propose a decoupled and memory-reinforced network (DMRNet) to overcome these problems.
arXiv Detail & Related papers (2021-02-22T06:19:45Z)
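The ClusTR entry above describes clustering and aggregating key and value tokens so that attention runs over far fewer tokens. As a rough illustration only (not the paper's actual clustering, aggregation, or transformer integration), the snippet below clusters keys with plain k-means and lets each query attend over the cluster centroids.

```python
import numpy as np


def clustered_attention(q, k, v, num_clusters=16, iters=5, rng=None):
    """Illustrative content-based sparse attention: cluster the key tokens with
    k-means, aggregate the values per cluster, and attend over centroids,
    reducing the attention cost from O(N_q * N_k) to O(N_q * num_clusters)."""
    rng = rng or np.random.default_rng(0)
    n_k, d = k.shape
    centers = k[rng.choice(n_k, num_clusters, replace=False)]
    for _ in range(iters):  # plain k-means on the key tokens
        dists = ((k[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        assign = dists.argmin(1)
        for c in range(num_clusters):
            members = assign == c
            if members.any():
                centers[c] = k[members].mean(0)
    # Aggregate values per cluster so the reduced sequence keeps semantic content.
    v_agg = np.stack([v[assign == c].mean(0) if (assign == c).any()
                      else np.zeros(v.shape[1]) for c in range(num_clusters)])
    scores = q @ centers.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(1, keepdims=True))
    weights /= weights.sum(1, keepdims=True)
    return weights @ v_agg


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    q = rng.normal(size=(128, 64))
    k = rng.normal(size=(1024, 64))
    v = rng.normal(size=(1024, 64))
    print(clustered_attention(q, k, v).shape)  # (128, 64)
```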
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.