Incrementally-Computable Neural Networks: Efficient Inference for
Dynamic Inputs
- URL: http://arxiv.org/abs/2307.14988v1
- Date: Thu, 27 Jul 2023 16:30:27 GMT
- Title: Incrementally-Computable Neural Networks: Efficient Inference for
Dynamic Inputs
- Authors: Or Sharir and Anima Anandkumar
- Abstract summary: Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
- Score: 75.40636935415601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning often faces the challenge of efficiently processing dynamic
inputs, such as sensor data or user inputs. For example, an AI writing
assistant is required to update its suggestions in real time as a document is
edited. Re-running the model each time is expensive, even with compression
techniques like knowledge distillation, pruning, or quantization. Instead, we
take an incremental computing approach, looking to reuse calculations as the
inputs change. However, the dense connectivity of conventional architectures
poses a major obstacle to incremental computation, as even minor input changes
cascade through the network and restrict information reuse. To address this, we
use vector quantization to discretize intermediate values in the network, which
filters out noisy and unnecessary modifications to hidden neurons, facilitating
the reuse of their values. We apply this approach to the transformers
architecture, creating an efficient incremental inference algorithm with
complexity proportional to the fraction of the modified inputs. Our experiments
with adapting the OPT-125M pre-trained language model demonstrate comparable
accuracy on document classification while requiring 12.1X (median) fewer
operations for processing sequences of atomic edits.
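To make the idea concrete, here is a minimal NumPy sketch (an illustrative toy under stated assumptions, not the paper's algorithm): intermediate values are assigned discrete codes via vector quantization, the codes and outputs are cached, and after an edit only the tokens whose code changed are recomputed. It covers only a token-wise sub-layer; extending reuse through attention, as the paper does, requires more machinery.

```python
import numpy as np

def vq_codes(hidden, codebook):
    """Map each hidden vector to the index of its nearest codebook entry."""
    # hidden: (seq_len, dim); codebook: (num_codes, dim)
    d2 = ((hidden[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)                          # (seq_len,) discrete codes

def incremental_block(block_fn, hidden, codebook, cache=None):
    """Apply a token-wise block, reusing cached outputs for unchanged tokens.

    `cache` holds (codes, outputs) from the previous version of the input;
    only tokens whose discrete code changed are recomputed, so small or
    "noisy" edits that map to the same code are filtered out entirely.
    """
    codes = vq_codes(hidden, codebook)
    if cache is None:                                 # first pass: compute everything
        out = block_fn(hidden)
        return out, (codes, out)
    old_codes, old_out = cache
    changed = codes != old_codes                      # mask of genuinely modified tokens
    out = old_out.copy()
    if changed.any():
        out[changed] = block_fn(hidden[changed])      # recompute only the changed rows
    return out, (codes, out)

# Toy usage: process a sequence, then an edited copy of it.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(32, 16))
W = rng.normal(size=(16, 16))
block = lambda x: np.tanh(x @ W)                      # stand-in for an MLP sub-layer
seq = rng.normal(size=(10, 16))
out1, cache = incremental_block(block, seq, codebook)
seq_edit = seq.copy()
seq_edit[3] += 0.5                                    # an "atomic edit" at position 3
out2, cache = incremental_block(block, seq_edit, codebook, cache)
```

Because small perturbations often map to the same code, the quantization step is what keeps an edit from cascading through every downstream value.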
Related papers
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce arithmetic operation energy consumption by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Adaptive Computation Modules: Granular Conditional Computation For Efficient Inference [13.000030080938078]
The computational cost of transformer models makes them inefficient in low-latency or low-power applications.
We introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis.
Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.
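As a rough illustration of per-token conditional computation (the module structure and routing rule below are assumptions for the sketch, not the ACM design itself), a router can assign each token a depth, and only tokens with remaining depth pass through further sub-layers:

```python
import numpy as np

def adaptive_block(x, sub_weights, router_w):
    """Toy per-token conditional computation: a router assigns each token a depth,
    and only tokens with remaining depth pass through further sub-layers.
    Purely illustrative; the actual ACM design differs."""
    depth = (x @ router_w).argmax(axis=-1)            # (seq,) chosen depth per token
    out = x.copy()
    for i, w in enumerate(sub_weights):
        active = depth > i                            # tokens that still need work
        if active.any():
            out[active] = out[active] + np.maximum(out[active] @ w, 0.0)
    return out

rng = np.random.default_rng(0)
dim, num_sub = 16, 4
sub_weights = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(num_sub)]
router_w = rng.normal(size=(dim, num_sub + 1))        # logits over depths 0..num_sub
x = rng.normal(size=(8, dim))
y = adaptive_block(x, sub_weights, router_w)          # cheap tokens exit early
```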
arXiv Detail & Related papers (2023-12-15T20:39:43Z)
- Quantization of Deep Neural Networks to facilitate self-correction of weights on Phase Change Memory-based analog hardware [0.0]
We develop an algorithm to approximate a set of multiplicative weights.
These weights aim to represent the original network's weights with minimal loss in performance.
Our results demonstrate that, when paired with an on-chip pulse generator, our self-correcting neural network performs comparably to those trained with analog-aware algorithms.
arXiv Detail & Related papers (2023-09-30T10:47:25Z)
- Towards Model-Size Agnostic, Compute-Free, Memorization-based Inference of Deep Learning [5.41530201129053]
This paper proposes a novel memorization-based inference (MBI) that is compute free and only requires lookups.
Specifically, our work capitalizes on the inference mechanism of the recurrent attention model (RAM).
By leveraging the low dimensionality of the glimpse, our inference procedure stores key-value pairs comprising the glimpse location, patch vector, etc. in a table.
Computation is obviated during inference by reading these key-value pairs out of the table, yielding compute-free inference by memorization.
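A hedged sketch of the lookup idea (heavily simplified; the table construction, key format, and class name below are assumptions rather than the paper's design): quantize a low-dimensional glimpse descriptor into a hashable key and serve cached outputs by lookup instead of recomputation.

```python
import numpy as np

class MemorizedInference:
    """Toy memorization-based inference: outputs are served from a lookup table
    keyed on a coarsely quantized, low-dimensional descriptor. Illustrative only;
    the MBI paper builds its table from RAM-style glimpses."""

    def __init__(self, model_fn, resolution=0.5):
        self.model_fn = model_fn        # the expensive computation we want to avoid
        self.resolution = resolution    # quantization step for the lookup key
        self.table = {}                 # key -> cached output

    def _key(self, glimpse):
        # Low-dimensional descriptor -> hashable, coarsely quantized key.
        return tuple(np.round(glimpse / self.resolution).astype(int))

    def __call__(self, glimpse):
        key = self._key(glimpse)
        if key not in self.table:       # miss: pay the compute cost once
            self.table[key] = self.model_fn(glimpse)
        return self.table[key]          # hit: a compute-free table read

mbi = MemorizedInference(lambda g: float(np.tanh(g).sum()))
print(mbi(np.array([0.20, 1.10])))      # computed and stored
print(mbi(np.array([0.21, 1.08])))      # same quantized key -> pure lookup
```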
arXiv Detail & Related papers (2023-07-14T21:01:59Z)
- Training Integer-Only Deep Recurrent Neural Networks [3.1829446824051195]
We present a quantization-aware training method for obtaining a highly accurate integer-only recurrent neural network (iRNN).
Our approach supports layer normalization, attention, and an adaptive piecewise linear (PWL) approximation of activation functions.
The proposed method enables RNN-based language models to run on edge devices with a $2\times$ improvement in runtime.
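The piecewise-linear activation approximation can be sketched in integer-only arithmetic as follows; the breakpoints and slopes are hand-picked illustrative values for a sigmoid, not the parameters an iRNN would fit during quantization-aware training.

```python
import numpy as np

FRAC_BITS = 8
SCALE = 1 << FRAC_BITS                  # values stored as int32 with 8 fractional bits

def fx(x):
    """Float -> fixed-point int32."""
    return np.round(np.asarray(x, dtype=np.float64) * SCALE).astype(np.int32)

# Segment start points, sigmoid values at those points, and per-segment slopes.
# These numbers are hand-picked for illustration, not fitted as in the paper.
X0    = fx([-4.0, -2.0, 0.0, 2.0, 4.0])
Y0    = fx([0.018, 0.119, 0.5, 0.881, 0.982])
SLOPE = fx([0.0505, 0.1905, 0.1905, 0.0505, 0.0])

def pwl_sigmoid(x_fx):
    """Integer-only piecewise-linear sigmoid for fixed-point inputs."""
    x_fx = np.clip(x_fx, X0[0], X0[-1])
    seg = np.clip(np.searchsorted(X0, x_fx, side="right") - 1, 0, len(X0) - 1)
    # y = y0 + slope * (x - x0), rescaling the product back down by FRAC_BITS.
    return Y0[seg] + ((SLOPE[seg].astype(np.int64) * (x_fx - X0[seg])) >> FRAC_BITS)

x = fx([-3.0, 0.5, 2.5])
print(pwl_sigmoid(x) / SCALE)           # ~[0.07, 0.59, 0.91]; exact sigmoid ~[0.05, 0.62, 0.92]
```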
arXiv Detail & Related papers (2022-12-22T15:22:36Z)
- ZippyPoint: Fast Interest Point Detection, Description, and Matching through Mixed Precision Discretization [71.91942002659795]
We investigate and adapt network quantization techniques to accelerate inference and enable its use on compute limited platforms.
ZippyPoint, our efficient quantized network with binary descriptors, improves the network runtime speed, the descriptor matching speed, and the 3D model size.
These improvements come at the cost of a minor performance degradation, as evaluated on the tasks of homography estimation, visual localization, and map-free visual relocalization.
arXiv Detail & Related papers (2022-03-07T18:59:03Z)
- SignalNet: A Low Resolution Sinusoid Decomposition and Estimation Network [79.04274563889548]
We propose SignalNet, a neural network architecture that detects the number of sinusoids and estimates their parameters from quantized in-phase and quadrature samples.
We introduce a worst-case learning threshold for comparing the results of our network relative to the underlying data distributions.
In simulation, we find that our algorithm is always able to surpass the threshold for three-bit data but often cannot exceed the threshold for one-bit data.
arXiv Detail & Related papers (2021-06-10T04:21:20Z)
- On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data.
We obtain word-level confidence scores by utilizing several types of features calculated during decoding.
The proposed time-stamping method achieves an average word-timing difference of less than 50 ms.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
- Manifold Regularized Dynamic Network Pruning [102.24146031250034]
This paper proposes a new paradigm that dynamically removes redundant filters by embedding the manifold information of all instances into the space of pruned networks.
The effectiveness of the proposed method is verified on several benchmarks, which shows better performance in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2021-03-10T03:59:03Z)
- Accelerating Neural Network Inference by Overflow Aware Quantization [16.673051600608535]
The inherent heavy computation of deep neural networks prevents their widespread application.
We propose an overflow aware quantization method by designing trainable adaptive fixed-point representation.
With the proposed method, we are able to fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance.
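A simplified sketch of the overflow-aware idea (the paper's fixed-point representation is trainable and adaptive; the per-tensor rule and function below are only illustrative assumptions): choose the number of fractional bits so that a worst-case dot-product accumulation cannot exceed the accumulator range.

```python
import numpy as np

def quantize_overflow_aware(w, x_max, bits=8, acc_bits=16):
    """Pick a fractional bit-width for the weights of one dot product so that
    accumulation with inputs bounded by |x| <= x_max cannot overflow an
    `acc_bits`-bit signed accumulator. Illustrative only."""
    n = w.shape[0]                          # number of terms accumulated per output
    w_max = np.abs(w).max()
    f = bits - 1
    # Approximate worst case: n * (w_max * 2^f) * x_max must stay below 2^(acc_bits-1).
    while f > 0 and n * (w_max * (1 << f)) * x_max >= (1 << (acc_bits - 1)):
        f -= 1                              # drop fractional bits until it fits
    q = np.clip(np.round(w * (1 << f)), -(1 << (bits - 1)), (1 << (bits - 1)) - 1)
    return q.astype(np.int32), f

w = np.random.default_rng(1).normal(scale=0.1, size=64)
q, frac = quantize_overflow_aware(w, x_max=127)       # e.g. int8-style activations
print(frac, q[:4])
```

Fewer fractional bits avoid overflow but increase quantization loss; balancing that trade-off is what the trainable representation is designed to do.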
arXiv Detail & Related papers (2020-05-27T11:56:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.