Adaptive Computation Modules: Granular Conditional Computation For
Efficient Inference
- URL: http://arxiv.org/abs/2312.10193v1
- Date: Fri, 15 Dec 2023 20:39:43 GMT
- Title: Adaptive Computation Modules: Granular Conditional Computation For
Efficient Inference
- Authors: Bartosz Wójcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini,
Simone Scardapane
- Abstract summary: The computational cost of transformer models makes them inefficient in low-latency or low-power applications.
We introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis.
Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading downstream accuracy across a wide range of user-defined budgets.
- Score: 13.000030080938078
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The computational cost of transformer models makes them inefficient in
low-latency or low-power applications. While techniques such as quantization or
linear attention can reduce the computational load, they may incur a reduction
in accuracy. In addition, globally reducing the cost for all inputs may be
sub-optimal. We observe that for each layer, the full width of the layer may be
needed only for a small subset of tokens inside a batch and that the
"effective" width needed to process a token can vary from layer to layer.
Motivated by this observation, we introduce the Adaptive Computation Module
(ACM), a generic module that dynamically adapts its computational load to match
the estimated difficulty of the input on a per-token basis. An ACM consists of
a sequence of learners that progressively refine the output of their preceding
counterparts. An additional gating mechanism determines the optimal number of
learners to execute for each token. We also describe a distillation technique
to replace any pre-trained model with an "ACMized" variant. The distillation
phase is designed to be highly parallelizable across layers while remaining
simple to plug into existing networks. Our evaluation of transformer models
in computer vision and speech recognition demonstrates that substituting layers
with ACMs significantly reduces inference costs without degrading downstream
accuracy across a wide range of user-defined budgets.
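The module structure described above can be pictured with a short PyTorch-style sketch. This is a minimal illustration under assumed layer sizes and names, not the authors' implementation: an ACM holds a stack of small learners whose outputs are accumulated, and a lightweight gate predicts, per token, how many of them to execute.

```python
import torch
import torch.nn as nn

class ACM(nn.Module):
    """Minimal sketch of an Adaptive Computation Module: a sequence of small
    learners whose outputs are accumulated, plus a gate that picks, per token,
    how many learners to run. Layer sizes and names are illustrative."""

    def __init__(self, dim: int, num_learners: int = 4, hidden: int = 256):
        super().__init__()
        self.learners = nn.ModuleList([
            nn.Sequential(nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim))
            for _ in range(num_learners)
        ])
        # The gate scores executing 0..num_learners learners for each token.
        self.gate = nn.Linear(dim, num_learners + 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        k = self.gate(x).argmax(dim=-1)                 # learners requested per token
        out = torch.zeros_like(x)
        for i, learner in enumerate(self.learners):
            active = (k > i).unsqueeze(-1).to(x.dtype)  # tokens still being refined
            # Each learner adds a correction on top of what the previous ones produced.
            # (For clarity this masks instead of actually skipping inactive tokens.)
            out = out + active * learner(x)
        return out
```

At inference time a hard per-token decision like the argmax above suffices; during training the gate would additionally need a differentiable relaxation and a budget term so that the average number of executed learners matches the user-defined cost target.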
Related papers
- SLaNC: Static LayerNorm Calibration [1.2016264781280588]
Quantization to lower precision formats naturally poses a number of challenges caused by the limited range of the available value representations.
In this article, we propose a computationally efficient scaling technique that can be easily applied to Transformer models during inference.
Our method suggests a straightforward way of scaling the LayerNorm inputs based on the static weights of the immediately preceding linear layers.
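Read literally, that suggests a drop-in wrapper along the lines of the sketch below, where the scale is taken from the Frobenius norm of the preceding linear layer's weights; the exact statistic SLaNC derives may differ, and the names here are placeholders.

```python
import torch
import torch.nn as nn

def static_scale(prev_linear: nn.Linear) -> float:
    """Placeholder statistic: the Frobenius norm of the weights of the linear
    layer that feeds the LayerNorm. SLaNC's exact choice may differ."""
    return prev_linear.weight.detach().norm().item()

class ScaledLayerNorm(nn.Module):
    """LayerNorm applied to an input divided by a precomputed static scale.
    Up to the epsilon term, LayerNorm is invariant to positive rescaling of
    its input, so the output is unchanged while intermediate values stay
    inside the narrow range of low-precision formats."""

    def __init__(self, dim: int, scale: float):
        super().__init__()
        self.ln = nn.LayerNorm(dim)
        self.scale = scale

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.ln(x / self.scale)

# usage: norm = ScaledLayerNorm(d_model, static_scale(preceding_linear))
```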
arXiv Detail & Related papers (2024-10-14T14:32:55Z)
- Accelerating Error Correction Code Transformers [56.75773430667148]
We introduce a novel acceleration method for transformer-based decoders.
We achieve a 90% compression ratio and reduce the energy consumption of arithmetic operations by at least 224 times on modern hardware.
arXiv Detail & Related papers (2024-10-08T11:07:55Z)
- Duo-LLM: A Framework for Studying Adaptive Computation in Large Language Models [16.16372459671255]
Large Language Models (LLMs) typically generate outputs token by token using a fixed compute budget.
We propose a novel framework that integrates smaller auxiliary modules within each Feed-Forward Network layer of the LLM.
We show that trained routers operate differently from oracles and often yield suboptimal solutions.
arXiv Detail & Related papers (2024-10-01T16:10:21Z)
- Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
We show that a novel fast approximation method can calculate the gradients in almost linear time.
By improving the efficiency of gradient computation, we hope that this work will facilitate more effective training and deployment of long-context language models.
arXiv Detail & Related papers (2024-08-23T17:16:43Z)
- TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [6.341420717393898]
Vision transformers (ViTs) have achieved promising results on a variety of computer vision tasks.
Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers.
We propose a novel token propagation controller (TPC) that incorporates two different token distributions.
arXiv Detail & Related papers (2024-01-03T00:10:33Z)
- Incrementally-Computable Neural Networks: Efficient Inference for Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
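For token-wise layers such as the feed-forward block, this kind of reuse is easy to picture; the sketch below is an illustration under assumed names rather than the paper's algorithm (which also has to handle attention, where tokens interact): it caches the previous input and output and recomputes only the tokens that changed.

```python
import torch
import torch.nn as nn

class IncrementalTokenwiseLayer(nn.Module):
    """Wrap a token-wise layer (e.g. an FFN) so repeated calls recompute only
    the tokens whose inputs changed; cost then scales with the fraction of
    modified inputs. Illustrative only; attention needs more machinery."""

    def __init__(self, layer: nn.Module):
        super().__init__()
        self.layer = layer
        self.prev_in = None    # cached input from the previous call, (..., tokens, dim)
        self.prev_out = None   # cached output from the previous call

    @torch.no_grad()
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.prev_in is None or self.prev_in.shape != x.shape:
            out = self.layer(x)                           # first call: full compute
        else:
            changed = ~torch.isclose(x, self.prev_in).all(dim=-1)
            out = self.prev_out.clone()
            if changed.any():
                out[changed] = self.layer(x[changed])     # recompute modified tokens only
        self.prev_in, self.prev_out = x.detach().clone(), out
        return out
```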
arXiv Detail & Related papers (2023-07-27T16:30:27Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Unifying Synergies between Self-supervised Learning and Dynamic Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and a gated sub-network from scratch in an SSL setting.
The co-evolution of the dense and gated encoders during pre-training offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z)
- Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $\mathcal{O}(D^3)$ to $\mathcal{O}(D^2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
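A rough sketch of that idea for a single linear flow layer follows. The `W`/`R` names and the reconstruction penalty are illustrative assumptions, and the actual substitution of the learned inverse into the log-determinant gradient would live in a custom backward pass that this sketch omits.

```python
import torch
import torch.nn as nn

class SelfNormalizingLinear(nn.Module):
    """A linear flow z = W x paired with a learned approximate inverse R ~= W^{-1}.
    The exact log-determinant gradient needs W^{-T} (an O(D^3) inverse); reusing
    R^T instead costs only O(D^2) per update. Illustrative sketch only."""

    def __init__(self, dim: int):
        super().__init__()
        eye = torch.eye(dim)
        self.W = nn.Parameter(eye + 0.01 * torch.randn(dim, dim))
        self.R = nn.Parameter(eye + 0.01 * torch.randn(dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # z = W x
        return x @ self.W.T

    def inverse(self, z: torch.Tensor) -> torch.Tensor:    # x ~= R z
        return z @ self.R.T

    def inverse_penalty(self, x: torch.Tensor) -> torch.Tensor:
        # Keeps R close to the true inverse of W by penalising reconstruction error.
        return ((self.inverse(self.forward(x)) - x) ** 2).mean()
```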
arXiv Detail & Related papers (2020-11-14T09:51:51Z)
- Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes accuracy and minimizes the efficiency metric (i.e., the computational cost) under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
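As a rough picture of the search described in the Length-Adaptive Transformer entry above, the toy loop below mutates per-layer kept-token counts and keeps the most accurate configuration that fits a FLOPs budget. The `evaluate` callback is an assumption, and the real method runs a multi-objective (Pareto) evolutionary search rather than this single-objective simplification.

```python
import random

def length_search(evaluate, num_layers: int, max_len: int, flops_budget: float,
                  generations: int = 20, population: int = 16):
    """Toy single-objective stand-in for the evolutionary length search.
    A configuration is one kept-token count per layer, non-increasing in depth."""

    def mutate(cfg):
        new = cfg[:]
        i = random.randrange(num_layers)
        new[i] = max(1, min(max_len, new[i] + random.choice([-8, -4, 4, 8])))
        for j in range(1, num_layers):            # enforce non-increasing lengths
            new[j] = min(new[j], new[j - 1])
        return new

    best_cfg, best_acc = None, float("-inf")
    pool = [[max_len] * num_layers for _ in range(population)]
    for _ in range(generations):
        pool = [mutate(cfg) for cfg in pool]
        for cfg in pool:
            acc, flops = evaluate(cfg)            # assumed user-supplied callback
            if flops <= flops_budget and acc > best_acc:
                best_cfg, best_acc = cfg, acc
    return best_cfg, best_acc
```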
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.