Adaptive Computation Modules: Granular Conditional Computation For
Efficient Inference
- URL: http://arxiv.org/abs/2312.10193v1
- Date: Fri, 15 Dec 2023 20:39:43 GMT
- Title: Adaptive Computation Modules: Granular Conditional Computation For
Efficient Inference
- Authors: Bartosz W\'ojcik, Alessio Devoto, Karol Pustelnik, Pasquale Minervini,
Simone Scardapane
- Abstract summary: computational cost of transformer models makes them inefficient in low-latency or low-power applications.
We introduce the Adaptive Computation Module (ACM), a generic module that dynamically adapts its computational load to match the estimated difficulty of the input on a per-token basis.
Our evaluation of transformer models in computer vision and speech recognition demonstrates that substituting layers with ACMs significantly reduces inference costs without degrading the downstream accuracy for a wide interval of user-defined budgets.
- Score: 13.000030080938078
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: The computational cost of transformer models makes them inefficient in
low-latency or low-power applications. While techniques such as quantization or
linear attention can reduce the computational load, they may incur a reduction
in accuracy. In addition, globally reducing the cost for all inputs may be
sub-optimal. We observe that for each layer, the full width of the layer may be
needed only for a small subset of tokens inside a batch and that the
"effective" width needed to process a token can vary from layer to layer.
Motivated by this observation, we introduce the Adaptive Computation Module
(ACM), a generic module that dynamically adapts its computational load to match
the estimated difficulty of the input on a per-token basis. An ACM consists of
a sequence of learners that progressively refine the output of their preceding
counterparts. An additional gating mechanism determines the optimal number of
learners to execute for each token. We also describe a distillation technique
to replace any pre-trained model with an "ACMized" variant. The distillation
phase is designed to be highly parallelizable across layers while being simple
to plug-and-play into existing networks. Our evaluation of transformer models
in computer vision and speech recognition demonstrates that substituting layers
with ACMs significantly reduces inference costs without degrading the
downstream accuracy for a wide interval of user-defined budgets.
Related papers
- Adaptive Draft-Verification for Efficient Large Language Model Decoding [24.347886232342862]
Large language model (LLM) decoding involves generating a sequence of tokens based on a given context.
The typical autoregressive decoding method requires a separate forward pass through the model for each token generated.
We introduce ADED, which accelerates LLM decoding without requiring fine-tuning.
arXiv Detail & Related papers (2024-06-27T22:20:39Z) - TPC-ViT: Token Propagation Controller for Efficient Vision Transformer [6.341420717393898]
Vision transformers (ViTs) have achieved promising results on a variety of Computer Vision tasks.
Previous approaches that employ gradual token reduction to address this challenge assume that token redundancy in one layer implies redundancy in all the following layers.
We propose a novel token propagation controller (TPC) that incorporates two different token-distributions.
arXiv Detail & Related papers (2024-01-03T00:10:33Z) - Incrementally-Computable Neural Networks: Efficient Inference for
Dynamic Inputs [75.40636935415601]
Deep learning often faces the challenge of efficiently processing dynamic inputs, such as sensor data or user inputs.
We take an incremental computing approach, looking to reuse calculations as the inputs change.
We apply this approach to the transformers architecture, creating an efficient incremental inference algorithm with complexity proportional to the fraction of modified inputs.
arXiv Detail & Related papers (2023-07-27T16:30:27Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - Unifying Synergies between Self-supervised Learning and Dynamic
Computation [53.66628188936682]
We present a novel perspective on the interplay between SSL and DC paradigms.
We show that it is feasible to simultaneously learn a dense and gated sub-network from scratch in a SSL setting.
The co-evolution during pre-training of both dense and gated encoder offers a good accuracy-efficiency trade-off.
arXiv Detail & Related papers (2023-01-22T17:12:58Z) - QuaLA-MiniLM: a Quantized Length Adaptive MiniLM [5.36703735486629]
Limited computational budgets often prevent transformers from being used in production and from having their high accuracy utilized.
A knowledge distillation approach addresses the computational efficiency by self-distilling BERT into a smaller transformer representation having fewer layers and smaller internal embedding.
Dynamic-TinyBERT tackles both limitations by partially implementing the Length Adaptive Transformer (LAT) technique onto TinyBERT, achieving x3 speedup over BERT-base with minimal accuracy loss.
We use MiniLM distillation jointly with the LAT method, and we further enhance the efficiency by applying low-bit quantization.
arXiv Detail & Related papers (2022-10-31T07:42:52Z) - Accelerating Part-Scale Simulation in Liquid Metal Jet Additive
Manufacturing via Operator Learning [0.0]
Part-scale predictions require many small-scale simulations.
A model describing droplet coalescence for LMJ may include coupled incompressible fluid flow, heat transfer, and phase change equations.
We apply an operator learning approach to learn a mapping between initial and final states of the droplet coalescence process.
arXiv Detail & Related papers (2022-02-02T17:24:16Z) - Adaptive Fourier Neural Operators: Efficient Token Mixers for
Transformers [55.90468016961356]
We propose an efficient token mixer that learns to mix in the Fourier domain.
AFNO is based on a principled foundation of operator learning.
It can handle a sequence size of 65k and outperforms other efficient self-attention mechanisms.
arXiv Detail & Related papers (2021-11-24T05:44:31Z) - Self Normalizing Flows [65.73510214694987]
We propose a flexible framework for training normalizing flows by replacing expensive terms in the gradient by learned approximate inverses at each layer.
This reduces the computational complexity of each layer's exact update from $mathcalO(D3)$ to $mathcalO(D2)$.
We show experimentally that such models are remarkably stable and optimize to similar data likelihood values as their exact gradient counterparts.
arXiv Detail & Related papers (2020-11-14T09:51:51Z) - Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime
with Search [84.94597821711808]
We extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training.
We conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget.
We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups.
arXiv Detail & Related papers (2020-10-14T12:28:08Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.