You Need Multiple Exiting: Dynamic Early Exiting for Accelerating
Unified Vision Language Model
- URL: http://arxiv.org/abs/2211.11152v2
- Date: Mon, 3 Apr 2023 06:41:13 GMT
- Title: You Need Multiple Exiting: Dynamic Early Exiting for Accelerating
Unified Vision Language Model
- Authors: Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li,
Caiwen Ding, Yanzhi Wang, Yi Liang, Dongkuan Xu
- Abstract summary: Large-scale Transformer models bring significant improvements for various downstream vision language tasks with a unified architecture.
Performance improvements come with increasing model size, resulting in slow inference speed and increased serving cost.
We propose a novel early exiting strategy for unified vision-language models, which dynamically skips layers in both the encoder and the decoder simultaneously.
- Score: 37.24203191658052
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale Transformer models bring significant improvements for various
downstream vision-language tasks with a unified architecture. These performance
improvements come with increasing model size, resulting in slow inference speed
and increased serving cost. While certain predictions benefit from the full
complexity of the large-scale model, not all inputs need the same amount of
computation, so a fixed-depth model can waste computational resources. To handle
this challenge, early exiting adaptively allocates computational power according
to input complexity to improve inference efficiency. Existing early exiting
strategies usually use the output confidence of intermediate layers as a proxy
for input complexity when deciding whether to skip the following layers. However,
such strategies cannot be applied to the encoder of the widely used unified
encoder-decoder architecture, because output confidence is difficult to estimate
in the encoder, and ignoring early exiting in the encoder is suboptimal for
saving computation. To handle this challenge, we propose a novel early exiting
strategy for unified vision-language models, named \textbf{MuE}, which
dynamically skips layers in both the encoder and the decoder simultaneously
based on layer-wise input similarities, allowing multiple opportunities to exit
early. By decomposing the image and text modalities in the encoder, MuE is
flexible and can skip a different number of layers per modality, improving
inference efficiency while minimizing the performance drop. Experiments on the
SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce
expected inference time by up to 50\% and 40\% while maintaining 99\% and 96\%
of performance, respectively.
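The abstract describes the exit criterion only at a high level. Below is a minimal, illustrative PyTorch-style sketch of layer-wise similarity-based exiting for a single modality stream: the stack stops once consecutive layer outputs barely change. The class name, threshold value, and the cosine-similarity saturation rule as written are assumptions made for illustration, not the paper's released implementation.

```python
# Minimal, hypothetical sketch of similarity-based early exiting for one
# modality stream (image or text). Not the authors' released code: the class
# name, threshold, and exact saturation rule are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimilarityExitStack(nn.Module):
    """Runs a stack of Transformer layers, exiting once representations saturate."""

    def __init__(self, layers: nn.ModuleList, threshold: float = 0.95):
        super().__init__()
        self.layers = layers        # e.g. nn.TransformerEncoderLayer(..., batch_first=True)
        self.threshold = threshold  # exit when consecutive outputs are this similar

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, dim) token features for a single modality
        for layer in self.layers:
            new_hidden = layer(hidden)
            # Per-token cosine similarity between consecutive layer outputs,
            # averaged over tokens and batch; high similarity means the
            # representation has stopped changing, so later layers add little.
            saturation = F.cosine_similarity(new_hidden, hidden, dim=-1).mean()
            hidden = new_hidden
            if saturation > self.threshold:
                break  # skip all remaining layers for this input
        return hidden
```

Because MuE decomposes the image and text modalities in the encoder, each modality can run a loop like this over its own layer stack with its own threshold, so an easy image may exit after a few layers while its caption continues deeper; the same saturation check can be applied again in the decoder.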
Related papers
- SparseTem: Boosting the Efficiency of CNN-Based Video Encoders by Exploiting Temporal Continuity [15.872209884833977]
We propose a memory-efficient scheduling method to eliminate memory overhead and an online adjustment mechanism to minimize accuracy degradation.
SparseTem achieves speedup of 1.79x for EfficientDet and 4.72x for CRNN, with minimal accuracy drop and no additional memory overhead.
arXiv Detail & Related papers (2024-10-28T07:13:25Z)
- Dynamic layer selection in decoder-only transformers [21.18795712840146]
We empirically examine two common dynamic inference methods for natural language generation.
We find that a pre-trained decoder-only model is significantly more robust to layer removal via layer skipping.
We also show that dynamic computation allocation on a per-sequence basis holds promise for significant efficiency gains.
arXiv Detail & Related papers (2024-10-26T00:44:11Z)
- Lean Attention: Hardware-Aware Scalable Attention Mechanism for the Decode-Phase of Transformers [4.674454841332859]
Transformer-based models have emerged as one of the most widely used architectures for natural language processing.
These huge models are memory hungry and incur significant inference latency even on cutting edge AI-accelerators.
We propose LeanAttention, a scalable technique of computing self-attention for the token-generation phase.
arXiv Detail & Related papers (2024-05-17T00:52:39Z)
- Efficient Transformer Encoders for Mask2Former-style models [57.54752243522298]
ECO-M2F is a strategy to self-select the number of hidden layers in the encoder conditioned on the input image.
The proposed approach reduces expected encoder computational cost while maintaining performance.
It is flexible in architecture configurations, and can be extended beyond the segmentation task to object detection.
arXiv Detail & Related papers (2024-04-23T17:26:34Z)
- Progressive Token Length Scaling in Transformer Encoders for Efficient Universal Segmentation [67.85309547416155]
A powerful architecture for universal segmentation relies on transformers that encode multi-scale image features and decode object queries into mask predictions.
Mask2Former spends 50% of its compute on the transformer encoder alone.
This is due to the retention of a full-length token-level representation of all backbone feature scales at each encoder layer.
We propose PRO-SCALE to reduce computations by a large margin with minimal sacrifice in performance.
arXiv Detail & Related papers (2024-04-23T01:34:20Z)
- Extreme Encoder Output Frame Rate Reduction: Improving Computational Latencies of Large End-to-End Models [59.57732929473519]
We apply multiple frame reduction layers in the encoder to compress encoder outputs into a small number of output frames.
We demonstrate that we can generate one encoder output frame for every 2.56 sec of input speech, without significantly affecting word error rate on a large-scale voice search task.
arXiv Detail & Related papers (2024-02-27T03:40:44Z)
- Decoder Tuning: Efficient Language Understanding as Decoding [84.68266271483022]
We present Decoder Tuning (DecT), which instead optimizes task-specific decoder networks on the output side.
With gradient-based optimization, DecT can be trained within several seconds and requires only one PLM query per sample.
We conduct extensive natural language understanding experiments and show that DecT significantly outperforms state-of-the-art algorithms with a $200\times$ speed-up.
arXiv Detail & Related papers (2022-12-16T11:15:39Z)
- Confident Adaptive Language Modeling [95.45272377648773]
CALM is a framework for dynamically allocating different amounts of compute per input and generation timestep.
We demonstrate the efficacy of our framework in reducing compute -- potential speedup of up to $3\times$ -- while provably maintaining high performance.
arXiv Detail & Related papers (2022-07-14T17:00:19Z)
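For contrast with MuE's similarity criterion, the following is a minimal, hypothetical sketch of the confidence-based exit rule that decoder-side methods such as CALM (above) apply per generation timestep: an intermediate prediction head produces a token distribution after each decoder layer, and the step stops once the top probability clears a threshold. The function, head, and threshold here are illustrative assumptions, not CALM's actual calibrated procedure.

```python
# Hypothetical sketch of confidence-based early exiting at one decoding step.
# Simplified: cross-attention/memory arguments to the decoder layers are omitted.
import torch
import torch.nn as nn
import torch.nn.functional as F


def decode_step_with_confidence_exit(
    layers: nn.ModuleList,      # decoder layers
    lm_head: nn.Linear,         # prediction head over the vocabulary
    hidden: torch.Tensor,       # (batch, seq_len, dim) decoder input states
    threshold: float = 0.9,
) -> torch.Tensor:
    """Return next-token logits, exiting early once confidence is high."""
    for layer in layers:
        hidden = layer(hidden)
        logits = lm_head(hidden[:, -1])                       # predict from last position
        confidence = F.softmax(logits, dim=-1).max(dim=-1).values
        if bool((confidence > threshold).all()):
            break                                             # skip remaining decoder layers
    return logits
```

In an encoder there is no output distribution to threshold, so this confidence signal is unavailable; that gap is exactly what the layer-wise similarity criterion sketched earlier is meant to fill.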