MIA-Former: Efficient and Robust Vision Transformers via Multi-grained
Input-Adaptation
- URL: http://arxiv.org/abs/2112.11542v1
- Date: Tue, 21 Dec 2021 22:06:24 GMT
- Title: MIA-Former: Efficient and Robust Vision Transformers via Multi-grained
Input-Adaptation
- Authors: Zhongzhi Yu, Yonggan Fu, Sicheng Li, Chaojian Li, Yingyan Lin
- Abstract summary: Vision Transformer (ViT) models are too computationally expensive to be fitted onto real-world resource-constrained devices.
We propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs.
Experiments and ablation studies validate that the proposed MIA-Former framework can effectively allocate computation budgets adaptively to the difficulty of input images.
- Score: 14.866949449862226
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: ViTs are often too computationally expensive to be fitted onto real-world
resource-constrained devices, due to (1) their quadratically increasing
complexity with the number of input tokens and (2) their overparameterized
self-attention heads and model depth. In parallel, different images are of
varied complexity and their different regions can contain various levels of
visual information, indicating that treating all regions/tokens equally in
terms of model complexity is unnecessary, yet such opportunities for trimming
down ViTs' complexity have not been fully explored. To this end, we propose a
Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former
that can input-adaptively adjust the structure of ViTs at three
coarse-to-fine-grained granularities (i.e., model depth and the number of model
heads/tokens). In particular, our MIA-Former adopts a low-cost network trained
with a hybrid supervised and reinforcement training method to skip unnecessary
layers, heads, and tokens in an input-adaptive manner, reducing the overall
computational cost. Furthermore, an interesting side effect of our MIA-Former
is that its resulting ViTs are naturally equipped with improved robustness
against adversarial attacks over their static counterparts, because
MIA-Former's multi-grained dynamic control improves model diversity, similar
to the effect of an ensemble, and thus increases the difficulty of adversarial
attacks against all its sub-models. Extensive experiments and ablation studies
validate that the proposed MIA-Former framework can effectively allocate
computation budgets adaptively to the difficulty of input images while
increasing robustness, achieving state-of-the-art (SOTA) accuracy-efficiency
trade-offs, e.g., 20% computation savings with the same or even a higher
accuracy compared with SOTA dynamic transformer models.
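To make the skipping mechanism concrete, below is a minimal PyTorch-style sketch of an input-adaptive controller in the spirit of MIA-Former. The module names, the 128-dimensional hidden size, the 0.5 thresholds, and the median-based token selection are all illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

class AdaptiveController(nn.Module):
    """Hypothetical low-cost controller producing per-input keep decisions
    for the layers, attention heads, and tokens of a ViT backbone."""

    def __init__(self, embed_dim: int, num_layers: int, num_heads: int):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(embed_dim, 128), nn.GELU())
        self.layer_gate = nn.Linear(128, num_layers)             # one logit per layer
        self.head_gate = nn.Linear(128, num_layers * num_heads)  # one logit per (layer, head)
        self.token_scorer = nn.Linear(embed_dim, 1)              # informativeness score per token
        self.num_layers, self.num_heads = num_layers, num_heads

    def forward(self, tokens):
        # tokens: (B, N, D); the mean token serves as a cheap global image descriptor
        feat = self.backbone(tokens.mean(dim=1))                 # (B, 128)
        layer_keep = torch.sigmoid(self.layer_gate(feat)) > 0.5  # (B, L) booleans
        head_keep = torch.sigmoid(
            self.head_gate(feat).view(-1, self.num_layers, self.num_heads)
        ) > 0.5                                                  # (B, L, H) booleans
        scores = self.token_scorer(tokens).squeeze(-1)           # (B, N)
        token_keep = scores > scores.median(dim=1, keepdim=True).values  # keep the upper half
        return layer_keep, head_keep, token_keep

# Example: decisions for a ViT-Tiny-like backbone (12 layers, 3 heads, 192-dim tokens)
controller = AdaptiveController(embed_dim=192, num_layers=12, num_heads=3)
layer_keep, head_keep, token_keep = controller(torch.randn(2, 197, 192))

Because the hard keep/skip decisions above are non-differentiable, MIA-Former trains its controller with a hybrid supervised and reinforcement learning scheme rather than plain backpropagation.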
Related papers
- DyMU: Dynamic Merging and Virtual Unmerging for Efficient VLMs [124.52164183968145]
We present DyMU, an efficient, training-free framework that reduces the computational burden of vision-language models (VLMs).
Our approach comprises two key components. First, Dynamic Token Merging (DToMe) reduces the number of visual token embeddings by merging similar tokens based on image complexity.
Second, Virtual Token Unmerging (VTU) simulates the expected token sequence for large language models (LLMs) by efficiently reconstructing the attention dynamics of a full sequence.
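As a rough illustration of the token-merging idea only (not DyMU's actual algorithm or API), the sketch below greedily averages the most similar pairs of visual tokens; in a DToMe-like setting the number of merges would be chosen per image according to its complexity.

import torch
import torch.nn.functional as F

def merge_similar_tokens(x, num_merges):
    """x: (N, D) visual tokens; greedily merge the `num_merges` most similar pairs."""
    tokens = [t for t in x]
    for _ in range(num_merges):
        if len(tokens) < 2:
            break
        stacked = torch.stack(tokens)                              # (M, D)
        sim = F.cosine_similarity(stacked.unsqueeze(1), stacked.unsqueeze(0), dim=-1)
        sim.fill_diagonal_(float("-inf"))                          # ignore self-similarity
        i, j = divmod(int(sim.argmax()), sim.size(1))              # most similar pair
        merged = (tokens[i] + tokens[j]) / 2                       # average the pair
        tokens = [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]
    return torch.stack(tokens)

# Example: halve a 196-token sequence
compact = merge_similar_tokens(torch.randn(196, 768), num_merges=98)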
arXiv Detail & Related papers (2025-04-23T18:38:18Z)
- AdaViT: Adaptive Vision Transformer for Flexible Pretrain and Finetune with Variable 3D Medical Image Modalities [9.006543373916314]
We propose an adaptive Vision Transformer (AdaViT) framework capable of handling a variable set of input modalities for each case.
We demonstrate that this architecture effectively transfers supervised pretrained models to new datasets with different input modality/contrast sets.
For self-supervised pretraining, the proposed method is able to maximize the pretraining data and facilitate transfer to diverse downstream tasks with variable sets of input modalities.
arXiv Detail & Related papers (2025-04-04T16:57:06Z)
- Transformer Meets Twicing: Harnessing Unattended Residual Information [2.1605931466490795]
Transformer-based deep learning models have achieved state-of-the-art performance across numerous language and vision tasks.
While the self-attention mechanism has proven capable of handling complex data patterns, it has been observed that the representational capacity of the attention matrix degrades significantly across transformer layers.
We propose Twicing Attention, a novel attention mechanism that uses the kernel twicing procedure from nonparametric regression to alleviate the low-pass behavior of the associated NLM smoothing.
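For intuition, the classical twicing estimator from nonparametric regression smooths the data once and then adds back a smoothed version of the residual; for a row-stochastic attention matrix A and value matrix V this amounts to replacing AV with (2A - A^2)V. The snippet below is a hedged sketch of that idea, not the paper's implementation.

import torch

def twicing_attention(attn, values):
    """attn: (N, N) row-stochastic attention matrix; values: (N, D)."""
    once = attn @ values              # standard attention output A V
    residual = values - once          # what the first smoothing pass missed
    return once + attn @ residual     # equals (2A - A @ A) @ V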
arXiv Detail & Related papers (2025-03-02T01:56:35Z)
- AdapMTL: Adaptive Pruning Framework for Multitask Learning Model [5.643658120200373]
AdapMTL is an adaptive pruning framework for multitask models.
It balances sparsity allocation and accuracy performance across multiple tasks.
It showcases superior performance compared to state-of-the-art pruning methods.
arXiv Detail & Related papers (2024-08-07T17:19:15Z)
- Multi-layer Learnable Attention Mask for Multimodal Tasks [2.378535917357144]
A Learnable Attention Mask (LAM) is strategically designed to globally regulate attention maps and prioritize critical tokens.
LAM adeptly captures associations between tokens in a BERT-like transformer network.
The approach is validated through comprehensive experiments on various datasets, such as MADv2, QVHighlights, ImageNet 1K, and MSRVTT.
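A minimal sketch of the general idea, assuming the mask is realized as a trainable additive bias on the attention logits; the class name and shapes here are hypothetical rather than the paper's API.

import torch
import torch.nn as nn

class LearnableAttentionMask(nn.Module):
    """Trainable bias added to attention logits so the model can globally
    prioritize or down-weight token positions."""

    def __init__(self, max_len: int):
        super().__init__()
        self.mask = nn.Parameter(torch.zeros(max_len, max_len))

    def forward(self, attn_logits):
        # attn_logits: (B, H, N, N); add the learned bias before the softmax
        n = attn_logits.size(-1)
        return attn_logits + self.mask[:n, :n]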
arXiv Detail & Related papers (2024-06-04T20:28:02Z)
- MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild [81.32127423981426]
Multimodal emotion recognition based on audio and video data is important for real-world applications.
Recent methods have focused on exploiting advances of self-supervised learning (SSL) for pre-training of strong multimodal encoders.
We propose a different perspective on the problem and investigate the advancement of multimodal DFER performance by adapting SSL-pre-trained disjoint unimodal encoders.
arXiv Detail & Related papers (2024-04-13T13:39:26Z)
- MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer [66.71930982549028]
Vision-Language Transformers (VLTs) have shown great success recently, but are accompanied by heavy computation costs.
We propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs.
arXiv Detail & Related papers (2024-03-05T14:13:50Z)
- FuseMoE: Mixture-of-Experts Transformers for Fleximodal Fusion [29.130355774088205]
FuseMoE is a mixture-of-experts framework incorporating an innovative gating function.
Designed to integrate a diverse number of modalities, FuseMoE is effective in managing scenarios with missing modalities and irregularly sampled data trajectories.
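A hedged sketch of a mixture-of-experts fusion layer whose gate simply ignores absent modalities; all names, shapes, and the mean-pooling choice are illustrative assumptions rather than FuseMoE's actual gating function.

import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    def __init__(self, dim: int, num_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, x, present):
        # x: (B, M, D) per-modality features; present: (B, M) mask, 1 where a modality is observed
        mask = present.float().unsqueeze(-1)                             # (B, M, 1)
        pooled = (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)    # mean over observed modalities
        weights = torch.softmax(self.gate(pooled), dim=-1)               # (B, E) expert mixture weights
        outputs = torch.stack([e(pooled) for e in self.experts], dim=1)  # (B, E, D)
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)              # (B, D) fused representation

# Example: fuse three modalities where the second one is missing for every sample
fused = MoEFusion(dim=256, num_experts=4)(torch.randn(8, 3, 256), torch.tensor([[1, 0, 1]] * 8))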
arXiv Detail & Related papers (2024-02-05T17:37:46Z)
- Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer [54.32283739486781]
We present a Forgery-aware Adaptive Vision Transformer (FA-ViT) under the adaptive learning paradigm.
FA-ViT achieves 93.83% and 78.32% AUC scores on Celeb-DF and DFDC datasets in the cross-dataset evaluation.
arXiv Detail & Related papers (2023-09-20T06:51:11Z)
- Efficient Multimodal Transformer with Dual-Level Feature Restoration for Robust Multimodal Sentiment Analysis [47.29528724322795]
Multimodal Sentiment Analysis (MSA) has attracted increasing attention recently.
Despite significant progress, there are still two major challenges on the way towards robust MSA.
We propose a generic and unified framework to address them, named Efficient Multimodal Transformer with Dual-Level Feature Restoration (EMT-DLFR).
arXiv Detail & Related papers (2022-08-16T08:02:30Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- Intriguing Properties of Vision Transformers [114.28522466830374]
Vision transformers (ViT) have demonstrated impressive performance across various machine vision problems.
We systematically study this question via an extensive set of experiments and comparisons with a high-performing convolutional neural network (CNN).
We show that the effective features of ViTs are due to flexible and dynamic receptive fields made possible by the self-attention mechanism.
arXiv Detail & Related papers (2021-05-21T17:59:18Z)