EA-ViT: Efficient Adaptation for Elastic Vision Transformer
- URL: http://arxiv.org/abs/2507.19360v1
- Date: Fri, 25 Jul 2025 15:11:09 GMT
- Title: EA-ViT: Efficient Adaptation for Elastic Vision Transformer
- Authors: Chen Zhu, Wangbo Zhao, Huiwen Zhang, Samir Khaki, Yuhao Zhou, Weidong Tang, Shuo Wang, Zhihang Yuan, Yuzhang Shang, Xiaojiang Peng, Kai Wang, Dawei Yang
- Abstract summary: Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. We propose an efficient ViT adaptation framework that enables a single process to generate multiple models of varying sizes for deployment on platforms with various resource constraints.
- Score: 37.442677800641164
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have emerged as a foundational model in computer vision, excelling in generalization and adaptation to downstream tasks. However, deploying ViTs to support diverse resource constraints typically requires retraining multiple, size-specific ViTs, which is both time-consuming and energy-intensive. To address this issue, we propose an efficient ViT adaptation framework that enables a single adaptation process to generate multiple models of varying sizes for deployment on platforms with various resource constraints. Our approach comprises two stages. In the first stage, we enhance a pre-trained ViT with a nested elastic architecture that enables structural flexibility across MLP expansion ratio, number of attention heads, embedding dimension, and network depth. To preserve pre-trained knowledge and ensure stable adaptation, we adopt a curriculum-based training strategy that progressively increases elasticity. In the second stage, we design a lightweight router to select submodels according to computational budgets and downstream task demands. Initialized with Pareto-optimal configurations derived via a customized NSGA-II algorithm, the router is then jointly optimized with the backbone. Extensive experiments on multiple benchmarks demonstrate the effectiveness and versatility of EA-ViT. The code is available at https://github.com/zcxcf/EA-ViT.
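The abstract describes two ideas that are easy to picture in code: a nested elastic backbone whose submodels are leading slices of shared weights, and a lightweight router that maps a compute budget to a submodel configuration. The sketch below is a minimal, hypothetical PyTorch illustration of those two ideas only; class names such as `NestedElasticMLP` and `BudgetRouter` are placeholders and do not come from the EA-ViT code release, and the curriculum training and NSGA-II initialization described in the paper are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NestedElasticMLP(nn.Module):
    """Transformer MLP block whose hidden width is an elastic, leading slice
    of a single shared weight matrix (a nested, matryoshka-style submodel)."""
    def __init__(self, dim: int, max_ratio: float = 4.0):
        super().__init__()
        self.dim = dim
        self.max_hidden = int(dim * max_ratio)
        self.fc1 = nn.Linear(dim, self.max_hidden)
        self.fc2 = nn.Linear(self.max_hidden, dim)

    def forward(self, x: torch.Tensor, ratio: float) -> torch.Tensor:
        h = max(1, int(self.dim * ratio))                    # active hidden units
        y = F.linear(x, self.fc1.weight[:h], self.fc1.bias[:h])
        y = F.gelu(y)
        return F.linear(y, self.fc2.weight[:, :h], self.fc2.bias)

class BudgetRouter(nn.Module):
    """Toy router: scores a small set of candidate expansion ratios given a
    scalar compute budget and returns the highest-scoring one."""
    def __init__(self, candidate_ratios):
        super().__init__()
        self.register_buffer("ratios", torch.tensor(candidate_ratios))
        self.score = nn.Linear(1, len(candidate_ratios))     # trainable jointly with the backbone

    def forward(self, budget: torch.Tensor) -> float:
        logits = self.score(budget.view(-1, 1))
        return float(self.ratios[logits.argmax(dim=-1)])

# The same shared weights serve a wide MLP under a large budget and a narrow
# one under a small budget, without retraining separate models.
mlp = NestedElasticMLP(dim=192)
router = BudgetRouter(candidate_ratios=[1.0, 2.0, 4.0])
x = torch.randn(2, 16, 192)                                  # (batch, tokens, dim)
ratio = router(torch.tensor([0.3]))                          # normalized budget (illustrative)
print(mlp(x, ratio).shape)                                   # torch.Size([2, 16, 192])
```

Per the abstract, EA-ViT applies this kind of slicing jointly to the MLP expansion ratio, attention heads, embedding dimension, and depth, and initializes the router from Pareto-optimal configurations found with a customized NSGA-II before jointly optimizing it with the backbone.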
Related papers
- MOR-VIT: Efficient Vision Transformer with Mixture-of-Recursions [1.0411839100853515]
MoR-ViT is a novel vision transformer framework that incorporates a token-level dynamic recursion mechanism. Experiments on ImageNet-1K and transfer benchmarks demonstrate that MoR-ViT achieves state-of-the-art accuracy with up to 70% parameter reduction and 2.5x inference acceleration.
arXiv Detail & Related papers (2025-07-29T12:46:36Z)
- ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages [0.0]
Vision Transformers (ViTs) have revolutionized computer vision by leveraging self-attention to model long-range dependencies. We propose the Efficient Convolutional Vision Transformer (ECViT), a hybrid architecture that effectively combines the strengths of CNNs and Transformers.
arXiv Detail & Related papers (2025-04-21T03:00:17Z)
- ConfigX: Modular Configuration for Evolutionary Algorithms via Multitask Reinforcement Learning [24.66031883340297]
We introduce ConfigX, a new paradigm of the MetaBBO framework that is capable of learning a universal configuration agent (model) for boosting diverse EAs. Our ConfigX, after large-scale pre-training, achieves robust zero-shot generalization to unseen tasks and outperforms state-of-the-art baselines.
arXiv Detail & Related papers (2024-12-10T13:43:51Z)
- Slicing Vision Transformer for Flexible Inference [79.35046907288518]
We propose a general framework, named Scala, to enable a single network to represent multiple smaller ViTs. Scala achieves an average improvement of 1.6% on ImageNet-1K with fewer parameters.
arXiv Detail & Related papers (2024-12-06T05:31:42Z)
- PRANCE: Joint Token-Optimization and Structural Channel-Pruning for Adaptive ViT Inference [44.77064952091458]
PRANCE is a Vision Transformer compression framework that jointly optimizes the activated channels and reduces tokens based on the characteristics of the inputs.
We introduce a novel "Result-to-Go" training mechanism that models ViTs' inference process as a sequential decision process.
Our framework is shown to be compatible with various token optimization techniques such as pruning, merging, and pruning-merging strategies.
arXiv Detail & Related papers (2024-07-06T09:04:27Z)
- Flextron: Many-in-One Flexible Large Language Model [85.93260172698398]
We introduce Flextron, a network architecture and post-training model optimization framework supporting flexible model deployment.
We present a sample-efficient training method and associated routing algorithms for transforming an existing trained LLM into a Flextron model.
We demonstrate superior performance over multiple end-to-end trained variants and other state-of-the-art elastic networks, all with a single pretraining run that consumes a mere 7.63% of the tokens used in the original pretraining.
arXiv Detail & Related papers (2024-06-11T01:16:10Z)
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
- ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z)
- Stitched ViTs are Flexible Vision Backbones [51.441023711924835]
We are inspired by stitchable neural networks (SN-Net) to produce a single model that covers rich subnetworks by stitching pretrained model families.
We introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation.
SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone.
arXiv Detail & Related papers (2023-06-30T22:05:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.