Stitched ViTs are Flexible Vision Backbones
- URL: http://arxiv.org/abs/2307.00154v2
- Date: Tue, 28 Nov 2023 02:28:21 GMT
- Title: Stitched ViTs are Flexible Vision Backbones
- Authors: Zizheng Pan, Jing Liu, Haoyu He, Jianfei Cai, Bohan Zhuang
- Abstract summary: We are inspired by stitchable neural networks (SN-Net) to produce a single model that covers rich subnetworks by stitching pretrained model families.
We introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation.
SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone.
- Score: 51.441023711924835
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large pretrained plain vision Transformers (ViTs) have been the workhorse for
many downstream tasks. However, existing works utilizing off-the-shelf ViTs are
inefficient in terms of training and deployment, because adopting ViTs of
individual sizes requires separate training runs and is restricted by fixed
performance-efficiency trade-offs. In this paper, we draw inspiration from
stitchable neural networks (SN-Net), a recent framework that cheaply produces
a single model covering rich subnetworks by stitching pretrained model
families, supporting diverse performance-efficiency trade-offs at runtime.
Building upon this foundation, we introduce SN-Netv2, a systematically improved
model stitching framework to facilitate downstream task adaptation.
Specifically, we first propose a two-way stitching scheme to enlarge the
stitching space. We then design a resource-constrained sampling strategy that
takes into account the underlying FLOPs distributions in the space for better
sampling. Finally, we observe that learning stitching layers as a low-rank
update plays an essential role in downstream tasks to stabilize training and
ensure a good Pareto frontier. With extensive experiments on ImageNet-1K,
ADE20K, COCO-Stuff-10K and NYUv2, SN-Netv2 demonstrates superior performance
over SN-Netv1 on downstream dense predictions and shows strong ability as a
flexible vision backbone, achieving great advantages in both training
efficiency and deployment flexibility. Code is available at
https://github.com/ziplab/SN-Netv2.
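To make the stitching idea more concrete, the following is a minimal PyTorch-style sketch of a stitching layer learned as a low-rank update, in the spirit described in the abstract. It is not taken from the SN-Netv2 codebase; the class name, dimensions, and initialization scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LowRankStitchingLayer(nn.Module):
    """Hypothetical sketch of a stitching layer between two anchor ViTs.

    A base linear projection maps tokens from the source anchor's width to the
    target anchor's width and is kept frozen; a trainable low-rank update
    (up @ down) is learned on the downstream task, echoing the low-rank
    stitching described in the abstract above.
    """

    def __init__(self, in_dim: int, out_dim: int, rank: int = 16):
        super().__init__()
        self.base = nn.Linear(in_dim, out_dim)
        self.base.weight.requires_grad_(False)   # freeze the base mapping
        self.base.bias.requires_grad_(False)
        self.down = nn.Linear(in_dim, rank, bias=False)   # trainable factors
        self.up = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.up.weight)            # low-rank update starts at zero

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, in_dim) from a block of the source anchor
        return self.base(tokens) + self.up(self.down(tokens))


# Usage sketch: route 384-dim tokens (e.g. a ViT-S block) into a 768-dim anchor (e.g. ViT-B).
stitch = LowRankStitchingLayer(in_dim=384, out_dim=768, rank=16)
x = torch.randn(2, 197, 384)
y = stitch(x)          # shape (2, 197, 768), ready for the larger anchor's blocks
print(y.shape)
```

A full stitching space would instantiate one such layer per allowed stitching position (in both directions under the two-way scheme) and sample stitches during training according to their FLOPs, but those pieces are omitted here for brevity.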
Related papers
- Intra-task Mutual Attention based Vision Transformer for Few-Shot Learning [12.5354658533836]
Humans possess a remarkable ability to accurately classify new, unseen images after being exposed to only a few examples.
For artificial neural network models, determining the most relevant features for distinguishing between two images with limited samples presents a challenge.
We propose an intra-task mutual attention method for few-shot learning that involves splitting the support and query samples into patches.
arXiv Detail & Related papers (2024-05-06T02:02:57Z) - Continual Learning: Forget-free Winning Subnetworks for Video Representations [75.40220771931132]
A Winning Subnetwork (WSN), selected in terms of task performance, is considered for various continual learning tasks.
It leverages pre-existing weights from dense networks to achieve efficient learning in Task Incremental Learning (TIL) and Task-agnostic Incremental Learning (TaIL) scenarios.
The use of a Fourier Subneural Operator (FSO) within WSN is considered for Video Incremental Learning (VIL).
arXiv Detail & Related papers (2023-12-19T09:11:49Z) - Efficient Stitchable Task Adaptation [47.94819192325723]
We present a novel framework, Efficient Stitchable Task Adaptation (ESTA), to efficiently produce a palette of fine-tuned models.
Specifically, we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches.
We streamline a simple yet effective one-stage deployment pipeline, which estimates the important stitches to deploy.
arXiv Detail & Related papers (2023-11-29T04:31:35Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - Stitchable Neural Networks [40.8842135978138]
We present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment.
SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another.
Experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks.
arXiv Detail & Related papers (2023-02-13T18:37:37Z) - Reinforcement Learning with Action-Free Pre-Training from Videos [95.25074614579646]
We introduce a framework that learns representations useful for understanding the dynamics via generative pre-training on videos.
Our framework significantly improves both the final performance and sample efficiency of vision-based reinforcement learning.
arXiv Detail & Related papers (2022-03-25T19:44:09Z) - Simultaneous Training of Partially Masked Neural Networks [67.19481956584465]
We show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split off from the trained full network with remarkably good performance.
We show that training a Transformer with a low-rank core gives a low-rank model with performance superior to training the low-rank model alone.
arXiv Detail & Related papers (2021-06-16T15:57:51Z)