TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter
- URL: http://arxiv.org/abs/2306.12642v1
- Date: Thu, 22 Jun 2023 03:00:24 GMT
- Authors: Binjie Zhang, Yixiao Ge, Xuyuan Xu, Ying Shan, Mike Zheng Shou
- Abstract summary: A growing number of applications based on visual foundation models are emerging.
In situations involving system upgrades, it becomes essential to re-train all downstream modules to adapt to the new foundation model.
We introduce a parameter-efficient and task-agnostic adapter, dubbed TaCA, that facilitates compatibility across distinct foundation models.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual foundation models like CLIP excel in learning feature representations
from extensive datasets through self-supervised methods, demonstrating
remarkable transfer learning and generalization capabilities. A growing number
of applications based on visual foundation models are emerging, including
innovative solutions such as BLIP-2. These applications employ pre-trained CLIP
models as upstream feature extractors and train various downstream modules to
accomplish diverse tasks. In situations involving system upgrades that require
updating the upstream foundation model, it becomes essential to re-train all
downstream modules to adapt to the new foundation model, which is inflexible
and inefficient. In this paper, we introduce a parameter-efficient and
task-agnostic adapter, dubbed TaCA, that facilitates compatibility across
distinct foundation models while ensuring enhanced performance for the new
models. TaCA allows downstream applications to seamlessly integrate
better-performing foundation models without necessitating retraining. We
conduct extensive experimental validation of TaCA using different scales of
models with up to one billion parameters on various tasks such as video-text
retrieval, video recognition, and visual question answering. The results
consistently demonstrate the emergent ability of TaCA on hot-plugging upgrades
for visual foundation models. Codes and models will be available at
https://github.com/TencentARC/TaCA.
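The compatibility recipe in the abstract lends itself to a short sketch. Below is a minimal, hypothetical PyTorch illustration of the idea as we read it: a small adapter is trained on top of the new, frozen foundation model so that its outputs land in the old model's feature space, letting frozen downstream modules consume the new backbone unchanged. The module names, dimensions, and the cosine alignment loss are our assumptions, not the paper's exact recipe.

```python
# Hypothetical sketch of the hot-plugging compatibility idea from the
# abstract; module names, sizes, and the alignment loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CompatAdapter(nn.Module):
    """Small bottleneck adapter mapping new-backbone features into the
    old backbone's feature space."""
    def __init__(self, new_dim: int, old_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(new_dim, bottleneck)
        self.up = nn.Linear(bottleneck, old_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.up(F.gelu(self.down(x)))

def alignment_loss(new_feat: torch.Tensor, old_feat: torch.Tensor) -> torch.Tensor:
    # Pull adapted new features toward the old feature space so frozen
    # downstream heads keep working; cosine alignment is one plausible
    # choice, not necessarily the paper's.
    return 1.0 - F.cosine_similarity(new_feat, old_feat, dim=-1).mean()

# Frozen placeholder "backbones" standing in for e.g. CLIP ViT-B / ViT-L.
old_backbone = nn.Linear(224, 512).requires_grad_(False)
new_backbone = nn.Linear(224, 1024).requires_grad_(False)
adapter = CompatAdapter(new_dim=1024, old_dim=512)

opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
images = torch.randn(8, 224)  # placeholder image batch
loss = alignment_loss(adapter(new_backbone(images)), old_backbone(images))
loss.backward()
opt.step()
```

At deployment, only the adapter ships alongside the new backbone; downstream modules trained against the old feature space stay untouched, which is what makes the upgrade hot-pluggable.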
Related papers
- EMR-Merging: Tuning-Free High-Performance Model Merging (arXiv, 2024-05-23)
We show that Elect, Mask & Rescale-Merging (EMR-Merging) achieves outstanding performance compared to existing merging methods. EMR-Merging is tuning-free: it requires no access to data and no additional training.
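Reading only the title and summary, the elect/mask/rescale recipe can be sketched roughly as follows; the particular statistics chosen here (sign election by summation, max-magnitude election, mean-magnitude rescaling) are our assumptions.

```python
# Rough sketch of an elect/mask/rescale merge over per-task weight
# deltas; the exact statistics are assumptions based on the summary.
import torch

def emr_merge(pretrained: torch.Tensor, finetuned: list[torch.Tensor]):
    taus = [w - pretrained for w in finetuned]   # per-task deltas
    stacked = torch.stack(taus)                  # [num_tasks, ...]
    # Elect: a unified sign per parameter, plus the largest magnitude
    # among deltas that agree with that sign.
    sign = torch.sign(stacked.sum(dim=0))
    agree = torch.sign(stacked) == sign
    tau_uni = sign * (stacked.abs() * agree).amax(dim=0)
    # Mask: per-task binary masks where the task delta agrees in sign.
    masks = [a.float() for a in agree]
    # Rescale: per-task scalar matching the average magnitude.
    scales = [
        t.abs().mean() / (m * tau_uni).abs().mean().clamp_min(1e-8)
        for t, m in zip(taus, masks)
    ]
    return tau_uni, masks, scales

# Usage: reconstruct task-specific weights on the fly, nothing tuned.
pre = torch.randn(4, 4)
fts = [pre + 0.1 * torch.randn(4, 4) for _ in range(3)]
tau_uni, masks, scales = emr_merge(pre, fts)
w_task0 = pre + scales[0] * masks[0] * tau_uni
```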
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters (arXiv, 2024-03-18)
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
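A minimal sketch of what a mixture-of-experts adapter on a frozen backbone can look like; the bottleneck size, top-1 routing, and expert-growth policy are illustrative assumptions, and the paper's Distribution Discriminative Auto-Selector is not reproduced here.

```python
# Minimal mixture-of-experts adapter layer; dimensions, top-1 routing,
# and the expert growth policy are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdapterExpert(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return self.up(F.gelu(self.down(x)))

class MoEAdapter(nn.Module):
    """Routes each input to one expert; new experts can be appended
    when a new task arrives, leaving old experts (and CLIP) frozen."""
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        self.experts = nn.ModuleList(AdapterExpert(dim) for _ in range(num_experts))
        self.router = nn.Linear(dim, num_experts)

    def add_expert(self, dim: int):
        self.experts.append(AdapterExpert(dim))
        old = self.router
        self.router = nn.Linear(dim, len(self.experts))
        with torch.no_grad():  # keep the old routing logits intact
            self.router.weight[: old.out_features] = old.weight
            self.router.bias[: old.out_features] = old.bias

    def forward(self, x):  # x: [batch, dim]
        idx = self.router(x).argmax(dim=-1)  # hard top-1 routing
        out = torch.stack([self.experts[i](xi) for i, xi in zip(idx.tolist(), x)])
        return x + out  # residual adapter

layer = MoEAdapter(dim=512)
y = layer(torch.randn(8, 512))
```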
- A Lightweight Feature Fusion Architecture for Resource-Constrained Crowd Counting (arXiv, 2024-01-11)
We introduce two lightweight models to enhance the versatility of crowd-counting models.
These models maintain the same downstream architecture while incorporating two distinct backbones: MobileNet and MobileViT.
We leverage Adjacent Feature Fusion to extract diverse scale features from a Pre-Trained Model (PTM) and subsequently combine these features seamlessly.
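As a hedged sketch of adjacent-scale fusion over frozen pre-trained feature maps (the 1x1-conv fusion operator and the density-map head are our assumptions, not the paper's design):

```python
# Illustrative adjacent-scale fusion over frozen backbone feature maps;
# channel sizes and the 1x1-conv fusion are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdjacentFusion(nn.Module):
    def __init__(self, channels: list[int], out_ch: int = 64):
        super().__init__()
        self.proj = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        self.head = nn.Conv2d(out_ch, 1, 1)  # density map head

    def forward(self, feats):  # feats ordered deep -> shallow
        x = self.proj[0](feats[0])
        for proj, f in zip(list(self.proj)[1:], feats[1:]):
            # Upsample the coarser map and fuse with the adjacent scale.
            x = F.interpolate(x, size=f.shape[-2:], mode="bilinear",
                              align_corners=False)
            x = x + proj(f)
        return self.head(x)

# Fake multi-scale features as a frozen MobileNet/MobileViT might yield.
feats = [torch.randn(1, 256, 8, 8), torch.randn(1, 128, 16, 16),
         torch.randn(1, 64, 32, 32)]
density = AdjacentFusion([256, 128, 64])(feats)
count = density.sum()  # predicted crowd count
```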
- ZhiJian: A Unifying and Rapidly Deployable Toolbox for Pre-trained Model Reuse (arXiv, 2023-08-17)
This paper introduces ZhiJian, a comprehensive and user-friendly toolbox for model reuse, utilizing the PyTorch backend.
ZhiJian presents a novel paradigm that unifies diverse perspectives on model reuse: constructing the target architecture from a PTM, tuning the target model with a PTM, and PTM-based inference.
- Universal Domain Adaptation from Foundation Models: A Baseline Study (arXiv, 2023-05-18)
We conduct empirical studies of state-of-the-art universal domain adaptation (UniDA) methods using foundation models. We introduce CLIP distillation, a parameter-free method specifically designed to distill target knowledge from CLIP models.
Although simple, our method outperforms previous approaches in most benchmark tasks.
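"Parameter-free" plausibly means the distillation introduces no learnable components on the CLIP side: CLIP's zero-shot predictions over the known class names serve as soft targets for the target-domain student. A sketch under that assumption, using CLIP-style encode_image/encode_text conventions (the prompt template and the KL loss are our choices):

```python
# Hypothetical CLIP-distillation step: CLIP zero-shot probabilities on
# unlabeled target images act as soft labels; nothing new is learned
# on the CLIP side. Assumes a CLIP-style model/tokenizer interface.
import torch
import torch.nn.functional as F

def clip_soft_labels(clip_model, tokenizer, images, class_names, device):
    prompts = tokenizer([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        img = F.normalize(clip_model.encode_image(images), dim=-1)
        txt = F.normalize(clip_model.encode_text(prompts), dim=-1)
    return (100.0 * img @ txt.T).softmax(dim=-1)  # [batch, num_classes]

def distill_step(student, clip_model, tokenizer, images, class_names,
                 optimizer, device):
    targets = clip_soft_labels(clip_model, tokenizer, images,
                               class_names, device)
    logits = student(images)
    # Match the student's predictive distribution to CLIP's.
    loss = F.kl_div(F.log_softmax(logits, dim=-1), targets,
                    reduction="batchmean")
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```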
- Towards Efficient Task-Driven Model Reprogramming with Foundation Models (arXiv, 2023-04-05)
Vision foundation models exhibit impressive power, benefiting from the extremely large model capacity and broad training data.
However, in practice, downstream scenarios may only support a small model due to the limited computational resources or efficiency considerations.
This brings a critical challenge for the real-world application of foundation models: one has to transfer the knowledge of a foundation model to the downstream task.
- eP-ALM: Efficient Perceptual Augmentation of Language Models (arXiv, 2023-03-20)
Existing approaches for adapting pretrained models to vision-language tasks still rely on several key components that hinder their efficiency. We instead direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
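The summary is concrete enough to sketch: freeze both the language model and the visual encoder, train a single linear projection from visual features into the LM embedding space, and prepend one learnable token. The wiring below is schematic; module names and shapes are placeholders, not eP-ALM's actual code.

```python
# Schematic eP-ALM-style wiring: nearly all parameters frozen, one
# linear projection plus one soft token trained. Names are placeholders.
import torch
import torch.nn as nn

class PerceptualLM(nn.Module):
    def __init__(self, lm, visual_encoder, vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm, self.visual = lm, visual_encoder
        for p in self.lm.parameters():
            p.requires_grad_(False)        # frozen language model
        for p in self.visual.parameters():
            p.requires_grad_(False)        # frozen visual encoder
        self.proj = nn.Linear(vis_dim, lm_dim)                     # trainable
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # trainable

    def forward(self, images, text_embeds):
        # Prepend the soft token and the projected visual feature to
        # the text embeddings, then run the frozen LM.
        vis = self.proj(self.visual(images)).unsqueeze(1)  # [B, 1, D]
        tok = self.soft_token.expand(images.size(0), -1, -1)
        return self.lm(torch.cat([tok, vis, text_embeds], dim=1))

# Placeholder encoders; in practice these would be e.g. OPT and a ViT.
lm = nn.Sequential(nn.Linear(768, 768))
visual = nn.Linear(3 * 32 * 32, 512)
model = PerceptualLM(lm, visual, vis_dim=512, lm_dim=768)
trainable = [n for n, p in model.named_parameters() if p.requires_grad]
# -> only 'proj.*' and 'soft_token' remain trainable
```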