Supervised Fine-tuning in turn Improves Visual Foundation Models
- URL: http://arxiv.org/abs/2401.10222v2
- Date: Thu, 11 Apr 2024 17:59:42 GMT
- Title: Supervised Fine-tuning in turn Improves Visual Foundation Models
- Authors: Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan
- Abstract summary: A two-stage method, ViSFT (Vision SFT), is proposed to unleash the fine-grained knowledge of vision foundation models.
A vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks.
- Score: 74.1760864718129
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have sought to introduce region-level visual learning into CLIP's pretraining, but they face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing, such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generalization of vision foundation models after their pretraining. A two-stage method, ViSFT (Vision SFT), is thus proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on several in-domain tasks and is then tested on out-of-domain benchmarks. After less than two days of ViSFT updates on 8 V100 GPUs, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks, including vision and vision-linguistic scenarios.
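As a concrete illustration of the two-stage recipe, here is a minimal PyTorch sketch: stage one trains task-specific heads on in-domain tasks while the backbone stays frozen; stage two jointly fine-tunes the backbone across those tasks before out-of-domain evaluation. The tiny backbone, linear heads, and random batches are hypothetical stand-ins, and unfreezing the whole backbone in stage two is a simplification (a parameter-efficient update such as LoRA would serve the same role while preserving the original weights).

```python
import torch
from torch import nn

class TinyBackbone(nn.Module):
    """Stand-in for a large pretrained vision transformer."""
    def __init__(self, dim=64):
        super().__init__()
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.blocks(tokens).mean(dim=1)                   # pooled feature

backbone = TinyBackbone()
# Hypothetical in-domain task heads (classification stands in for the real
# detection / segmentation / captioning heads).
heads = {task: nn.Linear(64, 10) for task in ("det", "seg", "cap")}

def in_domain_batch(task):
    """Hypothetical loader: random images and labels."""
    return torch.randn(8, 3, 64, 64), torch.randint(0, 10, (8,))

criterion = nn.CrossEntropyLoss()

# Stage 1: freeze the backbone, train only the task-specific heads.
for p in backbone.parameters():
    p.requires_grad_(False)
head_params = [p for h in heads.values() for p in h.parameters()]
opt = torch.optim.AdamW(head_params, lr=1e-3)
for _ in range(10):
    for task, head in heads.items():
        x, y = in_domain_batch(task)
        loss = criterion(head(backbone(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: joint learning that flows fine-grained knowledge into the
# backbone, which is afterwards evaluated on out-of-domain benchmarks.
# (Simplification: full unfreezing instead of a LoRA-style update.)
for p in backbone.parameters():
    p.requires_grad_(True)
opt = torch.optim.AdamW(backbone.parameters(), lr=1e-5)
for _ in range(10):
    for task, head in heads.items():
        x, y = in_domain_batch(task)
        loss = criterion(head(backbone(x)), y)
        opt.zero_grad(); loss.backward(); opt.step()
```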
Related papers
- Building 6G Radio Foundation Models with Transformer Architectures [6.70088826174291] (arXiv 2024-11-15)
Foundation deep learning (DL) models are designed to learn general, robust and adaptable representations of their target modality.
These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL).
We propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning.
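As one plausible instantiation of the SSL pretraining mentioned above, the sketch below masks random patches of a spectrogram and trains a small transformer to reconstruct them; the pretext task, patch size, and masking ratio are assumptions, not necessarily this paper's exact recipe.

```python
import torch
from torch import nn
import torch.nn.functional as F

patch, dim = 16, 128
patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
to_pixels = nn.Linear(dim, patch * patch)              # per-patch reconstruction head

spec = torch.randn(4, 1, 128, 128)                     # batch of log-spectrograms
tokens = patch_embed(spec).flatten(2).transpose(1, 2)  # (B, N=64, dim)

mask = torch.rand(tokens.shape[:2]) < 0.75             # hide 75% of patches
# Zeroing masked tokens keeps the sketch short; a learned mask token is more common.
tokens = torch.where(mask[..., None], torch.zeros_like(tokens), tokens)

pred = to_pixels(encoder(tokens))                      # (B, N, patch*patch)
target = F.unfold(spec, patch, stride=patch).transpose(1, 2)
loss = ((pred - target) ** 2)[mask].mean()             # loss on masked patches only
loss.backward()
```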
- ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348] (arXiv 2024-09-23)
We study the visual generalization capabilities of three existing robotic foundation models.
Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios.
We propose a gradual backbone reversal approach founded on model merging.
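A minimal sketch of backbone reversal via model merging: interpolate the robot-finetuned visual backbone back toward its pretrained weights in steps, re-adapting the downstream head between steps. The linear schedule and stand-in modules are illustrative assumptions, not ReVLA's exact procedure.

```python
import copy
import torch

pretrained_backbone = torch.nn.Linear(16, 16)           # stand-in for the original ViT
finetuned_backbone = copy.deepcopy(pretrained_backbone)
with torch.no_grad():                                   # pretend fine-tuning drifted the weights
    for p in finetuned_backbone.parameters():
        p.add_(0.1 * torch.randn_like(p))

@torch.no_grad()
def merge_backbones(finetuned, pretrained, alpha):
    """Return a copy whose weights are (1 - alpha) * finetuned + alpha * pretrained."""
    merged = copy.deepcopy(finetuned)
    pre = dict(pretrained.named_parameters())
    for name, p in merged.named_parameters():
        p.lerp_(pre[name], alpha)                       # p <- p + alpha * (pre - p)
    return merged

# Gradual reversal: step back toward the pretrained weights, re-adapting the
# downstream (e.g., action) head after each partial merge.
for alpha in (0.25, 0.5, 0.75, 1.0):
    backbone = merge_backbones(finetuned_backbone, pretrained_backbone, alpha)
    # ... briefly re-train the task head on top of the merged `backbone` ...
```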
- Can Visual Foundation Models Achieve Long-term Point Tracking? [37.95592121632532] (arXiv 2024-08-24)
We evaluate the geometric awareness of visual foundation models in the context of point tracking.
Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings.
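The zero-shot correspondence setting above reduces to nearest-neighbor matching in feature space. The sketch below matches a query point's descriptor from one frame against all patch descriptors of the next frame by cosine similarity; the random feature maps are stand-ins for frozen Stable Diffusion or DINOv2 features.

```python
import torch
import torch.nn.functional as F

def track_point(feats_a, feats_b, query_xy):
    """feats_*: (C, H, W) dense features; query_xy: (x, y) on the feature grid."""
    C, H, W = feats_a.shape
    q = feats_a[:, query_xy[1], query_xy[0]]                 # (C,) query descriptor
    sim = F.cosine_similarity(q[:, None], feats_b.reshape(C, -1), dim=0)
    idx = sim.argmax().item()                                # best match in frame B
    return idx % W, idx // W                                 # back to (x, y)

# Random maps stand in for frozen foundation-model feature grids.
feats_a, feats_b = torch.randn(2, 32, 14, 14)
print(track_point(feats_a, feats_b, query_xy=(3, 5)))
```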
- Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models [32.83187649097727] (arXiv 2024-04-18)
We build a dataset of over four million multi-view image-text pairs across more than 100K objects.
We design a novel fine-tuning framework named Omniview-Tuning (OVT).
OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy.
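A hedged sketch of a minimax-like alignment objective: among the available viewpoints of each object, take the one whose embedding is farthest from the anchor view (inner max) and minimize that distance (outer min). The encoder, view counts, and cosine distance are illustrative assumptions rather than OVT's exact formulation.

```python
import torch
import torch.nn.functional as F
from torch import nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stand-in encoder
opt = torch.optim.AdamW(encoder.parameters(), lr=1e-4)

anchor = torch.randn(8, 3, 32, 32)          # canonical view of 8 objects
views = torch.randn(8, 6, 3, 32, 32)        # 6 alternative viewpoints per object

z_a = F.normalize(encoder(anchor), dim=-1)                          # (8, 128)
z_v = F.normalize(encoder(views.flatten(0, 1)), dim=-1).view(8, 6, -1)
dist = 1 - (z_v * z_a[:, None]).sum(-1)     # cosine distance to anchor, (8, 6)
loss = dist.max(dim=1).values.mean()        # minimize the worst-case viewpoint
opt.zero_grad(); loss.backward(); opt.step()
```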
- ViT-Lens: Towards Omni-modal Representations [64.66508684336614] (arXiv 2023-11-27)
ViT-Lens-2 is a framework for representation learning across a growing set of modalities.
We show that ViT-Lens-2 can learn representations for 3D point clouds, depth, audio, tactile, and EEG data.
By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
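The core idea can be pictured as a trainable "lens" that maps a new modality into the token space of a frozen pretrained ViT, supervised by alignment with a paired anchor embedding. The sketch below uses a point-cloud stand-in; all module sizes and the cosine alignment loss are assumptions.

```python
import torch
from torch import nn
import torch.nn.functional as F

dim = 128
lens = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))  # trainable lens
frozen_vit = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True), num_layers=2)
for p in frozen_vit.parameters():            # the pretrained ViT stays frozen
    p.requires_grad_(False)

points = torch.randn(4, 1024, 3)             # point-cloud batch (new modality)
anchor = torch.randn(4, dim)                  # paired anchor features (e.g., image/text)

tokens = lens(points)                         # map the modality into ViT token space
z = frozen_vit(tokens).mean(dim=1)            # pooled omni-modal embedding
loss = 1 - F.cosine_similarity(z, anchor, dim=-1).mean()  # align with the anchor
loss.backward()                               # gradients reach only the lens
```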
- Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494] (arXiv 2023-10-05)
EVLGen is a framework designed for the pre-training of visually conditioned language generation models.
We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
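One way redundancy reduction can cut cost is by shrinking the visual token sequence before the expensive language-generation stage. The sketch below keeps only the top-k tokens under a simple saliency proxy; the scoring rule is an illustrative assumption, not the paper's mechanism.

```python
import torch

def reduce_tokens(visual_tokens, keep=32):
    """(B, N, C) -> (B, keep, C): keep the highest-scoring visual tokens."""
    scores = visual_tokens.norm(dim=-1)              # (B, N) saliency proxy
    idx = scores.topk(keep, dim=1).indices           # indices of kept tokens
    idx = idx[..., None].expand(-1, -1, visual_tokens.size(-1))
    return torch.gather(visual_tokens, 1, idx)

tokens = torch.randn(4, 256, 768)     # e.g., 256 patch tokens from a ViT
print(reduce_tokens(tokens).shape)    # torch.Size([4, 32, 768])
```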
- PointVST: Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation [64.858505571083] (arXiv 2022-12-29)
This paper proposes a translative pre-training framework, namely PointVST.
It is driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images.
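A minimal sketch of the cross-modal translation pretext: pool a global point-cloud feature, condition it on a sampled viewpoint, and decode the corresponding 2D rendered image under a reconstruction loss. Encoder/decoder shapes and the learned view embedding are illustrative assumptions.

```python
import torch
from torch import nn

point_encoder = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, 128))
view_embed = nn.Embedding(6, 128)                       # 6 candidate viewpoints
decoder = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(),
                        nn.Linear(1024, 32 * 32))       # tiny 32x32 grayscale "render"

points = torch.randn(4, 2048, 3)                        # point-cloud batch
view_id = torch.randint(0, 6, (4,))                     # sampled rendering viewpoint
target = torch.rand(4, 32 * 32)                         # matching 2D rendered images

feat = point_encoder(points).max(dim=1).values          # global point feature (4, 128)
cond = torch.cat([feat, view_embed(view_id)], dim=-1)   # view-specific conditioning
loss = nn.functional.mse_loss(decoder(cond), target)    # translate points -> image
loss.backward()
```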
- Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411] (arXiv 2021-06-17)
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity.
We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies.
Our results show that, combining the two techniques, EsViT achieves 81.3% top-1 accuracy on the ImageNet linear probe evaluation.
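A simplified sketch of region matching: for each patch token of one augmented view, find its best-matching token in the other view by cosine similarity and pull the matched pair together. EsViT's actual objective is a DINO-style student-teacher loss at the matched positions; the cosine loss and random tokens here are stand-ins.

```python
import torch
import torch.nn.functional as F

proj = torch.nn.Linear(192, 192)                          # stand-in student projection
t1 = F.normalize(proj(torch.randn(8, 49, 192)), dim=-1)   # view-1 region tokens (B, N, C)
t2 = F.normalize(torch.randn(8, 49, 192), dim=-1)         # view-2 (teacher) tokens

sim = t1 @ t2.transpose(1, 2)                  # (B, N, N) token-to-token similarity
match = sim.argmax(dim=-1)                     # best view-2 token per view-1 token
matched = torch.gather(t2, 1, match[..., None].expand(-1, -1, 192))
loss = (1 - (t1 * matched).sum(-1)).mean()     # pull matched regions together
loss.backward()
```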