Related papers: Supervised Fine-tuning in turn Improves Visual Foundation Models

Supervised Fine-tuning in turn Improves Visual Foundation Models

URL: http://arxiv.org/abs/2401.10222v2
Date: Thu, 11 Apr 2024 17:59:42 GMT
Title: Supervised Fine-tuning in turn Improves Visual Foundation Models
Authors: Xiaohu Jiang, Yixiao Ge, Yuying Ge, Dachuan Shi, Chun Yuan, Ying Shan,
Abstract summary: Two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models. Vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks.
Score: 74.1760864718129
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Image-text training like CLIP has dominated the pretraining of vision foundation models in recent years. Subsequent efforts have been made to introduce region-level visual learning into CLIP's pretraining but face scalability challenges due to the lack of large-scale region-level datasets. Drawing inspiration from supervised fine-tuning (SFT) in natural language processing such as instruction tuning, we explore the potential of fine-grained SFT in enhancing the generation of vision foundation models after their pretraining. Thus a two-stage method ViSFT (Vision SFT) is proposed to unleash the fine-grained knowledge of vision foundation models. In ViSFT, the vision foundation model is enhanced by performing visual joint learning on some in-domain tasks and then tested on out-of-domain benchmarks. With updating using ViSFT on 8 V100 GPUs in less than 2 days, a vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks including vision and vision-linguistic scenarios.

Related papers

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models [42.79282247484499]
Vision-language models (VLMs) have made substantial progress across a wide range of visual question answering benchmarks, spanning visual reasoning, document understanding, and multimodal dialogue.<n>Recent works show that these models trail behind in traditional image classification benchmarks, which test fine-grained visual knowledge.<n>We test a large number of recent VLMs on fine-grained classification benchmarks and identify potential factors in the disconnect between fine-grained knowledge and other vision benchmarks.
arXiv Detail & Related papers (2026-02-19T22:07:29Z)
VESSA: Video-based objEct-centric Self-Supervised Adaptation for Visual Foundation Models [0.18665975431697424]
Foundation models have advanced computer vision by enabling strong performance across diverse tasks through large-scale pretraining and supervised fine-tuning.<n>We introduce a novel formulation of self-supervised fine-tuning for vision foundation models, where the model is adapted to a new domain without requiring annotations.<n>Our method is referred to as VESSA: Video-based objEct-centric Self-Supervised Adaptation for visual foundation models.
arXiv Detail & Related papers (2025-10-23T20:44:28Z)
Visual Instruction Pretraining for Domain-Specific Foundation Models [57.71527725761518]
We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception.<n>ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data.<n> experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance.
arXiv Detail & Related papers (2025-09-22T10:57:42Z)
UniViTAR: Unified Vision Transformer with Native Resolution [37.63387029787732]
We introduce UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenario. A progressive training paradigm is introduced, which strategically combines two core mechanisms. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model.
arXiv Detail & Related papers (2025-04-02T14:59:39Z)
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning [12.728451197053321]
We propose Curriculum Reinforcement Finetuning (Curr-ReFT), a novel post-training paradigm specifically designed for small-scale vision-language models (VLMs) Curr-ReFT comprises two sequential stages: Curriculum Reinforcement Learning and Rejected Sampling-based Self-improvement. Our experiments demonstrate that models trained with Curr-ReFT paradigm achieve state-of-the-art performance across various visual tasks.
arXiv Detail & Related papers (2025-03-10T08:48:50Z)
Building 6G Radio Foundation Models with Transformer Architectures [6.70088826174291]
Foundation deep learning (DL) models are designed to learn general, robust and adaptable representations of their target modality. These models are pretrained on large, unlabeled datasets using self-supervised learning (SSL) We propose and demonstrate the effectiveness of a Vision Transformer (ViT) as a radio foundation model for spectrogram learning.
arXiv Detail & Related papers (2024-11-15T07:01:44Z)
ReVLA: Reverting Visual Domain Limitation of Robotic Foundation Models [55.07988373824348]
We study the visual generalization capabilities of three existing robotic foundation models. Our study shows that the existing models do not exhibit robustness to visual out-of-domain scenarios. We propose a gradual backbone reversal approach founded on model merging.
arXiv Detail & Related papers (2024-09-23T17:47:59Z)
Can Visual Foundation Models Achieve Long-term Point Tracking? [37.95592121632532]
We evaluate the geometric awareness of visual foundation models in the context of point tracking. Our findings indicate that features from Stable Diffusion and DINOv2 exhibit superior geometric correspondence abilities in zero-shot settings.
arXiv Detail & Related papers (2024-08-24T12:58:08Z)
Omniview-Tuning: Boosting Viewpoint Invariance of Vision-Language Pre-training Models [32.83187649097727]
We build a dataset of over four million multi-view image-text pairs across more than 100K objects. We design a novel fine-tuning framework named Omniview-Tuning (OVT) OVT introduces a Cross-Viewpoint Alignment objective through a minimax-like optimization strategy.
arXiv Detail & Related papers (2024-04-18T12:41:33Z)
ViT-Lens: Towards Omni-modal Representations [64.66508684336614]
ViT-Lens-2 is a framework for representation learning of increasing modalities. We show that ViT-Lens-2 can learn representations for 3D point cloud, depth, audio, tactile and EEG. By seamlessly integrating ViT-Lens-2 into Multimodal Foundation Models, we enable Any-modality to Text and Image Generation.
arXiv Detail & Related papers (2023-11-27T18:52:09Z)
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction [61.16125290912494]
$textEVL_textGen$ is a framework designed for the pre-training of visually conditioned language generation models. We show that our approach accelerates the training of vision-language models by a factor of 5 without a noticeable impact on overall performance.
arXiv Detail & Related papers (2023-10-05T03:40:06Z)
PointVST: Self-Supervised Pre-training for 3D Point Clouds via View-Specific Point-to-Image Translation [64.858505571083]
This paper proposes a translative pre-training framework, namely PointVST. It is driven by a novel self-supervised pretext task of cross-modal translation from 3D point clouds to their corresponding diverse forms of 2D rendered images.
arXiv Detail & Related papers (2022-12-29T07:03:29Z)
Efficient Self-supervised Vision Transformers for Representation Learning [86.57557009109411]
We show that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity. We propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation.
arXiv Detail & Related papers (2021-06-17T19:57:33Z)

This list is automatically generated from the titles and abstracts of the papers in this site.