Visual Instruction Pretraining for Domain-Specific Foundation Models
- URL: http://arxiv.org/abs/2509.17562v2
- Date: Tue, 23 Sep 2025 04:33:22 GMT
- Title: Visual Instruction Pretraining for Domain-Specific Foundation Models
- Authors: Yuxuan Li, Yicheng Zhang, Wenhao Tang, Yimian Dai, Ming-Ming Cheng, Xiang Li, Jian Yang
- Abstract summary: We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data. Experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance.
- Score: 57.71527725761518
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Modern computer vision is converging on a closed loop in which perception, reasoning and generation mutually reinforce each other. However, this loop remains incomplete: the top-down influence of high-level reasoning on the foundational learning of low-level perceptual features is still underexplored. This paper addresses this gap by proposing a new paradigm for pretraining foundation models in downstream domains. We introduce Visual insTruction Pretraining (ViTP), a novel approach that directly leverages reasoning to enhance perception. ViTP embeds a Vision Transformer (ViT) backbone within a Vision-Language Model and pretrains it end-to-end using a rich corpus of visual instruction data curated from target downstream domains. ViTP is powered by our proposed Visual Robustness Learning (VRL), which compels the ViT to learn robust and domain-relevant features from a sparse set of visual tokens. Extensive experiments on 16 challenging remote sensing and medical imaging benchmarks demonstrate that ViTP establishes new state-of-the-art performance across a diverse range of downstream tasks. The code is available at https://github.com/zcablii/ViTP.
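The abstract describes the architecture only at a high level. As a rough, hedged sketch of the Visual Robustness Learning idea (keeping only a sparse subset of ViT patch tokens before they reach the language model), consider the PyTorch snippet below; the keep ratio, projector, and module names are illustrative assumptions, not the authors' released implementation (see the repository linked above for that).

```python
import torch
import torch.nn as nn

class SparseVisualProjector(nn.Module):
    """Illustrative sketch: randomly keep a sparse subset of ViT patch tokens
    (the core idea of Visual Robustness Learning as described in the abstract)
    and project them into the language model's embedding space."""

    def __init__(self, vit_dim: int = 768, llm_dim: int = 4096, keep_ratio: float = 0.25):
        super().__init__()
        self.keep_ratio = keep_ratio              # fraction of visual tokens retained (assumed value)
        self.proj = nn.Linear(vit_dim, llm_dim)   # ViT-feature -> LLM-embedding projector

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        # patch_tokens: (batch, num_patches, vit_dim) produced by the ViT backbone
        b, n, d = patch_tokens.shape
        k = max(1, int(n * self.keep_ratio))
        # Sample a random sparse subset of token indices per image.
        idx = torch.rand(b, n, device=patch_tokens.device).argsort(dim=1)[:, :k]
        sparse = patch_tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, d))
        return self.proj(sparse)  # (batch, k, llm_dim), fed to the LLM with the instruction text

# Usage: during instruction pretraining the ViT must produce features robust enough
# that the language model can still answer from only the retained tokens.
tokens = torch.randn(2, 196, 768)                 # placeholder ViT patch features
visual_inputs = SparseVisualProjector()(tokens)   # -> (2, 49, 4096)
```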
Related papers
- Supervised Fine-tuning in turn Improves Visual Foundation Models [74.1760864718129]
A two-stage method, ViSFT (Vision SFT), is proposed to unleash the fine-grained knowledge of vision foundation models.
A vision transformer with over 4.4B parameters shows improvements across various out-of-domain benchmarks.
arXiv Detail & Related papers (2024-01-18T18:58:54Z)
- Denoising Vision Transformers [43.03068202384091]
We propose a two-stage denoising approach, termed Denoising Vision Transformers (DVT)
In the first stage, we separate the clean features from those contaminated by positional artifacts by enforcing cross-view feature consistency with neural fields on a per-image basis.
In the second stage, we train a lightweight transformer block to predict clean features from raw ViT outputs, leveraging the derived estimates of the clean features as supervision.
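Purely as a hedged illustration of the second stage, the PyTorch sketch below trains a small transformer layer to map raw ViT outputs to the clean-feature estimates derived in stage one; the tensor shapes, single-layer denoiser, and plain MSE objective are assumptions for the sketch, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

dim = 768
# Lightweight denoiser: a single transformer encoder layer (assumed size).
denoiser = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
optimizer = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)

# Placeholders for (raw ViT outputs, clean-feature estimates from stage one).
raw_feats = torch.randn(4, 196, dim)
clean_feats = torch.randn(4, 196, dim)

pred = denoiser(raw_feats)                        # predict artifact-free features
loss = nn.functional.mse_loss(pred, clean_feats)  # supervise with the stage-one estimates
loss.backward()
optimizer.step()
```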
arXiv Detail & Related papers (2024-01-05T18:59:52Z)
- Position-guided Text Prompt for Vision-Language Pre-training [121.15494549650548]
We propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with Vision-Language Pre-Training.
PTP reformulates the visual grounding task into a fill-in-the-blank problem: given a PTP, the model is encouraged to predict the objects in the given blocks or to regress the blocks of a given object.
PTP achieves results comparable to object-detector-based methods with much faster inference, since PTP discards its object detector at inference time while the latter cannot.
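As a toy illustration of the fill-in-the-blank formulation (not the paper's exact template), a prompt can be built from per-block object annotations as follows; the prompt wording, the [MASK] token, and the helper name are assumptions.

```python
import random

def make_ptp_prompt(block_objects: dict[int, str]) -> tuple[str, str]:
    """Turn (block -> object) annotations into a fill-in-the-blank prompt and its answer.
    Randomly masks either the block index or the object name."""
    block, obj = random.choice(list(block_objects.items()))
    if random.random() < 0.5:
        return f"The block [MASK] has a {obj}.", str(block)  # predict the block
    return f"The block {block} has a [MASK].", obj           # predict the object

# Example: objects found in a 3x3 grid of image blocks by an off-the-shelf detector.
prompt, answer = make_ptp_prompt({0: "dog", 4: "frisbee", 8: "tree"})
print(prompt, "->", answer)
```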
arXiv Detail & Related papers (2022-12-19T18:55:43Z)
- What do Vision Transformers Learn? A Visual Exploration [68.50771218442776]
Vision transformers (ViTs) are quickly becoming the de-facto architecture for computer vision.
This paper addresses the obstacles to performing visualizations on ViTs and explores the underlying differences between ViTs and CNNs.
We also conduct large-scale visualizations on a range of ViT variants, including DeiT, CoaT, ConViT, PiT, Swin, and Twin.
arXiv Detail & Related papers (2022-12-13T16:55:12Z)
- Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment [40.677139679304936]
We propose a new framework, dubbed ViCHA, that efficiently exploits the input data to boost learning by: (a) a new hierarchical cross-modal alignment loss, (b) a new self-supervised scheme based on masked image modeling, and (c) leveraging image-level annotations.
Although pretrained on four times less data, our ViCHA strategy outperforms other approaches on several downstream tasks such as Image-Text Retrieval, VQA, Visual Reasoning, Visual Entailment and Visual Grounding.
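As a hedged sketch of point (a), a hierarchical alignment objective could sum an InfoNCE-style contrastive loss over image and text features pooled from several encoder layers; the layer count, pooling, and temperature below are illustrative assumptions rather than the paper's settings.

```python
import torch
import torch.nn.functional as F

def info_nce(img: torch.Tensor, txt: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between L2-normalized image and text embeddings of shape (batch, dim)."""
    img, txt = F.normalize(img, dim=-1), F.normalize(txt, dim=-1)
    logits = img @ txt.t() / temperature
    labels = torch.arange(img.size(0))
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

# Hierarchical alignment: sum the contrastive loss over pooled features from several layers.
image_layers = [torch.randn(8, 256) for _ in range(3)]  # placeholder pooled image features per layer
text_layers = [torch.randn(8, 256) for _ in range(3)]   # placeholder pooled text features per layer
loss = sum(info_nce(i, t) for i, t in zip(image_layers, text_layers))
```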
arXiv Detail & Related papers (2022-08-29T14:24:08Z)
- Self-Distilled Vision Transformer for Domain Generalization [58.76055100157651]
Vision transformers (ViTs) are challenging the supremacy of CNNs on standard benchmarks.
We propose a simple domain generalization (DG) approach for ViTs, coined self-distillation for ViTs.
We empirically demonstrate notable performance gains with different DG baselines and various ViT backbones in five challenging datasets.
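As a minimal, assumption-laden sketch of the self-distillation idea, intermediate-block predictions can be pulled toward the soft targets of the final block with a temperature-scaled KL term; the temperature, block selection, and shared classifier head are illustrative, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def self_distillation_loss(final_logits: torch.Tensor,
                           intermediate_logits: list[torch.Tensor],
                           temperature: float = 3.0) -> torch.Tensor:
    """Distill the final block's predictions into intermediate-block predictions via KL divergence."""
    teacher = F.softmax(final_logits.detach() / temperature, dim=-1)
    loss = final_logits.new_zeros(())
    for logits in intermediate_logits:
        student = F.log_softmax(logits / temperature, dim=-1)
        loss = loss + F.kl_div(student, teacher, reduction="batchmean") * temperature ** 2
    return loss / len(intermediate_logits)

# Placeholder logits: final block plus two earlier blocks' class tokens
# passed through a shared classifier head (assumed setup).
loss = self_distillation_loss(torch.randn(8, 10), [torch.randn(8, 10), torch.randn(8, 10)])
```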
arXiv Detail & Related papers (2022-07-25T17:57:05Z)
- LocVTP: Video-Text Pre-training for Temporal Localization [71.74284893790092]
Video-Text Pre-training aims to learn transferable representations for various downstream tasks from large-scale web videos.
In this paper, we experimentally analyze and demonstrate the incompatibility of current VTP methods with localization tasks.
We propose a novel localization-oriented Video-Text Pre-training framework, dubbed LocVTP.
arXiv Detail & Related papers (2022-07-21T08:43:51Z)
- Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work [1.6317061277457001]
Vision Transformers (ViTs) are becoming a more popular and dominant technique for various vision tasks compared to Convolutional Neural Networks (CNNs).
As a demanding technique in computer vision, ViTs have successfully solved various vision problems while focusing on long-range relationships.
We thoroughly compare the performance of various ViT algorithms and the most representative CNN methods on popular benchmark datasets.
arXiv Detail & Related papers (2022-03-03T06:17:03Z)