Prompt-based Adaptation in Large-scale Vision Models: A Survey
- URL: http://arxiv.org/abs/2510.13219v1
- Date: Wed, 15 Oct 2025 07:14:50 GMT
- Title: Prompt-based Adaptation in Large-scale Vision Models: A Survey
- Authors: Xi Xiao, Yunbei Zhang, Lin Zhao, Yiyang Liu, Xiaoying Liao, Zheda Mai, Xingjian Li, Xiao Wang, Hao Xu, Jihun Hamm, Xue Lin, Min Xu, Qifan Wang, Tianyang Wang, Cheng Han
- Abstract summary: Visual Prompting (VP) and Visual Prompt Tuning (VPT) have emerged as lightweight alternatives to full fine-tuning for adapting large-scale vision models. We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts. We examine PA's integrations across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks.
- Score: 62.09307869247613
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In computer vision, Visual Prompting (VP) and Visual Prompt Tuning (VPT) have recently emerged as lightweight and effective alternatives to full fine-tuning for adapting large-scale vision models within the ``pretrain-then-finetune'' paradigm. However, despite rapid progress, their conceptual boundaries remain blurred, as VP and VPT are frequently used interchangeably in current research, reflecting a lack of systematic distinction between these techniques and their respective applications. In this survey, we revisit the designs of VP and VPT from first principles, and conceptualize them within a unified framework termed Prompt-based Adaptation (PA). We provide a taxonomy that categorizes existing methods into learnable, generative, and non-learnable prompts, and further organizes them by injection granularity -- pixel-level and token-level. Beyond the core methodologies, we examine PA's integration across diverse domains, including medical imaging, 3D point clouds, and vision-language tasks, as well as its role in test-time adaptation and trustworthy AI. We also summarize current benchmarks and identify key challenges and future directions. To the best of our knowledge, this is the first comprehensive survey dedicated to PA's methodologies and applications in light of their distinct characteristics. Our survey aims to provide a clear roadmap for researchers and practitioners across all areas to understand and explore the evolving landscape of PA-related research.
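The pixel-level versus token-level distinction at the heart of the survey's taxonomy is easy to make concrete in code. The following PyTorch sketch is illustrative only (the module and parameter names are ours, not the survey's or any specific paper's): a VP-style learnable perturbation added directly to the input image, and a VPT-style set of learnable tokens prepended to the patch embeddings of a frozen backbone, so that only the prompts and a lightweight head are trained.

```python
import torch
import torch.nn as nn


class PixelPrompt(nn.Module):
    """VP-style pixel-level prompt: a learnable perturbation added to the input."""

    def __init__(self, image_size: int = 224, channels: int = 3):
        super().__init__()
        self.delta = nn.Parameter(torch.zeros(1, channels, image_size, image_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.delta  # broadcasts over the batch dimension


class TokenPromptedEncoder(nn.Module):
    """VPT-style token-level prompting around a frozen transformer encoder."""

    def __init__(self, backbone: nn.Module, embed_dim: int,
                 num_prompts: int = 10, num_classes: int = 100, patch: int = 16):
        super().__init__()
        # In practice the patch embedding comes from the pretrained model and
        # is frozen too; it is instantiated here only to keep the sketch
        # self-contained.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch, stride=patch)
        self.backbone = backbone
        for p in self.backbone.parameters():  # freeze the pretrained weights
            p.requires_grad_(False)
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, embed_dim) * 0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)  # (B, N, D)
        prompts = self.prompts.expand(x.size(0), -1, -1)         # (B, P, D)
        tokens = torch.cat([prompts, tokens], dim=1)             # prepend prompts
        feats = self.backbone(tokens)                            # frozen forward pass
        return self.head(feats.mean(dim=1))                      # pooled classification


# Usage sketch: any encoder operating on (batch, sequence, dim) tokens works here.
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=192, nhead=3, batch_first=True),
    num_layers=4,
)
model = TokenPromptedEncoder(encoder, embed_dim=192)
images = PixelPrompt()(torch.randn(2, 3, 224, 224))  # optional pixel-level prompt
logits = model(images)                               # -> (2, 100)
```

The sketch shows the shallow variant, where prompt tokens enter only at the input; deep variants of VPT instead insert fresh prompt tokens at every transformer layer.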
Related papers
- A Survey on Video Anomaly Detection via Deep Learning: Human, Vehicle, and Environment [2.3349787245442966]
Video Anomaly Detection (VAD) has emerged as a pivotal task in computer vision, with broad relevance across multiple fields. Recent advances in deep learning have driven significant progress in this area, yet the field remains fragmented across domains and learning paradigms. This survey offers a comprehensive perspective on VAD, systematically organizing the literature across various supervision levels.
arXiv Detail & Related papers (2025-08-19T18:50:49Z) - A Comprehensive Survey on Video Scene Parsing: Advances, Challenges, and Prospects [53.15503034595476]
Video Scene Parsing (VSP) has emerged as a cornerstone in computer vision, facilitating the simultaneous segmentation, recognition, and tracking of diverse visual entities in dynamic scenes.
arXiv Detail & Related papers (2025-06-16T14:39:03Z) - AceVFI: A Comprehensive Survey of Advances in Video Frame Interpolation [8.563354084119062]
Video Frame Interpolation (VFI) is a fundamental Low-Level Vision (LLV) task that synthesizes intermediate frames between existing ones. We introduce AceVFI, the most comprehensive survey on VFI to date, covering more than 250 papers across these approaches. We categorize the learning paradigms of VFI methods into Center-Time Frame Interpolation (CTFI) and Arbitrary-Time Frame Interpolation (ATFI).
arXiv Detail & Related papers (2025-06-01T16:01:24Z) - An Empirical Study of Federated Prompt Learning for Vision Language Model [89.2963764404892]
This paper systematically investigates the behavioral differences between vision prompt learning and language prompt learning in Vision-Language Models (VLMs). We evaluate the impact of various federated learning (FL) and prompt configurations, such as client scale, aggregation strategies, and prompt length, to assess the robustness of Federated Prompt Learning (FPL).
arXiv Detail & Related papers (2025-05-29T03:09:15Z) - Place Recognition Meets Multiple Modalities: A Comprehensive Review, Current Challenges and Future Directions [2.4775350526606355]
We review recent advancements in place recognition, emphasizing three methodological paradigms: CNN-based approaches, Transformer-based frameworks, and cross-modal strategies. We identify current research challenges and outline prospective directions, including domain adaptation, real-time performance, and lifelong learning, to inspire future advancements in this domain.
arXiv Detail & Related papers (2025-05-20T08:16:37Z) - Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts [39.92376420375139]
We propose Interpretable Visual Prompt Tuning (IVPT), the first framework to explore interpretability for visual prompts. Visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes. IVPT aggregates features from the corresponding concept regions to generate interpretable prompts, which are structured hierarchically to explain visual prompts at different granularities.
arXiv Detail & Related papers (2025-03-08T06:12:50Z) - Adversarial Prompt Distillation for Vision-Language Models [61.39214202062028]
Adversarial Prompt Tuning (APT) applies adversarial training during the process of prompt tuning. Adversarial Prompt Distillation (APD) is a bimodal knowledge distillation framework that enhances APT by integrating it with multi-modal knowledge transfer. Extensive experiments on multiple benchmark datasets demonstrate the superiority of APD over current state-of-the-art APT methods.
arXiv Detail & Related papers (2024-11-22T03:02:13Z) - Exploring the Effectiveness of Object-Centric Representations in Visual Question Answering: Comparative Insights with Foundation Models [24.579822095003685]
We conduct an empirical study on representation learning for downstream Visual Question Answering (VQA). We thoroughly investigate the benefits and trade-offs of object-centric (OC) models and alternative approaches. We identify a promising path to leverage the strengths of both paradigms.
arXiv Detail & Related papers (2024-07-22T12:26:08Z) - Anomaly Detection by Adapting a pre-trained Vision Language Model [48.225404732089515]
We present a unified framework named CLIP-ADA for Anomaly Detection by Adapting a pre-trained CLIP model.
We introduce the learnable prompt and propose to associate it with abnormal patterns through self-supervised learning.
We achieve state-of-the-art results of 97.5/55.6 on MVTec-AD and 89.3/33.1 on VisA for anomaly detection and localization, respectively.
arXiv Detail & Related papers (2024-03-14T15:35:07Z) - Delving into Multimodal Prompting for Fine-grained Visual Classification [57.12570556836394]
Fine-grained visual classification (FGVC) involves categorizing fine subdivisions within a broader category.
Recent advancements in pre-trained vision-language models have demonstrated remarkable performance in various high-level vision tasks.
We propose a novel multimodal prompting solution, denoted as MP-FGVC, based on the Contrastive Language-Image Pre-training (CLIP) model.
arXiv Detail & Related papers (2023-09-16T07:30:52Z) - A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future [6.4105103117533755]
A taxonomy is first developed to organize different tasks and methodologies.
The proposed taxonomy is universal across different tasks, covering object detection, semantic/instance/panoptic segmentation, 3D and video understanding.
arXiv Detail & Related papers (2023-07-18T12:52:49Z) - ViDA: Homeostatic Visual Domain Adapter for Continual Test Time Adaptation [48.039156140237615]
A Continual Test-Time Adaptation task is proposed to adapt the pre-trained model to continually changing target domains.
We design a Visual Domain Adapter (ViDA) for CTTA, explicitly handling both domain-specific and domain-shared knowledge.
Our proposed method achieves state-of-the-art performance in both classification and segmentation CTTA tasks.
arXiv Detail & Related papers (2023-06-07T11:18:53Z)