SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
- URL: http://arxiv.org/abs/2504.17395v1
- Date: Thu, 24 Apr 2025 09:31:08 GMT
- Title: SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting
- Authors: Yiming Zhao, Guorong Li, Laiyun Qing, Amin Beheshti, Jian Yang, Michael Sheng, Yuankai Qi, Qingming Huang
- Abstract summary: We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-world object counting leverages the robust text-image alignment of pre-trained vision-language models (VLMs) to count arbitrary categories in images specified by textual queries. However, widely adopted naive fine-tuning strategies concentrate exclusively on text-image consistency for categories seen during training, which limits generalizability to unseen categories. In this work, we propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories with minimal overhead in parameters and inference time. First, we introduce a two-stage visual prompt learning strategy composed of Category-Specific Prompt Initialization (CSPI) and Topology-Guided Prompt Refinement (TGPR): CSPI generates category-specific visual prompts, and TGPR then distills latent structural patterns from the VLM's text encoder to refine them. During inference, we dynamically synthesize visual prompts for unseen categories based on the semantic correlation between unseen and training categories, facilitating robust text-image alignment for unseen categories. Extensive experiments integrating SDVPT with all available open-world object counting models demonstrate its effectiveness and adaptability across three widely used datasets: FSC-147, CARPK, and PUCPR+.
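As a concrete illustration of the inference-time synthesis step described above, here is a minimal PyTorch sketch (not the authors' released code): it assumes the visual prompts are per-category tensors learned during training, that semantic correlation is measured by cosine similarity between CLIP text embeddings, and that synthesis is a softmax-weighted combination over training categories. The prompt shape, text template, and temperature are illustrative assumptions.

```python
# Minimal sketch of SDVPT-style inference-time prompt synthesis.
# Assumptions (not from the paper's code): prompts are per-category tensors
# learned during training; similarity uses CLIP text embeddings; synthesis
# is a softmax-weighted convex combination over training categories.
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

train_categories = ["apple", "bird", "car"]                  # seen during training
train_prompts = torch.randn(len(train_categories), 8, 512)   # learned visual prompts (toy values)

@torch.no_grad()
def text_embed(names):
    tokens = clip.tokenize([f"a photo of a {n}" for n in names]).to(device)
    return F.normalize(model.encode_text(tokens).float(), dim=-1)

@torch.no_grad()
def synthesize_prompt(unseen_category, temperature=0.05):
    """Weight training-category prompts by semantic similarity to the unseen category."""
    seen_emb = text_embed(train_categories)    # (C, D)
    query_emb = text_embed([unseen_category])  # (1, D)
    weights = F.softmax(query_emb @ seen_emb.t() / temperature, dim=-1)  # (1, C)
    # Convex combination of the learned per-category prompt tensors.
    return torch.einsum("c,cld->ld", weights[0].cpu(), train_prompts)

prompt = synthesize_prompt("strawberry")  # visual prompt for an unseen category
print(prompt.shape)  # torch.Size([8, 512])
```

Because the unseen-category prompt is a fixed-weight combination of cached training prompts, synthesis needs only one extra text-encoder pass per query, which is consistent with the paper's claim of minimal inference-time overhead.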
Related papers
- Exploring Interpretability for Visual Prompt Tuning with Hierarchical Concepts
We propose the first framework, named Interpretable Visual Prompt Tuning (IVPT), to explore interpretability for visual prompts. Visual prompts are linked to human-understandable semantic concepts, represented as a set of category-agnostic prototypes. IVPT aggregates features from the image regions associated with these concepts to generate interpretable prompts, which are structured hierarchically to explain visual prompts at different granularities.
arXiv Detail & Related papers (2025-03-08T06:12:50Z) - Category-Adaptive Cross-Modal Semantic Refinement and Transfer for Open-Vocabulary Multi-Label Recognition
We propose a novel category-adaptive cross-modal semantic refinement and transfer (C²SRT) framework to explore semantic correlations within and across categories. The framework consists of two complementary modules: an intra-category semantic refinement (ISR) module and an inter-category semantic transfer (IST) module. Experiments on OV-MLR benchmarks clearly demonstrate that C²SRT outperforms current state-of-the-art algorithms.
arXiv Detail & Related papers (2024-12-09T04:00:18Z) - Category-Prompt Refined Feature Learning for Long-Tailed Multi-Label Image Classification
Category-Prompt Refined Feature Learning (CPRFL) is a novel approach for long-tailed multi-label image classification (LTMLC).
CPRFL initializes category prompts from pretrained CLIP embeddings and decouples category-specific visual representations; a sketch of this initialization pattern follows below.
Extensive experiments on two LTMLC benchmarks demonstrate the superiority of the method over baselines.
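A minimal sketch of the prompt-initialization pattern the summary describes, with illustrative names and dimensions; CPRFL's actual architecture (in particular its decoupling mechanism) is more involved:

```python
# Hypothetical sketch of CPRFL-style category-prompt initialization:
# category prompts start from frozen CLIP text embeddings and are mapped
# into the visual feature space by a learnable projection. Names and
# dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CategoryPrompts(nn.Module):
    def __init__(self, clip_text_embeds: torch.Tensor, visual_dim: int = 768):
        super().__init__()
        # Initialize from pretrained CLIP class embeddings (C, D_text), then fine-tune.
        self.prompts = nn.Parameter(clip_text_embeds.clone())
        self.proj = nn.Linear(clip_text_embeds.shape[1], visual_dim)

    def forward(self):
        # One prompt vector per category, projected into the visual space.
        return self.proj(self.prompts)  # (C, visual_dim)

clip_embeds = torch.randn(80, 512)  # stand-in for CLIP text embeddings of 80 classes
prompts = CategoryPrompts(clip_embeds)()
print(prompts.shape)  # torch.Size([80, 768])
```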
arXiv Detail & Related papers (2024-08-15T12:51:57Z) - Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z) - Open-Vocabulary Temporal Action Localization using Multimodal Guidance
OVTAL enables a model to recognize any desired action category in videos without the need to explicitly curate training data for all categories.
This flexibility poses significant challenges, as the model must recognize not only the action categories seen during training but also novel categories specified at inference.
We introduce OVFormer, a novel open-vocabulary framework extending ActionFormer with three key contributions.
arXiv Detail & Related papers (2024-06-21T18:00:05Z) - CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains.
We leverage large-scale VLMs such as CLIP, using textual features to dynamically adjust the classifier's weights and thereby explore generalizable visual features; the general pattern is sketched below.
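The summary describes the familiar CLIP-style pattern of using text features as dynamic classifier weights. A minimal sketch of that pattern, assuming standard cosine-similarity scoring (CFPL-FAS itself may differ in detail):

```python
# Sketch of the general pattern: normalized text features act as dynamic
# classifier weights over visual features (standard CLIP-style zero-shot
# scoring; illustrative only, not the CFPL-FAS implementation).
import torch
import torch.nn.functional as F

def dynamic_classify(visual_feats: torch.Tensor, text_feats: torch.Tensor, scale: float = 100.0):
    v = F.normalize(visual_feats, dim=-1)  # (N, D) image features
    w = F.normalize(text_feats, dim=-1)    # (K, D) text features as classifier weights
    return (scale * v @ w.t()).softmax(dim=-1)  # (N, K) class probabilities

probs = dynamic_classify(torch.randn(4, 512), torch.randn(2, 512))  # e.g., live vs. spoof
print(probs.shape)  # torch.Size([4, 2])
```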
arXiv Detail & Related papers (2024-03-21T11:58:50Z) - Compound Text-Guided Prompt Tuning via Image-Adaptive Cues
We propose Compound Text-Guided Prompt Tuning (TGP-T).
It significantly reduces resource demand while achieving superior performance.
It reduces GPU memory usage by 93% and attains a 2.5% performance gain on 16-shot ImageNet.
arXiv Detail & Related papers (2023-12-11T14:17:02Z) - TCP: Textual-based Class-aware Prompt tuning for Visual-Language Model
We present Textual-based Class-aware Prompt tuning (TCP), which explicitly incorporates prior knowledge about classes to enhance their discriminability.
TCP consistently achieves superior performance while demanding less training time.
arXiv Detail & Related papers (2023-11-30T03:59:23Z) - Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models
We propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models.
Our approach takes inspiration from human intelligence, in which external knowledge is usually incorporated when recognizing novel categories of objects.
arXiv Detail & Related papers (2023-08-22T04:24:45Z)