Related papers: AttriPrompt: Dynamic Prompt Composition Learning for CLIP

AttriPrompt: Dynamic Prompt Composition Learning for CLIP

URL: http://arxiv.org/abs/2509.05949v1
Date: Sun, 07 Sep 2025 07:07:59 GMT
Title: AttriPrompt: Dynamic Prompt Composition Learning for CLIP
Authors: Qiqi Zhan, Shiwei Li, Qingjie Liu, Yunhong Wang,
Abstract summary: AttriPrompt is a novel framework that enhances and refines textual semantic representations.<n>We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features.<n>Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
Score: 41.37140060183439
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The evolution of prompt learning methodologies has driven exploration of deeper prompt designs to enhance model performance. However, current deep text prompting approaches suffer from two critical limitations: Over-reliance on constrastive learning objectives that prioritize high-level semantic alignment, neglecting fine-grained feature optimization; Static prompts across all input categories, preventing content-aware adaptation. To address these limitations, we propose AttriPrompt-a novel framework that enhances and refines textual semantic representations by leveraging the intermediate-layer features of CLIP's vision encoder. We designed an Attribute Retrieval module that first clusters visual features from each layer. The aggregated visual features retrieve semantically similar prompts from a prompt pool, which are then concatenated to the input of every layer in the text encoder. Leveraging hierarchical visual information embedded in prompted text features, we introduce Dual-stream Contrastive Learning to realize fine-grained alignment. Furthermore, we introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features to prevent overfitting on limited training data. Extensive experiments across three benchmarks demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37\% improvement in the base-to-novel setting. The observed strength of our method in cross-domain knowledge transfer positions vision-language pre-trained models as more viable solutions for real-world implementation.

Related papers

Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning [0.9558392439655014]
We explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification.<n>We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features.<n>Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery.
arXiv Detail & Related papers (2025-10-28T11:39:22Z)
Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment.<n>We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment.<n>Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS)
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
Hierarchical Cross-modal Prompt Learning for Vision-Language Models [9.128564580725627]
HiCroPL is a Hierarchical Cross-modal Prompt Learning framework.<n>It routes knowledge flows by leveraging the complementary strengths of text and vision.<n>It achieves state-of-the-art results on 11 benchmarks with significant improvements.
arXiv Detail & Related papers (2025-07-20T14:18:04Z)
SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories.<n>During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
Advancing Prompt Learning through an External Layer [24.77977865016954]
We propose a paradigm called EnPrompt with a novel External Layer (EnLa) The learnable external layer is built upon valid embeddings of pre-trained CLIP. Four experiments demonstrate that our method outperforms the existing prompt learning method.
arXiv Detail & Related papers (2024-07-29T03:30:09Z)
Attend and Enrich: Enhanced Visual Prompt for Zero-Shot Learning [114.59476118365266]
We propose AENet, which endows semantic information into the visual prompt to distill semantic-enhanced prompt for visual representation enrichment.<n> AENet comprises two key steps: 1) exploring the concept-harmonized tokens for the visual and attribute modalities, grounded on the modal-sharing token that represents consistent visual-semantic concepts; and 2) yielding semantic-enhanced prompt via the visual residual refinement unit with attribute consistency supervision.
arXiv Detail & Related papers (2024-06-05T07:59:48Z)
CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing [66.6712018832575]
Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains. We make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features.
arXiv Detail & Related papers (2024-03-21T11:58:50Z)
DPL: Decoupled Prompt Learning for Vision-Language Models [41.90997623029582]
We propose a new method, Decoupled Prompt Learning, which reformulates the attention in prompt learning to alleviate this problem. Our approach is flexible for both visual and textual modalities, making it easily extendable to multi-modal prompt learning.
arXiv Detail & Related papers (2023-08-19T15:48:38Z)
CPL: Counterfactual Prompt Learning for Vision and Language Models [76.18024920393245]
This paper presents a novel underlinetextbfCounterfactual underlinetextbfPrompt underlinetextbfLearning (CPL) method for vision and language models. CPL simultaneously employs counterfactual generation and contrastive learning in a joint optimization framework. Experiments demonstrate that CPL can obtain superior few-shot performance on different vision and language tasks.
arXiv Detail & Related papers (2022-10-19T08:06:39Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.