Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models
- URL: http://arxiv.org/abs/2508.17417v1
- Date: Sun, 24 Aug 2025 15:45:22 GMT
- Title: Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models
- Authors: Xiaojie Yin, Qilong Wang, Qinghua Hu
- Abstract summary: Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
- Score: 57.357091028792325
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment due to domain gaps between pre-training and downstream tasks. Existing approaches primarily focus on text prompting with class-specific descriptions and visual-text adaptation via aligning cropped image regions with textual descriptions. However, they still face the issues of incomplete textual prompts and noisy visual prompts. In this paper, we propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment by constructing comprehensive textual prompts and compact visual prompts from the semantic perspective. Specifically, our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS). Textually, to address the issue of incomplete semantic expression in textual prompts, our TGSSG first generates a synonymous semantic set for each category via large language models, and then constructs comprehensive textual prompts based on semantic ambiguity entropy and persistent homology analysis. Visually, to mitigate the irrelevant visual noise introduced by random cropping, our CADRS identifies discriminative regions using activation maps produced by a pre-trained vision model, effectively filtering out noisy regions and generating compact visual prompts. Given the comprehensive set of textual prompts and the compact set of visual prompts, we introduce two set-to-set matching strategies based on test-time adaptation (TTA) and optimal transport (OT) to achieve effective visual-textual alignment, thereby improving the zero-shot generalization of VLMs.
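To make the final matching step more concrete, the following is a minimal, self-contained sketch (not the authors' code) of entropy-regularized optimal-transport matching between a compact set of visual prompt embeddings and a comprehensive set of textual prompt embeddings. The uniform marginals, the Sinkhorn solver, and all function names are illustrative assumptions.

```python
import numpy as np

def sinkhorn(cost, eps=0.05, n_iter=100):
    """Entropy-regularized OT with uniform marginals (Sinkhorn iterations).

    cost: (m, n) cost matrix between m visual prompts and n textual prompts.
    Returns the transport plan of shape (m, n).
    """
    m, n = cost.shape
    mu = np.full(m, 1.0 / m)          # uniform mass over visual prompts
    nu = np.full(n, 1.0 / n)          # uniform mass over textual prompts
    K = np.exp(-cost / eps)           # Gibbs kernel
    u = np.ones(m)
    for _ in range(n_iter):
        v = nu / (K.T @ u)
        u = mu / (K @ v)
    return u[:, None] * K * v[None, :]

def ot_class_score(visual_feats, text_feats):
    """Set-to-set alignment score: negative OT cost between two embedding sets.

    visual_feats: (m, d) L2-normalized embeddings of compact visual prompts.
    text_feats:   (n, d) L2-normalized embeddings of synonymous textual prompts.
    """
    cost = 1.0 - visual_feats @ text_feats.T     # cosine distance
    plan = sinkhorn(cost)
    return -np.sum(plan * cost)                  # higher = better aligned

# Toy usage: pick the class whose textual prompt set best matches the image's
# visual prompt set under optimal transport.
rng = np.random.default_rng(0)
visual = rng.normal(size=(4, 512))
visual /= np.linalg.norm(visual, axis=1, keepdims=True)
class_prompts = {c: rng.normal(size=(6, 512)) for c in ["cat", "dog"]}
scores = {}
for c, T in class_prompts.items():
    T = T / np.linalg.norm(T, axis=1, keepdims=True)
    scores[c] = ot_class_score(visual, T)
pred = max(scores, key=scores.get)
```

In this sketch, classification simply picks the class whose prompt set yields the lowest transport cost; the abstract also mentions a test-time-adaptation (TTA) variant, which is not reproduced here.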
Related papers
- AttriPrompt: Dynamic Prompt Composition Learning for CLIP [41.37140060183439]
AttriPrompt is a novel framework that enhances and refines textual semantic representations. We introduce a Self-Regularization mechanism by applying explicit regularization constraints between the prompted and non-prompted text features. Experiments demonstrate AttriPrompt's superiority over state-of-the-art methods, achieving up to 7.37% improvement in the base-to-novel setting.
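As a rough illustration only (the summary above does not specify the loss form), a self-regularization constraint between prompted and non-prompted text features might look like the cosine-distance penalty below; the function name and choice of penalty are assumptions, not AttriPrompt's actual objective.

```python
import torch.nn.functional as F

def self_regularization_loss(prompted_feats, frozen_feats):
    """Penalize drift of prompted text features away from non-prompted ones.

    prompted_feats: (num_classes, d) text features from learned prompts.
    frozen_feats:   (num_classes, d) text features from hand-crafted prompts
                    (e.g. "a photo of a {class}"), kept fixed.
    """
    p = F.normalize(prompted_feats, dim=-1)
    f = F.normalize(frozen_feats, dim=-1)
    return (1.0 - (p * f).sum(dim=-1)).mean()   # cosine-distance penalty
```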
arXiv Detail & Related papers (2025-09-07T07:07:59Z)
- Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment [33.152772648399846]
We propose a novel approach to enrich semantic representations in vision-language contrastive learning. We leverage a pretrained LLM as the text encoder within the CLIP framework, processing all prompts jointly in a single forward pass. The resulting prompt embeddings are combined into a unified text representation, enabling semantically richer alignment with visual features.
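A hedged sketch of the combination step described above, assuming simple mean pooling of the per-prompt embeddings (the actual combiner may differ):

```python
import torch
import torch.nn.functional as F

def unified_text_embedding(prompt_embeddings):
    """Combine per-prompt embeddings (n_prompts, d) into one class representation."""
    pooled = prompt_embeddings.mean(dim=0)   # mean pooling is an assumption here
    return F.normalize(pooled, dim=-1)

def clip_style_logits(image_feats, class_text_feats, temperature=0.01):
    """Cosine-similarity logits between images (b, d) and classes (c, d)."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(class_text_feats, dim=-1)
    return img @ txt.T / temperature
```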
arXiv Detail & Related papers (2025-08-03T20:48:43Z)
- Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization [75.88719716002014]
Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Recent advances in pre-trained Visual Foundation Models (VFMs) have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. We propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM.
arXiv Detail & Related papers (2025-07-03T03:52:37Z)
- SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
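One plausible (assumed) reading of this dynamic synthesis step is a similarity-weighted mixture of the visual prompts learned for the training categories; the softmax weighting and names below are illustrative, not SDVPT's actual formulation.

```python
import torch
import torch.nn.functional as F

def synthesize_visual_prompt(unseen_text_emb, train_text_embs, train_prompts, tau=0.1):
    """Build a visual prompt for an unseen class from training-class prompts.

    unseen_text_emb: (d,) text embedding of the unseen category name.
    train_text_embs: (k, d) text embeddings of the k training categories.
    train_prompts:   (k, p, d_v) visual prompt tokens learned per training class.
    """
    sims = F.normalize(train_text_embs, dim=-1) @ F.normalize(unseen_text_emb, dim=-1)
    weights = torch.softmax(sims / tau, dim=0)            # (k,) semantic correlation
    return torch.einsum("k,kpd->pd", weights, train_prompts)
```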
arXiv Detail & Related papers (2025-04-24T09:31:08Z)
- Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation [14.82606425343802]
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. Existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. We propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information.
arXiv Detail & Related papers (2024-12-26T02:12:37Z)
- VGSG: Vision-Guided Semantic-Group Network for Text-based Person Search [51.9899504535878]
We propose a Vision-Guided Semantic-Group Network (VGSG) for text-based person search.
In VGSG, a vision-guided attention is employed to extract visual-related textual features.
With the help of relational knowledge transfer, VGKT is capable of aligning semantic-group textual features with corresponding visual features.
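As a hedged, single-head illustration of what vision-guided attention over text features could look like (the paper's actual module is likely multi-head and learned), one might pool token-level text features with the global visual feature as the query:

```python
import torch

def vision_guided_attention(visual_feat, text_tokens, temperature=1.0):
    """Extract visual-related textual features via attention.

    visual_feat: (d,) global visual feature used as the query.
    text_tokens: (t, d) token-level text features serving as keys/values.
    Returns a (d,) textual feature weighted toward visually relevant tokens.
    """
    attn = torch.softmax(text_tokens @ visual_feat / temperature, dim=0)  # (t,)
    return attn @ text_tokens                                             # (d,)
```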
arXiv Detail & Related papers (2023-11-13T17:56:54Z)
- Direction-Oriented Visual-semantic Embedding Model for Remote Sensing Image-text Retrieval [7.118271398274512]
We propose a novel Direction-Oriented Visual-semantic Embedding Model (DOVE) to mine the relationship between vision and language.
Our key idea is to direct visual and textual representations in latent space as close as possible to a redundancy-free regional visual representation.
We exploit a global visual-semantic constraint to reduce single visual dependency and serve as an external constraint for the final visual and textual representations.
arXiv Detail & Related papers (2023-10-12T12:28:47Z)
- TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture.
TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling.
It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aLigned visiOn-langUage PrE-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)