UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
- URL: http://arxiv.org/abs/2412.10680v1
- Date: Sat, 14 Dec 2024 04:59:38 GMT
- Title: UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
- Authors: Haoyu Jiang, Zhi-Qi Cheng, Gabriel Moreira, Jiawen Zhu, Jingdong Sun, Bukun Ren, Jun-Yan He, Qi Dai, Xian-Sheng Hua
- Abstract summary: Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen domains and classes without semantic labels. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation.
- Score: 36.64936080387957
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Universal Cross-Domain Retrieval (UCDR) retrieves relevant images from unseen domains and classes without semantic labels, ensuring robust generalization. Existing methods commonly employ prompt tuning with pre-trained vision-language models but are inherently limited by static prompts, reducing adaptability. We propose UCDR-Adapter, which enhances pre-trained models with adapters and dynamic prompt generation through a two-phase training strategy. First, Source Adapter Learning integrates class semantics with domain-specific visual knowledge using a Learnable Textual Semantic Template and optimizes Class and Domain Prompts via momentum updates and dual loss functions for robust alignment. Second, Target Prompt Generation creates dynamic prompts by attending to masked source prompts, enabling seamless adaptation to unseen domains and classes. Unlike prior approaches, UCDR-Adapter dynamically adapts to evolving data distributions, enhancing both flexibility and generalization. During inference, only the image branch and generated prompts are used, eliminating reliance on textual inputs for highly efficient retrieval. Extensive benchmark experiments show that UCDR-Adapter outperforms ProS and other state-of-the-art methods in most cases across the UCDR, U(c)CDR, and U(d)CDR settings.
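The second phase described above (generating dynamic prompts by attending over masked source prompts) is the technical core of the method. Below is a minimal PyTorch sketch of that idea; all shapes, the masking scheme, and the module name `TargetPromptGenerator` are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TargetPromptGenerator(nn.Module):
    """Illustrative sketch (not the paper's code): produce dynamic prompts for an
    unseen sample by attending over a bank of learned, partially masked source
    (class/domain) prompts, conditioned on the image feature."""

    def __init__(self, dim=512, num_source_prompts=64, num_target_prompts=4, mask_ratio=0.5):
        super().__init__()
        # Source prompts assumed to come from the source-adapter learning phase.
        self.source_prompts = nn.Parameter(torch.randn(num_source_prompts, dim) * 0.02)
        # Learnable queries that become the per-sample target prompts.
        self.queries = nn.Parameter(torch.randn(num_target_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.img_proj = nn.Linear(dim, dim)
        self.mask_ratio = mask_ratio

    def forward(self, img_feat):                       # img_feat: (B, dim) from the image encoder
        B, D = img_feat.shape
        P = self.source_prompts.size(0)
        # Randomly mask a subset of source prompts (assumed masking strategy).
        keep = torch.rand(B, P, device=img_feat.device) > self.mask_ratio
        kv = self.source_prompts.unsqueeze(0).expand(B, P, D)
        # Condition the queries on the current image feature.
        q = self.queries.unsqueeze(0).expand(B, -1, D) + self.img_proj(img_feat).unsqueeze(1)
        out, _ = self.attn(q, kv, kv, key_padding_mask=~keep)
        return out                                     # (B, num_target_prompts, dim) dynamic prompts
```

At retrieval time, only such generated prompts and the image branch would be needed, which matches the text-free inference described in the abstract.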
Related papers
- CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval [22.01591564940522]
We introduce a domain-agnostic pretraining framework based on Cross-modality Adaptive Meta-Learning (CAMeL) to enhance the model generalization capability.
In particular, we develop a series of tasks that reflect the diversity and complexity of real-world scenarios.
Our proposed model not only surpasses existing state-of-the-art methods on real-world benchmarks, but also showcases robustness and scalability.
arXiv Detail & Related papers (2025-04-26T03:26:30Z) - SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories.
During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z) - Semantic Library Adaptation: LoRA Retrieval and Fusion for Open-Vocabulary Semantic Segmentation [72.28364940168092]
Open-vocabulary semantic segmentation models associate vision and text to label pixels from an undefined set of classes using textual queries.
We introduce Semantic Library Adaptation (SemLA), a novel framework for training-free, test-time domain adaptation.
arXiv Detail & Related papers (2025-03-27T17:59:58Z) - QueryAdapter: Rapid Adaptation of Vision-Language Models in Response to Natural Language Queries [2.306164598536725]
We present a novel framework for rapidly adapting a pre-trained VLM to respond to a natural language query.
We use unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query.
We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation.
arXiv Detail & Related papers (2025-02-26T01:07:28Z) - ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval [123.51277978744677]
We propose Prompting-to-Simulate (ProS) to apply prompt tuning for Universal Cross-Domain Retrieval (UCDR).
ProS employs a two-step process to simulate Content-aware Dynamic Prompts (CaDP), which steer the model to produce generalized features for UCDR.
Our method achieves new state-of-the-art performance without introducing excessive parameters.
arXiv Detail & Related papers (2023-12-19T14:39:11Z) - Learning Domain Invariant Prompt for Vision-Language Models [31.581652862478965]
We propose a novel prompt learning paradigm, called MetaPrompt, that directly generates a domain-invariant prompt that can be generalized to unseen domains.
Our method consistently and significantly outperforms existing methods.
arXiv Detail & Related papers (2022-12-08T11:23:24Z) - PØDA: Prompt-driven Zero-shot Domain Adaptation [27.524962843495366]
We adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt.
We show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation.
arXiv Detail & Related papers (2022-12-06T18:59:58Z) - Refign: Align and Refine for Adaptation of Semantic Segmentation to Adverse Conditions [78.71745819446176]
Refign is a generic extension to self-training-based UDA methods which leverages cross-domain correspondences.
Refign consists of two steps: (1) aligning the normal-condition image to the corresponding adverse-condition image using an uncertainty-aware dense matching network, and (2) refining the adverse prediction with the normal prediction using an adaptive label correction mechanism.
The approach introduces no extra training parameters and only minimal computational overhead (during training only), and can be used as a drop-in extension to improve any given self-training-based UDA method.
arXiv Detail & Related papers (2022-07-14T11:30:38Z) - CLIP-Adapter: Better Vision-Language Models with Feature Adapters [79.52844563138493]
We show that there is an alternative path to achieve better vision-language models other than prompt tuning.
In this paper, we propose CLIP-Adapter to conduct fine-tuning with feature adapters on either the visual or language branch (see the sketch after this list).
Experiments and extensive ablation studies on various visual classification tasks demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2021-10-09T11:39:30Z)
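For context on the adapter idea that UCDR-Adapter builds on, here is a rough PyTorch sketch of a CLIP-Adapter-style feature adapter: a small bottleneck MLP over frozen CLIP features, blended back with the original feature through a residual ratio. The hidden-size reduction and ratio value below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureAdapter(nn.Module):
    """Sketch of a CLIP-Adapter-style feature adapter: a bottleneck MLP applied to
    frozen CLIP features, mixed with the original feature via a residual ratio.
    Only the adapter parameters are trained; the backbone stays frozen."""

    def __init__(self, dim=512, reduction=4, ratio=0.2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim),
            nn.ReLU(inplace=True),
        )
        self.ratio = ratio

    def forward(self, x):
        # Blend adapted and original features.
        return self.ratio * self.fc(x) + (1.0 - self.ratio) * x


# Usage: adapt frozen image features before computing image-text similarity.
adapter = FeatureAdapter(dim=512)
image_feat = torch.randn(8, 512)            # stand-in for frozen CLIP image features
adapted = adapter(image_feat)
adapted = adapted / adapted.norm(dim=-1, keepdim=True)
```

The residual blend lets the adapted model fall back on the pre-trained representation when little task-specific signal is available, which is the design choice UCDR-Adapter-style methods also rely on when keeping the backbone frozen.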