Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation
- URL: http://arxiv.org/abs/2306.16658v1
- Date: Thu, 29 Jun 2023 03:39:35 GMT
- Title: Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation
- Authors: Jiaxing Huang, Jingyi Zhang, Han Qiu, Sheng Jin, Shijian Lu
- Abstract summary: We study open-vocabulary domain adaptation (OVDA), a new unsupervised domain adaptation framework.
We design a Prompt Ensemble Self-training (PEST) technique that exploits the synergy between vision and language.
PEST outperforms the state-of-the-art consistently across 10 image recognition tasks.
- Score: 45.02052030837188
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Traditional domain adaptation assumes the same vocabulary across source and
target domains, which limits transfer flexibility and efficiency when handling
target domains with different vocabularies. Inspired
by recent vision-language models (VLMs) that enable open-vocabulary visual
recognition by reasoning on both images and texts, we study open-vocabulary
domain adaptation (OVDA), a new unsupervised domain adaptation framework that
positions a pre-trained VLM as the source model and transfers it towards
arbitrary unlabelled target domains. To this end, we design a Prompt Ensemble
Self-training (PEST) technique that exploits the synergy between vision and
language to mitigate the domain discrepancies in image and text distributions
simultaneously. Specifically, PEST makes use of the complementary property of
multiple prompts within and across vision and language modalities, which
enables joint exploitation of vision and language information and effective
learning of image-text correspondences in the unlabelled target domains.
Additionally, PEST captures temporal information via temporal prompt ensemble
which helps memorize previously learnt target information. Extensive
experiments show that PEST outperforms the state-of-the-art consistently across
10 image recognition tasks.
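The abstract describes three ingredients: a language-side ensemble of multiple text prompts, a vision-side ensemble over the image modality, and a temporal prompt ensemble that memorizes previously learnt target information. The sketch below is a minimal, hedged illustration of how such components are commonly wired together for self-training on unlabelled target data; it assumes a CLIP-like model exposing `encode_image`/`encode_text` and a tokenizer returning token-id tensors, and all function names, templates, thresholds, and momentum values are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of prompt-ensemble self-training with a CLIP-like
# vision-language model.  Assumes `model.encode_image` / `model.encode_text`
# and a `tokenizer` mapping strings to token-id tensors; templates, the
# confidence threshold, and the EMA momentum are made-up defaults.
import torch
import torch.nn.functional as F

# Language-side ensemble: several prompt templates per class name.
TEMPLATES = [
    "a photo of a {}.",
    "a blurry photo of a {}.",
    "a sketch of a {}.",
]


def class_text_embeddings(model, tokenizer, classnames, device):
    """Average the embeddings of multiple text prompts for each class."""
    embeds = []
    with torch.no_grad():
        for name in classnames:
            tokens = tokenizer([t.format(name) for t in TEMPLATES]).to(device)
            feats = F.normalize(model.encode_text(tokens), dim=-1)
            embeds.append(F.normalize(feats.mean(dim=0), dim=-1))
    return torch.stack(embeds)  # [num_classes, embed_dim]


def pseudo_labels(model, images, text_embeds, views, threshold=0.9):
    """Vision-side ensemble: average zero-shot predictions over augmented
    views of each unlabelled target image; keep only confident labels."""
    with torch.no_grad():
        probs = 0.0
        for augment in views:  # list of image-augmentation callables
            feats = F.normalize(model.encode_image(augment(images)), dim=-1)
            probs = probs + (100.0 * feats @ text_embeds.T).softmax(dim=-1)
        probs = probs / len(views)
    confidence, labels = probs.max(dim=-1)
    mask = confidence > threshold
    return labels, mask  # train on (images[mask], labels[mask])


def temporal_prompt_ensemble(ema_prompts, prompts, momentum=0.999):
    """Temporal ensemble: exponential moving average over learnable prompt
    parameters, retaining previously learnt target information."""
    with torch.no_grad():
        ema_prompts.mul_(momentum).add_(prompts, alpha=1.0 - momentum)
```

In a full self-training loop one would periodically refresh the pseudo-labels and update learnable prompt tokens on the confident subset; averaging in probability space rather than feature space is one common way to combine multiple prompts, not necessarily the paper's exact choice.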
Related papers
- WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation.
We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove domain-specific counterpart.
We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
- Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation [27.695825570272874]
Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains.
We propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics.
Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches.
arXiv Detail & Related papers (2024-03-05T12:06:48Z)
- VLLaVO: Mitigating Visual Gap through LLMs [7.352822795984628]
Cross-domain learning aims at extracting domain-invariant knowledge to reduce the domain shift between training and testing data.
We propose VLLaVO, combining Vision language models and Large Language models as Visual cross-dOmain learners.
arXiv Detail & Related papers (2024-01-06T16:33:39Z)
- Domain Prompt Learning with Quaternion Networks [49.45309818782329]
We propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of Vision-Language Models to specialized domains.
We present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features.
Our proposed method achieves new state-of-the-art results in prompt learning.
arXiv Detail & Related papers (2023-12-12T08:49:39Z)
- OV-VG: A Benchmark for Open-Vocabulary Visual Grounding [33.02137080950678]
This research endeavor introduces novel and challenging open-vocabulary visual tasks.
The overarching aim is to establish connections between language descriptions and the localization of novel objects.
We have curated a benchmark, encompassing 7,272 OV-VG images and 1,000 OV-PL images.
arXiv Detail & Related papers (2023-10-22T17:54:53Z)
- Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms.
We propose Domain-Controlled Prompt Learning for specific domains.
Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
- Improving Generalization of Image Captioning with Unsupervised Prompt Learning [63.26197177542422]
Generalization of Image Captioning (GeneIC) learns a domain-specific prompt vector for the target domain without requiring annotated data.
GeneIC aligns visual and language modalities with a pre-trained Contrastive Language-Image Pre-Training (CLIP) model.
arXiv Detail & Related papers (2023-08-05T12:27:01Z)
- Fine-Grained Semantically Aligned Vision-Language Pre-Training [151.7372197904064]
Large-scale vision-language pre-training has shown impressive advances in a wide range of downstream tasks.
Existing methods mainly model the cross-modal alignment by the similarity of the global representations of images and texts.
We introduce LOUPE, a fine-grained semantically aligned vision-language pre-training framework, which learns fine-grained semantic alignment from the novel perspective of game-theoretic interactions.
arXiv Detail & Related papers (2022-08-04T07:51:48Z)
This list is automatically generated from the titles and abstracts of the papers on this site.