Domain Adaptation with a Single Vision-Language Embedding
- URL: http://arxiv.org/abs/2410.21361v1
- Date: Mon, 28 Oct 2024 17:59:53 GMT
- Title: Domain Adaptation with a Single Vision-Language Embedding
- Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick PĂ©rez, Raoul de Charette,
- Abstract summary: We present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data.
We show that these mined styles can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation.
- Score: 45.93202559299953
- License:
- Abstract: Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in some uncommon conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the zero-shot and one-shot settings.
Related papers
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z) - Phrase Grounding-based Style Transfer for Single-Domain Generalized
Object Detection [109.58348694132091]
Single-domain generalized object detection aims to enhance a model's generalizability to multiple unseen target domains.
This is a practical yet challenging task as it requires the model to address domain shift without incorporating target domain data into training.
We propose a novel phrase grounding-based style transfer approach for the task.
arXiv Detail & Related papers (2024-02-02T10:48:43Z) - The Unreasonable Effectiveness of Large Language-Vision Models for
Source-free Video Domain Adaptation [56.61543110071199]
Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset.
Previous approaches have attempted to address SFVUDA by leveraging self-supervision derived from the target data itself.
We take an approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior surprisingly robust to domain-shift.
arXiv Detail & Related papers (2023-08-17T18:12:05Z) - IFSeg: Image-free Semantic Segmentation via Vision-Language Model [67.62922228676273]
We introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories.
We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens.
Our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods.
arXiv Detail & Related papers (2023-03-25T08:19:31Z) - P{\O}DA: Prompt-driven Zero-shot Domain Adaptation [27.524962843495366]
We adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt.
We show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation.
arXiv Detail & Related papers (2022-12-06T18:59:58Z) - Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary
Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD)
During the adapting stage, we enable VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task.
Experiments show that our method achieves the state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.