Related papers: Domain Adaptation with a Single Vision-Language Embedding

URL: http://arxiv.org/abs/2410.21361v1
Date: Mon, 28 Oct 2024 17:59:53 GMT
Title: Domain Adaptation with a Single Vision-Language Embedding
Authors: Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette
Abstract summary: We present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. We show that these mined styles can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation.
Score: 45.93202559299953
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in some uncommon conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the zero-shot and one-shot settings.
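The PIN step described in the abstract can be illustrated with a minimal NumPy sketch. It is a toy under stated assumptions, not the authors' implementation: `embed` is a hypothetical stand-in for the frozen CLIP layers (ReLU, average pooling, and a random linear map), the target embedding is random, and a simple accept/reject gradient descent replaces whatever optimizer the paper uses. Only the general shape follows the abstract: instance-normalize low-level source features, apply a learnable per-channel affine transform (mu, sigma), and optimize it so the resulting embedding moves toward a single target embedding.

```python
import numpy as np

def pin(f, mu, sigma, eps=1e-5):
    """Instance-normalize f (C, H, W) per channel, then apply the affine style (mu, sigma)."""
    f_mu = f.mean(axis=(1, 2), keepdims=True)
    f_sigma = f.std(axis=(1, 2), keepdims=True) + eps
    normalized = (f - f_mu) / f_sigma
    return sigma[:, None, None] * normalized + mu[:, None, None], normalized

def embed(styled, proj):
    """Hypothetical stand-in for frozen CLIP layers: ReLU, global average pool, linear map."""
    pooled = np.maximum(styled, 0.0).mean(axis=(1, 2))
    return proj @ pooled

def loss_and_grads(f, mu, sigma, proj, target):
    """Cosine distance to the target embedding and its gradients w.r.t. mu and sigma."""
    styled, normalized = pin(f, mu, sigma)
    e = embed(styled, proj)
    ne, nt = np.linalg.norm(e), np.linalg.norm(target)
    loss = 1.0 - float(e @ target) / (ne * nt)
    # Gradient of 1 - cos(e, target) with respect to e.
    d_e = -(target / (ne * nt) - (e @ target) * e / (ne**3 * nt))
    d_pooled = proj.T @ d_e
    # Back-propagate through the average pool and the ReLU.
    d_styled = d_pooled[:, None, None] * (styled > 0) / styled[0].size
    d_mu = d_styled.sum(axis=(1, 2))
    d_sigma = (d_styled * normalized).sum(axis=(1, 2))
    return loss, d_mu, d_sigma

def mine_style(f, proj, target, steps=100, lr=0.5):
    """Optimize (mu, sigma), starting from the source statistics of f."""
    mu, sigma = f.mean(axis=(1, 2)), f.std(axis=(1, 2))
    loss, d_mu, d_sigma = loss_and_grads(f, mu, sigma, proj, target)
    for _ in range(steps):
        new_mu, new_sigma = mu - lr * d_mu, sigma - lr * d_sigma
        new_loss, g_mu, g_sigma = loss_and_grads(f, new_mu, new_sigma, proj, target)
        if new_loss < loss:  # accept only steps that reduce the loss
            mu, sigma, loss, d_mu, d_sigma = new_mu, new_sigma, new_loss, g_mu, g_sigma
        else:                # otherwise shrink the step size and retry
            lr *= 0.5
    return mu, sigma, loss

rng = np.random.default_rng(0)
f = rng.normal(1.0, 1.0, size=(8, 6, 6))       # toy low-level source feature map
proj = rng.normal(size=(16, 8)) / np.sqrt(8)   # toy projection (assumption, not CLIP)
target = rng.normal(size=16)                   # toy target VL embedding

initial_loss = loss_and_grads(f, f.mean(axis=(1, 2)), f.std(axis=(1, 2)), proj, target)[0]
mu, sigma, final_loss = mine_style(f, proj, target)
styled, _ = pin(f, mu, sigma)
```

Repeating `mine_style` over many source feature maps would yield a set of (mu, sigma) pairs, which corresponds to the "multiple visual styles" the abstract says are then used as augmentations during adaptation.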

Related papers

CoDoL: Conditional Domain Prompt Learning for Out-of-Distribution Generalization [29.68273957414245]
This paper proposes a novel Conditional Domain prompt Learning (CoDoL) method to improve OOD generalization performance. To capture both instance-specific and domain-specific information, we propose a lightweight Domain Meta Network (DMN) to generate input-conditional tokens for images in each domain.
arXiv Detail & Related papers (2025-09-18T18:23:59Z)
Target-Oriented Single Domain Generalization [27.182037614828968]
Deep models trained on a single source domain often fail catastrophically under distribution shifts. We propose Target-Oriented Single Domain Generalization, a novel problem setup that leverages the textual description of the target domain. We introduce Spectral TARget Alignment (STAR), a module that injects target semantics into source features.
arXiv Detail & Related papers (2025-08-30T04:21:48Z)
Weakly-Supervised Image Forgery Localization via Vision-Language Collaborative Reasoning Framework [16.961220047066792]
ViLaCo is a vision-language collaborative reasoning framework that introduces auxiliary semantic supervision distilled from pre-trained vision-language models. ViLaCo substantially outperforms existing WSIFL methods, achieving state-of-the-art performance in both detection and localization accuracy.
arXiv Detail & Related papers (2025-08-02T12:14:29Z)
GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection [5.530212768657544]
We introduce glocal contrastive learning to improve the learning of global and local prompts, effectively detecting abnormal patterns across various domains. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets from both the industrial and medical domains.
arXiv Detail & Related papers (2024-11-09T05:22:13Z)
Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still struggle with basic spatial reasoning errors. We propose a new efficient post-training stage for ViTs called locality alignment. We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection [109.58348694132091]
Single-domain generalized object detection aims to enhance a model's generalizability to multiple unseen target domains. This is a practical yet challenging task as it requires the model to address domain shift without incorporating target domain data into training. We propose a novel phrase grounding-based style transfer approach for the task.
arXiv Detail & Related papers (2024-02-02T10:48:43Z)
The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation [56.61543110071199]
The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists in adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset. Previous approaches have attempted to address SFVUDA by leveraging self-supervision derived from the target data itself. We instead exploit "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift.
arXiv Detail & Related papers (2023-08-17T18:12:05Z)
IFSeg: Image-free Semantic Segmentation via Vision-Language Model [67.62922228676273]
We introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories. We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens. Our model not only establishes an effective baseline for this novel task but also demonstrates strong performances compared to existing methods.
arXiv Detail & Related papers (2023-03-25T08:19:31Z)
PØDA: Prompt-driven Zero-shot Domain Adaptation [27.524962843495366]
We adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt. We show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation.
arXiv Detail & Related papers (2022-12-06T18:59:58Z)
Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection [87.39089806069707]
We propose a fine-grained Visual-Text Prompt-driven self-training paradigm for Open-Vocabulary Detection (VTP-OVD). During the adapting stage, we enable the VLM to obtain fine-grained alignment by using learnable text prompts to resolve an auxiliary dense pixel-wise prediction task. Experiments show that our method achieves state-of-the-art performance for open-vocabulary object detection, e.g., 31.5% mAP on unseen classes of COCO.
arXiv Detail & Related papers (2022-11-02T03:38:02Z)

This list is automatically generated from the titles and abstracts of the papers on this site.