Related papers: Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection

URL: http://arxiv.org/abs/2505.07219v1
Date: Mon, 12 May 2025 04:15:27 GMT
Title: Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection
Authors: Hongda Qin, Xiao Lu, Zhiyong Wei, Yihong Cao, Kailun Yang, Ningjiang Chen,
Abstract summary: Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task.<n>Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same structure as the image encoder of VLM.<n>We propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization.
Score: 12.5655114431805
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Existing methods typically introduce image or feature augmentation to diversify the source domain to raise the robustness of the detector. Vision-Language Model (VLM)-based augmentation techniques have been proven to be effective, but they require that the detector's backbone has the same structure as the image encoder of VLM, limiting the detector framework selection. To address this problem, we propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization, which diversifies the source domain by fully utilizing the semantic information of the VLM. Specifically, we first construct prompts to transfer style semantics embedded in the VLM to an image translation network. This facilitates the generation of style diversified images with explicit semantic information. Then, we propose image-level style mixing between the diversified images and source domain images. This effectively mines the semantic information for image augmentation without relying on specific augmentation selections. Finally, we propose feature-level style mixing in a double-pipeline manner, allowing feature augmentation to be model-agnostic and can work seamlessly with the mainstream detector frameworks, including the one-stage, two-stage, and transformer-based detectors. Extensive experiments demonstrate the effectiveness of our approach across various benchmark datasets, including real to cartoon and normal to adverse weather tasks. The source code and pre-trained models will be publicly available at https://github.com/qinhongda8/LDDS.

Related papers

Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection [25.15191353465313]
Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples.<n>We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain.
arXiv Detail & Related papers (2026-02-21T12:10:48Z)
IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images.<n>We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection.<n>Experiments on MVTec-AD and VisA datasets demonstrate our state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z)
Cross-View Open-Vocabulary Object Detection in Aerial Imagery [48.851422992413184]
We propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery.<n>The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings.<n>Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting.
arXiv Detail & Related papers (2025-10-04T16:12:03Z)
A Recipe for Improving Remote Sensing VLM Zero Shot Generalization [0.4427533728730559]
We present two novel image-caption datasets for training of remote sensing foundation models.<n>The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps.<n>The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain.
arXiv Detail & Related papers (2025-03-10T21:09:02Z)
ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification [52.405499816861635]
Multiple instance learning (MIL)-based framework has become the mainstream for processing the whole slide image (WSI)<n>We propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification.
arXiv Detail & Related papers (2025-02-12T13:28:46Z)
Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation [16.58381088280145]
We introduce Open-Vocabulary Remote Sensing Image Semantic (OVRSISS) to segment arbitrary semantic classes in remote sensing images.<n>To address the lack of OVRSISS datasets, we develop LandDiscover50K, a comprehensive dataset of 51,846 images covering 40 diverse semantic classes.<n>In addition, we propose a novel framework named GSNet that integrates domain priors from special remote sensing models and versatile capabilities of general vision-language models.
arXiv Detail & Related papers (2024-12-27T07:20:30Z)
I2I-Galip: Unsupervised Medical Image Translation Using Generative Adversarial CLIP [30.506544165999564]
Unpaired image-to-image translation is a challenging task due to the absence of paired examples. We propose a new image-to-image translation framework named Image-to-Image-Generative-Adversarial-CLIP (I2I-Galip)
arXiv Detail & Related papers (2024-09-19T01:44:50Z)
WIDIn: Wording Image for Domain-Invariant Representation in Single-Source Domain Generalization [63.98650220772378]
We present WIDIn, Wording Images for Domain-Invariant representation, to disentangle discriminative visual representation. We first estimate the language embedding with fine-grained alignment, which can be used to adaptively identify and then remove domain-specific counterpart. We show that WIDIn can be applied to both pretrained vision-language models like CLIP, and separately trained uni-modal models like MoCo and BERT.
arXiv Detail & Related papers (2024-05-28T17:46:27Z)
A Dual Attentive Generative Adversarial Network for Remote Sensing Image Change Detection [6.906936669510404]
We propose a dual attentive generative adversarial network for achieving very high-resolution remote sensing image change detection tasks. The DAGAN framework has better performance with 85.01% mean IoU and 91.48% mean F1 score than advanced methods on the LEVIR dataset.
arXiv Detail & Related papers (2023-10-03T08:26:27Z)
Domain-Controlled Prompt Learning [49.45309818782329]
Existing prompt learning methods often lack domain-awareness or domain-transfer mechanisms. We propose a textbfDomain-Controlled Prompt Learning for the specific domains. Our method achieves state-of-the-art performance in specific domain image recognition datasets.
arXiv Detail & Related papers (2023-09-30T02:59:49Z)
I2F: A Unified Image-to-Feature Approach for Domain Adaptive Semantic Segmentation [55.633859439375044]
Unsupervised domain adaptation (UDA) for semantic segmentation is a promising task freeing people from heavy annotation work. Key idea to tackle this problem is to perform both image-level and feature-level adaptation jointly. This paper proposes a novel UDA pipeline for semantic segmentation that unifies image-level and feature-level adaptation.
arXiv Detail & Related papers (2023-01-03T15:19:48Z)
Federated Domain Generalization for Image Recognition via Cross-Client Style Transfer [60.70102634957392]
Domain generalization (DG) has been a hot topic in image recognition, with a goal to train a general model that can perform well on unseen domains. In this paper, we propose a novel domain generalization method for image recognition through cross-client style transfer (CCST) without exchanging data samples. Our method outperforms recent SOTA DG methods on two DG benchmarks (PACS, OfficeHome) and a large-scale medical image dataset (Camelyon17) in the FL setting.
arXiv Detail & Related papers (2022-10-03T13:15:55Z)
StEP: Style-based Encoder Pre-training for Multi-modal Image Synthesis [68.3787368024951]
We propose a novel approach for multi-modal Image-to-image (I2I) translation. We learn a latent embedding, jointly with the generator, that models the variability of the output domain. Specifically, we pre-train a generic style encoder using a novel proxy task to learn an embedding of images, from arbitrary domains, into a low-dimensional style latent space.
arXiv Detail & Related papers (2021-04-14T19:58:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.