Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection
- URL: http://arxiv.org/abs/2602.18811v1
- Date: Sat, 21 Feb 2026 12:10:48 GMT
- Title: Learning Multi-Modal Prototypes for Cross-Domain Few-Shot Object Detection
- Authors: Wanqi Wang, Jingcai Guo, Yuxiang Cai, Zhi Chen
- Abstract summary: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain.
- Score: 25.15191353465313
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Cross-Domain Few-Shot Object Detection (CD-FSOD) aims to detect novel classes in unseen target domains given only a few labeled examples. While open-vocabulary detectors built on vision-language models (VLMs) transfer well, they depend almost entirely on text prompts, which encode domain-invariant semantics but miss domain-specific visual information needed for precise localization under few-shot supervision. We propose a dual-branch detector that Learns Multi-modal Prototypes, dubbed LMP, by coupling textual guidance with visual exemplars drawn from the target domain. A Visual Prototype Construction module aggregates class-level prototypes from support RoIs and dynamically generates hard-negative prototypes in query images via jittered boxes, capturing distractors and visually similar backgrounds. In the visual-guided branch, we inject these prototypes into the detection pipeline with components mirrored from the text branch as the starting point for training, while a parallel text-guided branch preserves open-vocabulary semantics. The branches are trained jointly and ensembled at inference by combining semantic abstraction with domain-adaptive details. On six cross-domain benchmark datasets and standard 1/5/10-shot settings, our method achieves state-of-the-art or highly competitive mAP.
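The two target-domain ingredients described in the abstract, class-level visual prototypes aggregated from support RoIs and hard-negative prototypes generated from jittered boxes, can be illustrated with a minimal PyTorch-style sketch. The function names, the mean-pooling choice, and the jitter parameters below are illustrative assumptions, not the paper's exact design.

```python
# Minimal sketch of visual prototype construction and jittered hard negatives,
# assuming RoI features have already been extracted by a shared backbone.
import torch
import torch.nn.functional as F


def build_class_prototypes(support_feats: torch.Tensor,
                           support_labels: torch.Tensor,
                           num_classes: int) -> torch.Tensor:
    """Aggregate support RoI features (N, D) into one prototype per class (C, D)."""
    d = support_feats.shape[1]
    prototypes = torch.zeros(num_classes, d)
    for c in range(num_classes):
        mask = support_labels == c
        if mask.any():
            # Simple mean pooling over the few support RoIs of class c (assumption).
            prototypes[c] = support_feats[mask].mean(dim=0)
    # L2-normalize so prototypes can be matched to query RoIs by cosine similarity.
    return F.normalize(prototypes, dim=-1)


def jitter_boxes(boxes: torch.Tensor, scale: float = 0.3,
                 num_jitters: int = 4) -> torch.Tensor:
    """Perturb (x1, y1, x2, y2) boxes to produce candidate hard-negative regions."""
    w = (boxes[:, 2] - boxes[:, 0]).clamp(min=1.0)
    h = (boxes[:, 3] - boxes[:, 1]).clamp(min=1.0)
    wh = torch.stack([w, h, w, h], dim=1)                 # (N, 4) scale per coordinate
    jittered = []
    for _ in range(num_jitters):
        noise = (torch.rand_like(boxes) * 2 - 1) * scale  # uniform in [-scale, scale]
        jittered.append(boxes + noise * wh)
    return torch.cat(jittered, dim=0)
```

Features pooled from such jittered boxes would then serve as negative prototypes during matching, encouraging the visual-guided branch to reject near-miss localizations and visually similar background, as the abstract describes.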
Related papers
- Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding [11.244257545057508]
Prototype-Aware Multimodal Learning (PAML) is an innovative framework that addresses imperfect alignment between visual and linguistic modalities, insufficient cross-modal feature fusion, and ineffective utilization of semantic prototype information. Our framework shows competitive performance in standard scenes while achieving state-of-the-art results in open-vocabulary scenes.
arXiv Detail & Related papers (2025-09-08T02:27:10Z) - Language-Driven Dual Style Mixing for Single-Domain Generalized Object Detection [12.5655114431805]
Generalizing an object detector trained on a single domain to multiple unseen domains is a challenging task. Vision-Language Model (VLM)-based augmentation techniques have proven effective, but they require that the detector's backbone have the same structure as the image encoder of the VLM. We propose Language-Driven Dual Style Mixing (LDDS) for single-domain generalization.
arXiv Detail & Related papers (2025-05-12T04:15:27Z) - Exploring Robust Features for Few-Shot Object Detection in Satellite Imagery [17.156864650143678]
We develop a few-shot object detector based on a traditional two-stage architecture.
A large-scale pre-trained model is used to build class-reference embeddings or prototypes (a minimal sketch of this prototype-matching idea appears after this list).
We perform evaluations on two remote sensing datasets containing challenging and rare objects.
arXiv Detail & Related papers (2024-03-08T15:20:27Z) - Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector [72.05791402494727]
This paper studies the challenging cross-domain few-shot object detection (CD-FSOD) task.
It aims to develop an accurate object detector for novel domains with minimal labeled examples.
arXiv Detail & Related papers (2024-02-05T15:25:32Z) - Phrase Grounding-based Style Transfer for Single-Domain Generalized Object Detection [109.58348694132091]
Single-domain generalized object detection aims to enhance a model's generalizability to multiple unseen target domains.
This is a practical yet challenging task as it requires the model to address domain shift without incorporating target domain data into training.
We propose a novel phrase grounding-based style transfer approach for the task.
arXiv Detail & Related papers (2024-02-02T10:48:43Z) - Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary semantics, to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z) - Generalizing Multiple Object Tracking to Unseen Domains by Introducing Natural Language Representation [33.03600813115465]
We propose to introduce natural language representation into visual MOT models for boosting the domain generalization ability.
To tackle this problem, we design two modules, namely visual context prompting (VCP) and visual-language mixing (VLM).
VLM joins the information in the generated visual prompts and the textual prompts from a pre-defined Trackbook to obtain instance-level pseudo textual descriptions.
Through training models on MOT17 and validating them on MOT20, we observe that the pseudo textual descriptions generated by our proposed modules improve the generalization performance of query-based trackers by large margins.
arXiv Detail & Related papers (2022-12-03T07:57:31Z) - Cross Domain Object Detection by Target-Perceived Dual Branch Distillation [49.68119030818388]
Cross domain object detection is a realistic and challenging task in the wild.
We propose a novel Target-perceived Dual-branch Distillation (TDD) framework.
Our TDD significantly outperforms the state-of-the-art methods on all the benchmarks.
arXiv Detail & Related papers (2022-05-03T03:51:32Z) - Multi-Modal Few-Shot Object Detection with Meta-Learning-Based Cross-Modal Prompting [77.69172089359606]
We study multi-modal few-shot object detection (FSOD) in this paper, using both few-shot visual examples and class semantic information for detection.
Our approach is motivated by the high-level conceptual similarity of (metric-based) meta-learning and prompt-based learning.
We comprehensively evaluate the proposed multi-modal FSOD models on multiple few-shot object detection benchmarks, achieving promising results.
arXiv Detail & Related papers (2022-04-16T16:45:06Z) - Multi-modal Transformers Excel at Class-agnostic Object Detection [105.10403103027306]
We argue that existing methods lack a top-down supervision signal governed by human-understandable semantics.
We develop an efficient and flexible MViT architecture using multi-scale feature processing and deformable self-attention.
We show the significance of MViT proposals in a diverse range of applications.
arXiv Detail & Related papers (2021-11-22T18:59:29Z) - AFAN: Augmented Feature Alignment Network for Cross-Domain Object
Detection [90.18752912204778]
Unsupervised domain adaptation for object detection is a challenging problem with many real-world applications.
We propose a novel augmented feature alignment network (AFAN) which integrates intermediate domain image generation and domain-adversarial training.
Our approach significantly outperforms the state-of-the-art methods on standard benchmarks for both similar and dissimilar domain adaptations.
arXiv Detail & Related papers (2021-06-10T05:01:20Z) - MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding [40.24656027709833]
We propose MDETR, an end-to-end modulated detector that detects objects in an image conditioned on a raw text query.
We use a transformer-based architecture to reason jointly over text and image by fusing the two modalities at an early stage of the model.
Our approach can be easily extended for visual question answering, achieving competitive performance on GQA and CLEVR.
arXiv Detail & Related papers (2021-04-26T17:55:33Z)
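Several of the entries above, for example the satellite-imagery few-shot detector, score detector proposals against class-reference embeddings or prototypes built from a large pre-trained model. The sketch below shows a minimal version of that matching step, assuming prototypes are already available (e.g., from a helper like build_class_prototypes above); the scaled-cosine scoring, the temperature value, and the classify_rois name are illustrative assumptions rather than any cited paper's exact head.

```python
# Minimal sketch of prototype-based RoI classification via cosine similarity.
import torch
import torch.nn.functional as F


def classify_rois(roi_feats: torch.Tensor,         # (R, D) RoI features from the detector
                  class_prototypes: torch.Tensor,  # (C, D) class-reference embeddings
                  temperature: float = 0.07) -> torch.Tensor:
    """Score each RoI against every class prototype with scaled cosine similarity."""
    roi_feats = F.normalize(roi_feats, dim=-1)
    class_prototypes = F.normalize(class_prototypes, dim=-1)
    logits = roi_feats @ class_prototypes.t() / temperature  # (R, C)
    return logits.softmax(dim=-1)
```

In a multi-modal setting such as LMP, the same scoring could be run once with text-derived prototypes and once with visual prototypes, with the two score maps combined at inference, in line with the branch ensembling described in the abstract above.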