Beyond Open Vocabulary: Multimodal Prompting for Object Detection in Remote Sensing Images
- URL: http://arxiv.org/abs/2602.01954v1
- Date: Mon, 02 Feb 2026 11:03:01 GMT
- Authors: Shuai Yang, Ziyue Huang, Jiaxin Chen, Qingjie Liu, Yunhong Wang
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary object detection in remote sensing commonly relies on text-only prompting to specify target categories, implicitly assuming that inference-time category queries can be reliably grounded through pretraining-induced text-visual alignment. In practice, this assumption often breaks down in remote sensing scenarios due to task- and application-specific category semantics, resulting in unstable category specification under open-vocabulary settings. To address this limitation, we propose RS-MPOD, a multimodal open-vocabulary detection framework that reformulates category specification beyond text-only prompting by incorporating instance-grounded visual prompts, textual prompts, and their multimodal integration. RS-MPOD introduces a visual prompt encoder to extract appearance-based category cues from exemplar instances, enabling text-free category specification, and a multimodal fusion module to integrate visual and textual information when both modalities are available. Extensive experiments on standard, cross-dataset, and fine-grained remote sensing benchmarks show that visual prompting yields more reliable category specification under semantic ambiguity and distribution shifts, while multimodal prompting provides a flexible alternative that remains competitive when textual semantics are well aligned.
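The abstract names two components, a visual prompt encoder and a multimodal fusion module, without giving implementation details. The PyTorch sketch below illustrates one plausible reading: attention-pooling exemplar ROI features into a category cue, then gating it against a text embedding. Every module, dimension, and design choice here is an assumption for illustration, not the authors' released code.

```python
# Minimal sketch of the two ideas named in the abstract: a visual prompt
# encoder that turns exemplar-instance features into a category embedding,
# and a fusion module that merges it with a text embedding when both are
# available. All names, sizes, and the gating design are assumptions.
import torch
import torch.nn as nn

class VisualPromptEncoder(nn.Module):
    """Pools per-exemplar instance features into one category embedding."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn_score = nn.Linear(dim, 1)  # attention-pool over exemplars

    def forward(self, exemplar_feats: torch.Tensor) -> torch.Tensor:
        # exemplar_feats: (num_exemplars, dim) ROI features of exemplar boxes
        h = self.proj(exemplar_feats)
        w = self.attn_score(h).softmax(dim=0)   # (num_exemplars, 1) weights
        return (w * h).sum(dim=0)               # (dim,) appearance-based cue

class MultimodalFusion(nn.Module):
    """Gated fusion of a visual category cue and a text category embedding."""
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vis: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        g = self.gate(torch.cat([vis, txt], dim=-1))  # per-dim mixing weight
        return g * vis + (1 - g) * txt                # fused category query

# Usage: the fused vector plays the role a text embedding alone would play
# in a standard open-vocabulary detector's classification head; with no text
# prompt, enc(exemplars) can be used directly for text-free specification.
enc, fuse = VisualPromptEncoder(), MultimodalFusion()
exemplars = torch.randn(3, 256)   # ROI features of 3 exemplar instances
text_emb = torch.randn(256)       # text-encoder embedding of the class name
category_query = fuse(enc(exemplars), text_emb)
print(category_query.shape)       # torch.Size([256])
```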
Related papers
- Few-Shot Remote Sensing Image Scene Classification with CLIP and Prompt Learning [0.9558392439655014]
We explore prompt learning as a lightweight and efficient adaptation strategy for few-shot remote sensing image scene classification. We benchmark these prompt-learning methods against two standard baselines: zero-shot CLIP with hand-crafted prompts and a linear probe trained on frozen CLIP features. Our findings underscore prompt learning as a scalable and efficient solution for bridging the domain gap in satellite and aerial imagery.
arXiv Detail & Related papers (2025-10-28T11:39:22Z) - Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment. We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment. Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z) - Text-guided Visual Prompt DINO for Generic Segmentation [31.33676182634522]
We propose Prompt-DINO, a text-guided visual Prompt DINO framework. First, we introduce an early fusion mechanism that unifies text/visual prompts and backbone features. Second, we design order-aligned query selection for DETR-based architectures. Third, we develop a generative data engine powered by the Recognize Anything via Prompting (RAP) model.
arXiv Detail & Related papers (2025-08-08T09:09:30Z) - SDVPT: Semantic-Driven Visual Prompt Tuning for Open-World Object Counting [70.49268117587562]
We propose a plug-and-play Semantic-Driven Visual Prompt Tuning framework (SDVPT) that transfers knowledge from the training set to unseen categories. During inference, we dynamically synthesize the visual prompts for unseen categories based on the semantic correlation between unseen and training categories.
arXiv Detail & Related papers (2025-04-24T09:31:08Z) - Beyond Mask: Rethinking Guidance Types in Few-shot Segmentation [67.35274834837064]
We develop a universal vision-language framework (UniFSS) to integrate prompts from text, mask, box, and image.
UniFSS significantly outperforms the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-16T08:41:01Z) - Test-time Contrastive Concepts for Open-world Semantic Segmentation with Vision-Language Models [14.899741072838994]
Recent CLIP-like Vision-Language Models (VLMs), pre-trained on large amounts of image-text pairs, have paved the way to open-vocabulary semantic segmentation. We propose two different approaches to automatically generate, at test time, query-specific textual contrastive concepts.
arXiv Detail & Related papers (2024-07-06T12:18:43Z) - OVMR: Open-Vocabulary Recognition with Multi-Modal References [96.21248144937627]
Existing works have proposed different methods to embed category cues into the model, e.g., through few-shot fine-tuning.
This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images.
The proposed OVMR is a plug-and-play module, and works well with exemplar images randomly crawled from the Internet.
arXiv Detail & Related papers (2024-06-07T06:45:28Z) - LLMs Meet VLMs: Boost Open Vocabulary Object Detection with Fine-grained Descriptors [58.75140338866403]
DVDet is a Descriptor-Enhanced Open Vocabulary Detector.
It transforms regional embeddings into image-like representations that can be directly integrated into general open vocabulary detection training.
Extensive experiments over multiple large-scale benchmarks show that DVDet outperforms the state-of-the-art consistently by large margins.
arXiv Detail & Related papers (2024-02-07T07:26:49Z) - Exploration of visual prompt in Grounded pre-trained open-set detection [6.560519631555968]
We propose a novel visual prompt method that learns new category knowledge from a few labeled images.
We evaluate the method on the ODinW dataset and show that it outperforms existing prompt learning methods.
arXiv Detail & Related papers (2023-12-14T11:52:35Z) - PromptDet: Expand Your Detector Vocabulary with Uncurated Images [47.600059694034]
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations.
We propose a two-stage open-vocabulary object detector that categorises each box proposal by a classifier generated from the text encoder of a pre-trained visual-language model (a minimal sketch of this step appears after this list).
To scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resource, iteratively updating the prompts, and later self-training the proposed detector with pseudo labels generated on a large corpus of noisy, uncurated web images.
arXiv Detail & Related papers (2022-03-30T17:50:21Z)
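Several of the entries above (PromptDet most explicitly) classify box proposals by scoring region embeddings against text-encoder embeddings of the category names. The sketch below shows that shared step plus a PromptDet-style confidence filter for pseudo-labelling; the tensor shapes, the temperature, and the 0.9 threshold are assumptions for illustration, not values from any of the papers.

```python
# Sketch of the open-vocabulary classification step: score each proposal's
# region embedding against text embeddings of class prompts via cosine
# similarity, then keep confident predictions as pseudo labels for
# self-training on uncurated web images (PromptDet-style).
import torch

def classify_proposals(region_embs: torch.Tensor,
                       text_embs: torch.Tensor,
                       temperature: float = 0.01) -> torch.Tensor:
    """Cosine-similarity classifier over box proposals.

    region_embs: (num_proposals, dim) embeddings of detector box proposals
    text_embs:   (num_classes, dim) text-encoder embeddings of class prompts
    returns:     (num_proposals, num_classes) class probabilities
    """
    r = region_embs / region_embs.norm(dim=-1, keepdim=True)
    t = text_embs / text_embs.norm(dim=-1, keepdim=True)
    return (r @ t.T / temperature).softmax(dim=-1)

# Self-training step: retain only high-confidence predictions as pseudo
# labels (the 0.9 threshold is an illustrative assumption).
probs = classify_proposals(torch.randn(100, 512), torch.randn(20, 512))
conf, labels = probs.max(dim=-1)
keep = conf > 0.9
pseudo_labels = labels[keep]
```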