From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
- URL: http://arxiv.org/abs/2512.15971v1
- Date: Wed, 17 Dec 2025 21:06:36 GMT
- Title: From Words to Wavelengths: VLMs for Few-Shot Multispectral Object Detection
- Authors: Manuel Nkegoum, Minh-Tan Pham, Élisa Fromont, Bruno Avignon, Sébastien Lefèvre
- Abstract summary: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance. Motivated by the recent success of Vision-Language Models in computer vision, we explore their potential for few-shot multispectral object detection.
- Score: 7.459632891054827
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multispectral object detection is critical for safety-sensitive applications such as autonomous driving and surveillance, where robust perception under diverse illumination conditions is essential. However, the limited availability of annotated multispectral data severely restricts the training of deep detectors. In such data-scarce scenarios, textual class information can serve as a valuable source of semantic supervision. Motivated by the recent success of Vision-Language Models (VLMs) in computer vision, we explore their potential for few-shot multispectral object detection. Specifically, we adapt two representative VLM-based detectors, Grounding DINO and YOLO-World, to handle multispectral inputs and propose an effective mechanism to integrate text, visual and thermal modalities. Through extensive experiments on two popular multispectral image benchmarks, FLIR and M3FD, we demonstrate that VLM-based detectors not only excel in few-shot regimes, significantly outperforming specialized multispectral models trained with comparable data, but also achieve competitive or superior results under fully supervised settings. Our findings reveal that the semantic priors learned by large-scale VLMs effectively transfer to unseen spectral modalities, offering a powerful pathway toward data-efficient multispectral perception.
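The abstract does not spell out the fusion mechanism, so the following is only a minimal sketch of one plausible design: a learned per-pixel gate that mixes RGB and thermal feature maps, followed by CLIP-style dot-product scoring of region features against class-name text embeddings, in the spirit of open-vocabulary detectors such as Grounding DINO and YOLO-World. All module names, dimensions, and the gating scheme are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only (assumed design, not the paper's method):
# gated fusion of RGB + thermal feature maps, then text-conditioned scoring.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RGBThermalFusion(nn.Module):
    """Fuse per-location RGB and thermal feature maps with a learned gate."""

    def __init__(self, dim: int = 256):
        super().__init__()
        # Gate predicts a per-pixel weight between the two modalities.
        self.gate = nn.Sequential(nn.Conv2d(2 * dim, dim, 1), nn.Sigmoid())
        self.proj = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, rgb_feat: torch.Tensor, thermal_feat: torch.Tensor) -> torch.Tensor:
        # rgb_feat, thermal_feat: (B, dim, H, W) from modality-specific backbones.
        stacked = torch.cat([rgb_feat, thermal_feat], dim=1)
        g = self.gate(stacked)                        # per-pixel modality weighting
        fused = g * rgb_feat + (1.0 - g) * thermal_feat
        return fused + self.proj(stacked)             # residual mixing of both streams


def text_conditioned_scores(region_feats: torch.Tensor,
                            text_embeds: torch.Tensor) -> torch.Tensor:
    """Score each region against each class-name embedding (CLIP-style dot product)."""
    # region_feats: (B, N_regions, dim); text_embeds: (N_classes, dim).
    region_feats = F.normalize(region_feats, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    return region_feats @ text_embeds.t()             # (B, N_regions, N_classes)


if __name__ == "__main__":
    fusion = RGBThermalFusion(dim=256)
    rgb = torch.randn(2, 256, 32, 32)
    thermal = torch.randn(2, 256, 32, 32)
    fused = fusion(rgb, thermal)                      # (2, 256, 32, 32)
    regions = fused.flatten(2).transpose(1, 2)        # naive "regions": one per location
    text = torch.randn(3, 256)                        # stand-in embeddings, e.g. "person", "car", "bicycle"
    print(text_conditioned_scores(regions, text).shape)  # torch.Size([2, 1024, 3])
```

In an actual detector the fused feature map would feed the region-proposal or query stage; the one-region-per-location flattening above is only a stand-in to make the scoring step concrete.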
Related papers
- MODA: The First Challenging Benchmark for Multispectral Object Detection in Aerial Images [26.48439423478357]
We introduce the first large-scale dataset for Multispectral Object Detection in Aerial images (MODA). This dataset comprises 14,041 MSIs and 330,191 annotations across diverse, challenging scenarios. We also propose OSSDet, a framework that integrates spectral and spatial information with object-aware cues.
arXiv Detail & Related papers (2025-12-10T10:07:06Z)
- A Tri-Modal Dataset and a Baseline System for Tracking Unmanned Aerial Vehicles [74.8162337823142]
MM-UAV is the first large-scale benchmark for Multi-Modal UAV Tracking. The dataset spans over 30 challenging scenarios, with 1,321 synchronised multi-modal sequences and more than 2.8 million annotated frames. Accompanying the dataset, we provide a novel multi-modal multi-UAV tracking framework.
arXiv Detail & Related papers (2025-11-23T08:42:17Z)
- IAD-GPT: Advancing Visual Knowledge in Multimodal Large Language Model for Industrial Anomaly Detection [70.02774285130238]
This paper explores the combination of rich text semantics with both image-level and pixel-level information from images. We propose IAD-GPT, a novel paradigm based on MLLMs for Industrial Anomaly Detection. Experiments on the MVTec-AD and VisA datasets demonstrate state-of-the-art performance.
arXiv Detail & Related papers (2025-10-16T02:48:05Z)
- MGCR-Net: Multimodal Graph-Conditioned Vision-Language Reconstruction Network for Remote Sensing Change Detection [55.702662643521265]
We propose the multimodal graph-conditioned vision-language reconstruction network (MGCR-Net) to explore the semantic interaction capabilities of multimodal data. Experimental results on four public datasets demonstrate that MGCR achieves superior performance compared to mainstream CD methods.
arXiv Detail & Related papers (2025-08-03T02:50:08Z)
- DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection [0.46040036610482665]
We present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales.
arXiv Detail & Related papers (2025-06-29T14:19:18Z)
- DINO-CoDT: Multi-class Collaborative Detection and Tracking with Vision Foundation Models [11.34839442803445]
We propose a multi-class collaborative detection and tracking framework tailored for diverse road users. We first present a detector with a global spatial attention fusion (GSAF) module, enhancing multi-scale feature learning for objects of varying sizes. Next, we introduce a tracklet RE-IDentification (REID) module that leverages visual semantics with a vision foundation model to effectively reduce ID SWitch (IDSW) errors.
arXiv Detail & Related papers (2025-06-09T02:49:10Z)
- Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models [70.41727912081463]
Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images. We propose a framework to equip MLLMs with robust multi-frame spatial understanding by integrating depth perception, visual correspondence, and dynamic perception. Our model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable, generalizable multi-frame reasoning.
arXiv Detail & Related papers (2025-05-22T17:59:39Z)
- Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models [50.98559225639266]
We investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. We propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions.
arXiv Detail & Related papers (2024-12-26T05:41:31Z)
- Optimizing Multispectral Object Detection: A Bag of Tricks and Comprehensive Benchmarks [49.84182981950623]
Multispectral object detection, utilizing RGB and TIR (thermal infrared) modalities, is widely recognized as a challenging task. It requires not only the effective extraction of features from both modalities and robust fusion strategies, but also the ability to address issues such as spectral discrepancies. We introduce an efficient and easily deployable multispectral object detection framework that can seamlessly optimize high-performing single-modality models.
arXiv Detail & Related papers (2024-11-27T12:18:39Z)
- Mutual Information Regularization for Weakly-supervised RGB-D Salient Object Detection [33.210575826086654]
We present a weakly-supervised RGB-D salient object detection model.
We focus on effective multimodal representation learning via inter-modal mutual information regularization.
arXiv Detail & Related papers (2023-06-06T12:36:57Z)