MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
- URL: http://arxiv.org/abs/2511.19134v1
- Date: Mon, 24 Nov 2025 13:59:01 GMT
- Title: MambaRefine-YOLO: A Dual-Modality Small Object Detector for UAV Imagery
- Authors: Shuyu Cao, Minxin Chen, Yucheng Song, Zhaozhong Chen, Xinyou Zhang
- Abstract summary: Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. We introduce MambaRefine-YOLO, whose Dual-Gated Complementary Mamba fusion module balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
- Score: 1.005854289245731
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Small object detection in Unmanned Aerial Vehicle (UAV) imagery is a persistent challenge, hindered by low resolution and background clutter. While fusing RGB and infrared (IR) data offers a promising solution, existing methods often struggle with the trade-off between effective cross-modal interaction and computational efficiency. In this letter, we introduce MambaRefine-YOLO. Its core contributions are a Dual-Gated Complementary Mamba fusion module (DGC-MFM) that adaptively balances RGB and IR modalities through illumination-aware and difference-aware gating mechanisms, and a Hierarchical Feature Aggregation Neck (HFAN) that uses a "refine-then-fuse" strategy to enhance multi-scale features. Our comprehensive experiments validate this dual-pronged approach. On the dual-modality DroneVehicle dataset, the full model achieves a state-of-the-art mAP of 83.2%, an improvement of 7.9% over the baseline. On the single-modality VisDrone dataset, a variant using only the HFAN also shows significant gains, demonstrating its general applicability. Our work presents a superior balance between accuracy and speed, making it highly suitable for real-world UAV applications.
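The gating idea in the abstract is concrete enough to sketch. The following PyTorch snippet is a hedged, illustrative reconstruction, not the authors' code: the module name, the gate shapes, and the 1x1-convolution placeholder standing in for the Mamba (state-space) mixing block are all assumptions; only the notion of weighting RGB against IR with an illumination-aware gate and a difference-aware gate comes from the abstract.

```python
# Illustrative sketch of dual-gated RGB-IR fusion (NOT the paper's DGC-MFM).
# The Mamba mixing block is replaced by a 1x1 conv placeholder (assumption).
import torch
import torch.nn as nn

class DualGatedFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        # Illumination-aware gate: one scalar per image, derived from
        # global statistics of the RGB feature map (assumed form).
        self.illum_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # Difference-aware gate: a spatial map driven by |RGB - IR|.
        self.diff_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )
        # Placeholder for the Mamba block used in the real module.
        self.mix = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        g_illum = self.illum_gate(f_rgb)               # (B, 1, 1, 1)
        g_diff = self.diff_gate((f_rgb - f_ir).abs())  # (B, 1, H, W)
        w = g_illum * g_diff                           # combined RGB weight
        fused = w * f_rgb + (1.0 - w) * f_ir           # complementary blend
        return self.mix(fused)

# Usage: fuse two 256-channel feature maps from the RGB and IR backbones.
fusion = DualGatedFusion(256)
out = fusion(torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40))
```

The complementary form w * RGB + (1 - w) * IR matches the abstract's claim that the module adaptively balances the two modalities, e.g. down-weighting the RGB stream under poor illumination.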
Related papers
- YOLO-DS: Fine-Grained Feature Decoupling via Dual-Statistic Synergy Operator for Object Detection [55.58092342624062]
We propose YOLO-DS, a framework built around a novel Dual-Statistic Synergy Operator (DSO). YOLO-DS decouples object features by jointly modeling the channel-wise mean and the peak-to-mean difference. On the MS-COCO benchmark, YOLO-DS consistently outperforms YOLOv8 across five model scales.
arXiv Detail & Related papers (2026-01-26T05:50:32Z)
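The YOLO-DS entry above names a concrete pair of statistics: the channel-wise mean and the peak-to-mean difference. A minimal, hypothetical sketch of one way to combine them into a channel gate follows; the squeeze-and-excitation-style MLP and the gating form are assumptions, since the abstract does not say how the two statistics interact.

```python
# Hypothetical dual-statistic channel gate (NOT the YOLO-DS DSO).
import torch
import torch.nn as nn

class DualStatisticGate(nn.Module):
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Assumed SE-style bottleneck over the concatenated statistics.
        self.mlp = nn.Sequential(
            nn.Linear(2 * channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = x.shape
        mean = x.mean(dim=(2, 3))                  # channel-wise mean (B, C)
        peak = x.amax(dim=(2, 3))                  # channel-wise peak (B, C)
        stats = torch.cat([mean, peak - mean], 1)  # mean + peak-to-mean diff
        w = self.mlp(stats).view(b, c, 1, 1)       # per-channel gate
        return x * w

gate = DualStatisticGate(256)
y = gate(torch.randn(2, 256, 40, 40))  # same shape as the input
```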
- DiffusionDriveV2: Reinforcement Learning-Constrained Truncated Diffusion Modeling in End-to-End Autonomous Driving [65.7087560656003]
Generative diffusion models for end-to-end autonomous driving often suffer from mode collapse. We propose DiffusionDriveV2, which leverages reinforcement learning to constrain low-quality modes and explore for superior trajectories. This significantly enhances the overall output quality while preserving the inherent multimodality of its core Gaussian Mixture Model.
arXiv Detail & Related papers (2025-12-08T17:29:52Z)
- ShortcutBreaker: Low-Rank Noisy Bottleneck with Global Perturbation Attention for Multi-Class Unsupervised Anomaly Detection [59.89803740308262]
ShortcutBreaker is a novel unified feature-reconstruction framework for MUAD tasks. It features two key innovations to address the issue of shortcuts. The proposed method achieves a remarkable image-level AUROC of 99.8%, 98.9%, 90.6%, and 87.8% on four datasets.
arXiv Detail & Related papers (2025-10-21T06:51:30Z)
- UNIV: Unified Foundation Model for Infrared and Visible Modalities [12.0490466425884]
We propose a biologically inspired UNified foundation model for Infrared and Visible modalities (UNIV). PCCL is an attention-guided distillation framework that mimics retinal horizontal cells' lateral inhibition. Our dual-knowledge preservation mechanism emulates the retina's bipolar cell signal routing.
arXiv Detail & Related papers (2025-09-19T06:07:53Z)
- DEPFusion: Dual-Domain Enhancement and Priority-Guided Mamba Fusion for UAV Multispectral Object Detection [6.4402018224356015]
A framework named DEPFusion is proposed for UAV multispectral object detection. It consists of Dual-Domain Enhancement (DDE) and Priority-Guided Mamba Fusion (PGMF). Experiments on the DroneVehicle and VEDAI datasets demonstrate that DEPFusion achieves performance competitive with state-of-the-art methods.
arXiv Detail & Related papers (2025-09-09T01:51:57Z)
- DGE-YOLO: Dual-Branch Gathering and Attention for Accurate UAV Object Detection [0.46040036610482665]
We present DGE-YOLO, an enhanced YOLO-based detection framework designed to effectively fuse multi-modal information. Specifically, we introduce a dual-branch architecture for modality-specific feature extraction, enabling the model to process both infrared and visible images. To further enrich semantic representation, we propose an Efficient Multi-scale Attention (EMA) mechanism that enhances feature learning across spatial scales.
arXiv Detail & Related papers (2025-06-29T14:19:18Z)
- Adaptive Illumination-Invariant Synergistic Feature Integration in a Stratified Granular Framework for Visible-Infrared Re-Identification [18.221111822542024]
Visible-Infrared Person Re-Identification (VI-ReID) plays a crucial role in applications such as search and rescue, infrastructure protection, and nighttime surveillance. We propose AMINet, an Adaptive Modality Interaction Network. AMINet employs multi-granularity feature extraction to capture comprehensive identity attributes from both full-body and upper-body images.
arXiv Detail & Related papers (2025-02-28T15:42:58Z)
- Bringing RGB and IR Together: Hierarchical Multi-Modal Enhancement for Robust Transmission Line Detection [67.02804741856512]
We propose a novel Hierarchical Multi-Modal Enhancement Network (HMMEN) that integrates RGB and IR data for robust and accurate TL detection. Our method introduces two key components: (1) a Mutual Multi-Modal Enhanced Block (MMEB), which fuses and enhances hierarchical RGB and IR feature maps in a coarse-to-fine manner, and (2) a Feature Alignment Block (FAB) that corrects misalignments between decoder outputs and IR feature maps by leveraging deformable convolutions.
arXiv Detail & Related papers (2025-01-25T06:21:06Z)
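The HMMEN entry above describes its Feature Alignment Block concretely: deformable convolutions correct misalignment between decoder outputs and IR feature maps. Below is a hypothetical minimal sketch of that pattern using torchvision's DeformConv2d; the offset-prediction layer and channel layout are assumptions, not the paper's design.

```python
# Hypothetical FAB-style alignment (NOT the HMMEN code): offsets are
# predicted from concatenated features, then a deformable convolution
# resamples the IR features toward the decoder's geometry.
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class FeatureAlign(nn.Module):
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # One (dy, dx) offset per kernel tap per spatial position.
        self.offset = nn.Conv2d(2 * channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)

    def forward(self, dec: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        off = self.offset(torch.cat([dec, ir], dim=1))
        return self.deform(ir, off)  # IR features warped to match the decoder

align = FeatureAlign(256)
warped = align(torch.randn(2, 256, 40, 40), torch.randn(2, 256, 40, 40))
```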
- DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection [5.946464547429392]
Object detection in poor-illumination environments is a challenging task, as objects are usually not clearly visible in RGB images. We propose a dual-enhancement-based cross-modality object detection network, DEYOLO. Our approach outperforms SOTA object detection algorithms by a clear margin.
arXiv Detail & Related papers (2024-12-06T10:39:11Z)
- Robust Modality-incomplete Anomaly Detection: A Modality-instructive Framework with Benchmark [69.02666229531322]
We introduce a pioneering study that investigates Modality-Incomplete Industrial Anomaly Detection (MIIAD). We find that most existing MIAD methods perform poorly on the MIIAD Bench, leading to significant performance degradation. We propose a novel two-stage Robust modAlity-aware fusing and Detecting framewoRk, abbreviated as RADAR.
arXiv Detail & Related papers (2024-10-02T16:47:55Z)
- Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection [65.30079184700755]
This study addresses the issue of fusing infrared and visible images that appear differently for object detection.
Previous approaches discover commonalities underlying the two modalities and fuse in the common space, either by iterative optimization or deep networks.
This paper proposes a bilevel optimization formulation for the joint problem of fusion and detection, which is then unrolled into a target-aware Dual Adversarial Learning (TarDAL) network for fusion and a commonly used detection network.
arXiv Detail & Related papers (2022-03-30T11:44:56Z) - Learning Selective Mutual Attention and Contrast for RGB-D Saliency
Detection [145.4919781325014]
How to effectively fuse cross-modal information is the key problem for RGB-D salient object detection.
Many models adopt a feature fusion strategy but are limited by low-order point-to-point fusion methods.
We propose a novel mutual attention model by fusing attention and contexts from different modalities.
arXiv Detail & Related papers (2020-10-12T08:50:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.