Cross-modal Offset-guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection
- URL: http://arxiv.org/abs/2506.16737v1
- Date: Fri, 20 Jun 2025 04:11:39 GMT
- Title: Cross-modal Offset-guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection
- Authors: Liu Zongzhen, Luo Hui, Wang Zhixing, Wei Yuxing, Zuo Haorui, Zhang Jianlin
- Abstract summary: Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. Due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. We propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF) to address these issues.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. To improve robustness, recent studies have explored multimodal detection by fusing visible (RGB) and infrared (IR) imagery. However, due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. This introduces two major challenges: semantic inconsistency at corresponding spatial locations and modality conflict during feature fusion. Existing methods often address these issues in isolation, limiting their effectiveness. In this paper, we propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF), a unified framework that jointly tackles both challenges in weakly aligned UAV-based object detection. CoDAF comprises two novel modules: the Offset-guided Semantic Alignment (OSA), which estimates attention-based spatial offsets and uses deformable convolution guided by a shared semantic space to align features more precisely; and the Dynamic Attention-guided Fusion Module (DAFM), which adaptively balances modality contributions through gating and refines fused features via spatial-channel dual attention. By integrating alignment and fusion in a unified design, CoDAF enables robust UAV object detection. Experiments on standard benchmarks validate the effectiveness of our approach, with CoDAF achieving an mAP of 78.6% on the DroneVehicle dataset.
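The abstract outlines two modules: OSA (offset estimation followed by deformable-convolution alignment) and DAFM (gated fusion refined by channel and spatial attention). The sketch below is a minimal, hypothetical PyTorch rendering of that description, not the authors' implementation; the class names, layer choices, and the use of torchvision's DeformConv2d are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d


class OffsetGuidedAlignment(nn.Module):
    """Sketch of an OSA-style module: predict spatial offsets from the
    concatenated RGB/IR features, then resample the IR features with a
    deformable convolution so they better match the RGB geometry."""

    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        # Offset head: two (x, y) offsets per kernel sampling location.
        self.offset_head = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * kernel_size * kernel_size, 3, padding=1),
        )
        self.align = DeformConv2d(channels, channels, kernel_size,
                                  padding=kernel_size // 2)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_head(torch.cat([rgb, ir], dim=1))
        return self.align(ir, offsets)  # IR features warped toward RGB geometry


class DynamicAttentionFusion(nn.Module):
    """Sketch of a DAFM-style module: a per-pixel gate balances the two
    modalities, then channel and spatial attention refine the fused map."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Conv2d(2 * channels, 2, 1), nn.Softmax(dim=1))
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.spatial_att = nn.Sequential(
            nn.Conv2d(channels, 1, 7, padding=3), nn.Sigmoid())

    def forward(self, rgb: torch.Tensor, ir_aligned: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([rgb, ir_aligned], dim=1))  # (N, 2, H, W) weights
        fused = w[:, 0:1] * rgb + w[:, 1:2] * ir_aligned
        fused = fused * self.channel_att(fused)             # channel reweighting
        return fused * self.spatial_att(fused)              # spatial reweighting


if __name__ == "__main__":
    rgb = torch.randn(1, 64, 80, 80)
    ir = torch.randn(1, 64, 80, 80)
    aligned_ir = OffsetGuidedAlignment(64)(rgb, ir)
    fused = DynamicAttentionFusion(64)(rgb, aligned_ir)
    print(fused.shape)  # torch.Size([1, 64, 80, 80])
```

Under these assumptions, alignment is applied before fusion so that the gate weighs features that already refer to the same spatial locations; the paper's shared-semantic-space guidance for the offsets is omitted here for brevity.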
Related papers
- Tracking the Unstable: Appearance-Guided Motion Modeling for Robust Multi-Object Tracking in UAV-Captured Videos [58.156141601478794]
Multi-object tracking in UAV-captured videos (UAVT) aims to track multiple objects while maintaining consistent identities across frames of a given video. Existing methods typically model motion cues and appearance separately, overlooking their interplay and resulting in suboptimal tracking performance. We propose AMOT, which exploits appearance and motion cues through two key components: an Appearance-Motion Consistency (AMC) matrix and a Motion-aware Track Continuation (MTC) module.
arXiv Detail & Related papers (2025-08-03T12:06:47Z) - WS-DETR: Robust Water Surface Object Detection through Vision-Radar Fusion with Detection Transformer [4.768265044725289]
Water surface object detection faces challenges from blurred edges and diverse object scales. Existing approaches suffer from cross-modal feature conflicts, which negatively affect model robustness. We propose a robust vision-radar fusion model WS-DETR, which achieves state-of-the-art (SOTA) performance.
arXiv Detail & Related papers (2025-04-10T04:16:46Z) - FUSE: Label-Free Image-Event Joint Monocular Depth Estimation via Frequency-Decoupled Alignment and Degradation-Robust Fusion [63.87313550399871]
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability. We propose a Self-supervised Transfer (PST) module and a Frequency-Decoupled Fusion module (FreDF). PST establishes cross-modal knowledge transfer through latent space alignment with image foundation models. FreDF explicitly decouples high-frequency edge features from low-frequency structural components, resolving modality-specific frequency mismatches.
arXiv Detail & Related papers (2025-03-25T15:04:53Z) - TraF-Align: Trajectory-aware Feature Alignment for Asynchronous Multi-agent Perception [7.382491303268417]
TraF-Align learns the flow path of features by predicting the feature-level trajectory of objects from past observations up to the ego vehicle's current time. This approach corrects spatial misalignment and ensures semantic consistency across agents, effectively compensating for motion. Experiments on two real-world datasets, V2V4Real and DAIR-V2X-Seq, show that TraF-Align sets a new benchmark for asynchronous cooperative perception.
arXiv Detail & Related papers (2025-03-25T06:56:35Z) - Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a novel framework that attempts to precisely align hand poses and interactions by integrating foundation model-driven 2D priors with diffusion-based interaction refinement.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - Griffin: Aerial-Ground Cooperative Detection and Tracking Dataset and Benchmark [15.405137983083875]
Aerial-ground cooperation offers a promising solution by integrating UAVs' aerial views with ground vehicles' local observations. This paper presents a comprehensive solution for aerial-ground cooperative 3D perception through three key contributions.
arXiv Detail & Related papers (2025-03-10T07:00:07Z) - DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection [42.70285733630796]
Infrared-visible object detection aims to achieve robust object detection by leveraging the complementary information of infrared and visible image pairs.
However, fusing misaligned complementary features is difficult, and current methods cannot accurately locate objects in both modalities under misalignment.
We propose a Decoupled Position Detection Transformer to address these problems.
Experiments on DroneVehicle and KAIST datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-08-12T13:05:43Z) - Cross-Domain Few-Shot Object Detection via Enhanced Open-Set Object Detector [72.05791402494727]
This paper studies the challenging task of cross-domain few-shot object detection (CD-FSOD).
It aims to develop an accurate object detector for novel domains with minimal labeled examples.
arXiv Detail & Related papers (2024-02-05T15:25:32Z) - Multi-Task Cross-Modality Attention-Fusion for 2D Object Detection [6.388430091498446]
We propose two new radar preprocessing techniques to better align radar and camera data.
We also introduce a Multi-Task Cross-Modality Attention-Fusion Network (MCAF-Net) for object detection.
Our approach outperforms current state-of-the-art radar-camera fusion-based object detectors on the nuScenes dataset.
arXiv Detail & Related papers (2023-07-17T09:26:13Z) - SOOD: Towards Semi-Supervised Oriented Object Detection [57.05141794402972]
This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework.
Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark.
arXiv Detail & Related papers (2023-04-10T11:10:42Z) - PSNet: Parallel Symmetric Network for Video Salient Object Detection [85.94443548452729]
We propose a VSOD network with up and down parallel symmetry, named PSNet.
Two parallel branches with different dominant modalities are set to achieve complete video saliency decoding.
arXiv Detail & Related papers (2022-10-12T04:11:48Z)