Related papers: DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection

URL: http://arxiv.org/abs/2408.06123v1
Date: Mon, 12 Aug 2024 13:05:43 GMT
Title: DPDETR: Decoupled Position Detection Transformer for Infrared-Visible Object Detection
Authors: Junjie Guo, Chenqiang Gao, Fangcen Liu, Deyu Meng
Abstract summary: Infrared-visible object detection aims to achieve robust object detection by leveraging the complementary information of infrared and visible image pairs. Fusing misaligned complementary features is difficult, and current methods cannot accurately locate objects in both modalities under misalignment conditions. We propose a Decoupled Position Detection Transformer to address these problems. Experiments on DroneVehicle and KAIST datasets demonstrate significant improvements compared to other state-of-the-art methods.
Score: 42.70285733630796
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Infrared-visible object detection aims to achieve robust object detection by leveraging the complementary information of infrared and visible image pairs. However, the common modality misalignment problem presents two challenges: fusing misaligned complementary features is difficult, and current methods cannot accurately locate objects in both modalities under misalignment conditions. In this paper, we propose a Decoupled Position Detection Transformer (DPDETR) to address these problems. Specifically, we explicitly formulate the object category, visible modality position, and infrared modality position to enable the network to learn the intrinsic relationships and output accurate positions of objects in both modalities. To fuse misaligned object features accurately, we propose a Decoupled Position Multispectral Cross-attention module that adaptively samples and aggregates multispectral complementary features with the constraint of infrared and visible reference positions. Additionally, we design a query-decoupled Multispectral Decoder structure to address the optimization gap among the three kinds of object information in our task and propose a Decoupled Position Contrastive DeNoising Training strategy to enhance DPDETR's ability to learn decoupled positions. Experiments on DroneVehicle and KAIST datasets demonstrate significant improvements compared to other state-of-the-art methods. The code will be released at https://github.com/gjj45/DPDETR.

Related papers

Disentangle Object and Non-object Infrared Features via Language Guidance [35.60538936337868]
We propose a novel vision-language representation learning paradigm for infrared object detection. An additional textual supervision with rich semantic information is explored to guide the disentanglement of object and non-object features. Our approach achieves superior performance on two benchmarks: M$^3$FD (83.7% mAP), FLIR (86.1% mAP).
arXiv Detail & Related papers (2026-01-14T06:59:54Z)
TransBridge: Boost 3D Object Detection by Scene-Level Completion with Transformer Decoder [66.22997415145467]
This paper presents a joint completion and detection framework that improves the detection feature in sparse areas. Specifically, we propose TransBridge, a novel transformer-based up-sampling block that fuses the features from the detection and completion networks. The results show that our framework consistently improves end-to-end 3D object detection, with mean average precision (mAP) gains ranging from 0.7 to 1.5 across multiple methods.
arXiv Detail & Related papers (2025-12-12T00:08:03Z)
Cross-modal Offset-guided Dynamic Alignment and Fusion for Weakly Aligned UAV Object Detection [0.0]
Unmanned aerial vehicle (UAV) object detection plays a vital role in applications such as environmental monitoring and urban security. Due to UAV platform motion and asynchronous imaging, spatial misalignment frequently occurs between modalities, leading to weak alignment. We propose Cross-modal Offset-guided Dynamic Alignment and Fusion (CoDAF) to address these issues.
arXiv Detail & Related papers (2025-06-20T04:11:39Z)
DEYOLO: Dual-Feature-Enhancement YOLO for Cross-Modality Object Detection [5.946464547429392]
Object detection in poor-illumination environments is a challenging task as objects are usually not clearly visible in RGB images. We propose a dual-enhancement-based cross-modality object detection network DEYOLO. Our approach outperforms SOTA object detection algorithms by a clear margin.
arXiv Detail & Related papers (2024-12-06T10:39:11Z)
OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images [26.37802649901314]
Oriented object detection in remote sensing images is a challenging task due to objects being distributed in multi-orientation. We propose an end-to-end transformer-based oriented object detector consisting of three dedicated modules to address these issues. Compared with previous end-to-end detectors, the OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$.
arXiv Detail & Related papers (2024-09-29T10:36:33Z)
DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, even full-day, object detection by fusing the complementary information of infrared and visible images. We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges. Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z)
Removal then Selection: A Coarse-to-Fine Fusion Perspective for RGB-Infrared Object Detection [20.12812979315803]
Object detection utilizing both visible (RGB) and thermal infrared (IR) imagery has garnered extensive attention. Most existing multi-modal object detection methods directly input the RGB and IR images into deep neural networks. We propose a novel coarse-to-fine perspective to purify and fuse features from both modalities.
arXiv Detail & Related papers (2024-01-19T14:49:42Z)
Multi-Task Cross-Modality Attention-Fusion for 2D Object Detection [6.388430091498446]
We propose two new radar preprocessing techniques to better align radar and camera data. We also introduce a Multi-Task Cross-Modality Attention-Fusion Network (MCAF-Net) for object detection. Our approach outperforms current state-of-the-art radar-camera fusion-based object detectors in the nuScenes dataset.
arXiv Detail & Related papers (2023-07-17T09:26:13Z)
An Interactively Reinforced Paradigm for Joint Infrared-Visible Image Fusion and Saliency Object Detection [59.02821429555375]
This research focuses on the discovery and localization of hidden objects in the wild and serves unmanned systems. Through empirical analysis, infrared and visible image fusion (IVIF) makes hard-to-find objects apparent. Multimodal salient object detection (SOD) accurately delineates the precise spatial location of objects within the picture.
arXiv Detail & Related papers (2023-05-17T06:48:35Z)
Enhancing Deformable Local Features by Jointly Learning to Detect and Describe Keypoints [8.390939268280235]
Local feature extraction is a standard approach in computer vision for tackling important tasks such as image matching and retrieval. We propose DALF, a novel deformation-aware network for jointly detecting and describing keypoints. Our approach also enhances the performance of two real-world applications: deformable object retrieval and non-rigid 3D surface registration.
arXiv Detail & Related papers (2023-04-02T18:01:51Z)
Adaptive Rotated Convolution for Rotated Object Detection [96.94590550217718]
We present the Adaptive Rotated Convolution (ARC) module to handle the rotated object detection problem. In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images. The proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP.
arXiv Detail & Related papers (2023-03-14T11:53:12Z)
Multitask AET with Orthogonal Tangent Regularity for Dark Object Detection [84.52197307286681]
We propose a novel multitask auto encoding transformation (MAET) model to enhance object detection in a dark environment. In a self-supervised manner, the MAET learns the intrinsic visual structure by encoding and decoding the realistic illumination-degrading transformation. We have achieved state-of-the-art performance using synthetic and real-world datasets.
arXiv Detail & Related papers (2022-05-06T16:27:14Z)
Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned. This problem makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training. In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z)
Multi-View Adaptive Fusion Network for 3D Object Detection [14.506796247331584]
3D object detection based on LiDAR-camera fusion is becoming an emerging research theme for autonomous driving. We propose a single-stage multi-view fusion framework that takes LiDAR bird's-eye view, LiDAR range view and camera view images as inputs for 3D object detection. We design an end-to-end learnable network named MVAF-Net to integrate these two components.
arXiv Detail & Related papers (2020-11-02T00:06:01Z)
Align Deep Features for Oriented Object Detection [40.28244152216309]
We propose a single-shot Alignment Network (S$^2$A-Net) consisting of two modules: a Feature Alignment Module (FAM) and an Oriented Detection Module (ODM). The FAM can generate high-quality anchors with an Anchor Refinement Network and adaptively align the convolutional features according to the anchor boxes with a novel Alignment Convolution. The ODM first adopts active rotating filters to encode the orientation information and then produces orientation-sensitive and orientation-invariant features to alleviate the inconsistency between classification score and localization accuracy.
arXiv Detail & Related papers (2020-08-21T09:55:13Z)
Cross-Modality 3D Object Detection [63.29935886648709]
We present a novel two-stage multi-modal fusion network for 3D object detection. The whole architecture facilitates two-stage fusion. Our experiments on the KITTI dataset show that the proposed multi-stage fusion helps the network to learn better representations.
arXiv Detail & Related papers (2020-08-16T11:01:20Z)

This list is automatically generated from the titles and abstracts of the papers on this site.