$\mathbf{C}^2$Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection
- URL: http://arxiv.org/abs/2306.16175v3
- Date: Wed, 13 Mar 2024 10:57:24 GMT
- Title: $\mathbf{C}^2$Former: Calibrated and Complementary Transformer for RGB-Infrared Object Detection
- Authors: Maoxun Yuan, Xingxing Wei
- Abstract summary: We propose a novel Calibrated and Complementary Transformer called $\mathrm{C}^2$Former to address the modality miscalibration and fusion imprecision problems.
Because $\mathrm{C}^2$Former performs in the feature domain, it can be embedded into existing RGB-IR object detectors via the backbone network.
- Score: 18.27510863075184
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Object detection on visible (RGB) and infrared (IR) images, as an emerging
solution to facilitate robust detection for around-the-clock applications, has
received extensive attention in recent years. With the help of IR images,
object detectors have been more reliable and robust in practical applications
by using RGB-IR combined information. However, existing methods still suffer
from modality miscalibration and fusion imprecision problems. Since
transformers have a powerful capability to model the pairwise correlations
between different features, in this paper, we propose a novel Calibrated and
Complementary Transformer called $\mathrm{C}^2$Former to address these two
problems simultaneously. In $\mathrm{C}^2$Former, we design an Inter-modality
Cross-Attention (ICA) module to obtain the calibrated and complementary
features by learning the cross-attention relationship between the RGB and IR
modality. To reduce the computational cost caused by computing the global
attention in ICA, an Adaptive Feature Sampling (AFS) module is introduced to
decrease the dimension of feature maps. Because $\mathrm{C}^2$Former performs
in the feature domain, it can be embedded into existing RGB-IR object detectors
via the backbone network. Thus, one single-stage and one two-stage object
detector, both incorporating our $\mathrm{C}^2$Former, are constructed to
evaluate its effectiveness and versatility. With extensive experiments on the
DroneVehicle and KAIST RGB-IR datasets, we verify that our method can fully
utilize the RGB-IR complementary information and achieve robust detection
results. The code is available at
https://github.com/yuanmaoxun/Calibrated-and-Complementary-Transformer-for-RGB-Infrared-Object-Detection.git.
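The ICA/AFS interplay described above is easy to picture in code. Below is a minimal, hypothetical PyTorch sketch, not the authors' released implementation (see the GitHub link above for that): the class name, the pooling-based sampling, and the residual fusion are illustrative assumptions. Each modality's features query the other modality via cross-attention, after an AFS-like downsampling step that keeps global attention affordable.

```python
# A minimal, hypothetical sketch (NOT the released C2Former code): two-way
# cross-attention between RGB and IR feature maps, with adaptive average
# pooling standing in for the paper's learned Adaptive Feature Sampling.
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterModalityCrossAttention(nn.Module):
    """ICA-style block: each modality queries the other for complementary cues."""

    def __init__(self, channels: int, num_heads: int = 4, sample_hw: int = 16):
        super().__init__()
        self.sample_hw = sample_hw  # AFS stand-in: attention runs on a smaller grid
        self.rgb_from_ir = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.ir_from_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    @staticmethod
    def _to_tokens(x: torch.Tensor) -> torch.Tensor:
        # (B, C, H, W) -> (B, H*W, C): flatten the spatial grid into a token sequence.
        return x.flatten(2).transpose(1, 2)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        b, c, h, w = rgb.shape
        # Shrink both maps so global attention costs O((s*s)^2) rather than O((H*W)^2).
        rgb_tok = self._to_tokens(F.adaptive_avg_pool2d(rgb, self.sample_hw))
        ir_tok = self._to_tokens(F.adaptive_avg_pool2d(ir, self.sample_hw))
        # Cross-attention: RGB tokens gather IR evidence, and vice versa.
        rgb_att, _ = self.rgb_from_ir(rgb_tok, ir_tok, ir_tok)
        ir_att, _ = self.ir_from_rgb(ir_tok, rgb_tok, rgb_tok)

        def to_map(t: torch.Tensor) -> torch.Tensor:
            # (B, s*s, C) -> (B, C, s, s), then upsample back to the input size.
            m = t.transpose(1, 2).reshape(b, c, self.sample_hw, self.sample_hw)
            return F.interpolate(m, size=(h, w), mode="bilinear", align_corners=False)

        # Residual fusion keeps each modality's own features and adds the
        # cross-calibrated, complementary signal from the other modality.
        return rgb + to_map(rgb_att), ir + to_map(ir_att)

if __name__ == "__main__":
    ica = InterModalityCrossAttention(channels=64)
    rgb_feat = torch.randn(2, 64, 64, 64)
    ir_feat = torch.randn(2, 64, 64, 64)
    fused_rgb, fused_ir = ica(rgb_feat, ir_feat)
    print(fused_rgb.shape, fused_ir.shape)  # both torch.Size([2, 64, 64, 64])
```

The pooling step is only the simplest stand-in for AFS (the paper's module learns where to sample), but the computational point is the same: attention runs over an s×s token grid instead of the full H×W feature map.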
Related papers
- The Solution for the GAIIC2024 RGB-TIR object detection Challenge [5.625794757504552]
RGB-TIR object detection aims to utilize both RGB and TIR images for complementary information during detection.
Our proposed method achieved mAP scores of 0.516 and 0.543 on the A and B benchmarks, respectively.
arXiv Detail & Related papers (2024-07-04T12:08:36Z)
- Removal then Selection: A Coarse-to-Fine Fusion Perspective for RGB-Infrared Object Detection [20.12812979315803]
Object detection utilizing both visible (RGB) and thermal infrared (IR) imagery has garnered extensive attention.
Most existing multi-modal object detection methods directly input the RGB and IR images into deep neural networks.
We propose a novel coarse-to-fine perspective to purify and fuse features from both modalities.
arXiv Detail & Related papers (2024-01-19T14:49:42Z)
- RXFOOD: Plug-in RGB-X Fusion for Object of Interest Detection [22.53413063906737]
A crucial part of two-branch RGB-X deep neural networks is how to fuse information across modalities.
We propose RXFOOD for the fusion of features across different scales within the same modality branch and from different modality branches simultaneously.
Experimental results on RGB-NIR salient object detection, RGB-D salient object detection, and RGB-Frequency image manipulation detection demonstrate the clear effectiveness of the proposed RXFOOD.
arXiv Detail & Related papers (2023-06-22T01:27:00Z)
- CIR-Net: Cross-modality Interaction and Refinement for RGB-D Salient Object Detection [144.66411561224507]
We present a convolutional neural network (CNN) model, named CIR-Net, based on the novel cross-modality interaction and refinement.
Our network outperforms the state-of-the-art saliency detectors both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-10-06T11:59:19Z)
- Translation, Scale and Rotation: Cross-Modal Alignment Meets RGB-Infrared Vehicle Detection [10.460296317901662]
We find that detection in aerial RGB-IR images suffers from weak cross-modal misalignment problems.
We propose a Translation-Scale-Rotation Alignment (TSRA) module to address the problem.
A two-stream feature alignment detector (TSFADet) based on the TSRA module is constructed for RGB-IR object detection in aerial images.
arXiv Detail & Related papers (2022-09-28T03:06:18Z)
- Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection [16.64781797503128]
RGB-thermal salient object detection (RGB-T SOD) aims to locate the common prominent objects of an aligned visible and thermal infrared image pair.
In this paper, we propose a novel mirror complementary Transformer network (MCNet) for RGB-T SOD.
Experiments on benchmark datasets and the VT723 dataset show that the proposed method outperforms state-of-the-art approaches.
arXiv Detail & Related papers (2022-07-07T20:26:09Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose the Dual Swin-Transformer based Mutual Interactive Network (DTMINet).
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Transformer-based Network for RGB-D Saliency Detection [82.6665619584628]
Key to RGB-D saliency detection is to fully mine and fuse information at multiple scales across the two modalities.
We show that the transformer is a uniform operation that presents great efficacy in both feature fusion and feature enhancement.
Our proposed network performs favorably against state-of-the-art RGB-D saliency detection methods.
arXiv Detail & Related papers (2021-12-01T15:53:58Z)
- Cross-modality Discrepant Interaction Network for RGB-D Salient Object Detection [78.47767202232298]
We propose a novel Cross-modality Discrepant Interaction Network (CDINet) for RGB-D SOD.
Two components are designed to implement the effective cross-modality interaction.
Our network outperforms $15$ state-of-the-art methods both quantitatively and qualitatively.
arXiv Detail & Related papers (2021-08-04T11:24:42Z)
- Self-Supervised Representation Learning for RGB-D Salient Object Detection [93.17479956795862]
We use Self-Supervised Representation Learning to design two pretext tasks: the cross-modal auto-encoder and the depth-contour estimation.
Our pretext tasks require only a few unlabeled RGB-D datasets for pre-training, which makes the network capture rich semantic contexts.
For the inherent problem of cross-modal fusion in RGB-D SOD, we propose a multi-path fusion module.
arXiv Detail & Related papers (2021-01-29T09:16:06Z)
- Bi-directional Cross-Modality Feature Propagation with Separation-and-Aggregation Gate for RGB-D Semantic Segmentation [59.94819184452694]
Depth information has proven to be a useful cue in the semantic segmentation of RGB-D images for providing a geometric counterpart to the RGB representation.
Most existing works simply assume that depth measurements are accurate and well-aligned with the RGB pixels, and model the problem as cross-modal feature fusion.
In this paper, we propose a unified and efficient Cross-modality Guided Encoder that not only effectively recalibrates RGB feature responses, but also distills accurate depth information via multiple stages and aggregates the two recalibrated representations alternately.
arXiv Detail & Related papers (2020-07-17T18:35:24Z)