MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely
Coupled Fusion and Modality-Balanced Optimization
- URL: http://arxiv.org/abs/2302.00290v3
- Date: Sat, 11 Nov 2023 12:27:50 GMT
- Title: MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely
Coupled Fusion and Modality-Balanced Optimization
- Authors: Yinghui Xing, Song Wang, Shizhou Zhang, Guoqiang Liang, Xiuwei Zhang,
Yanning Zhang
- Abstract summary: MultiSpectral pedestrian DEtection TRansformer (MS-DETR) is an end-to-end multispectral pedestrian detector.
MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder.
Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets.
- Score: 43.958268661078925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multispectral pedestrian detection is an important task for many
around-the-clock applications, since the visible and thermal modalities can
provide complementary information, especially under low-light conditions. Most
existing multispectral pedestrian detectors are not end-to-end; in this paper,
we propose the MultiSpectral pedestrian DEtection TRansformer (MS-DETR), an
end-to-end multispectral pedestrian detector, which
extends DETR into the field of multi-modal detection. MS-DETR consists of two
modality-specific backbones and Transformer encoders, followed by a multi-modal
Transformer decoder, and the visible and thermal features are fused in the
multi-modal Transformer decoder. To resist the misalignment between
multi-modal images, we design a loosely coupled fusion strategy that sparsely
samples keypoints from the multi-modal features independently and fuses them
with adaptively learned attention weights. Moreover, based on the insight that
not only different modalities but also different pedestrian instances tend to
contribute differently to the final detection confidence, we further propose an
instance-aware modality-balanced optimization strategy, which preserves visible
and thermal decoder branches and aligns their predicted slots through an
instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance
on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code
is available at https://github.com/YinghuiXing/MS-DETR .
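The loosely coupled fusion idea above (sampling keypoints from each modality independently, then fusing the samples with adaptively learned attention weights) can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's implementation; the function name, the nearest-neighbour sampling, and parameters such as `offsets_vis` and `logits` are all assumptions for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def loosely_coupled_fusion(vis_feat, thr_feat, ref_xy,
                           offsets_vis, offsets_thr, logits):
    """Fuse one query's feature by sampling K keypoints per modality
    independently and mixing all 2K samples with softmax attention weights.

    vis_feat, thr_feat : (H, W, C) feature maps of the two modalities
    ref_xy             : (2,) reference point (x, y) of the query
    offsets_vis/thr    : (K, 2) learned sampling offsets, one set per modality
    logits             : (2K,) learned attention logits over all samples
    """
    H, W, C = vis_feat.shape
    samples = []
    for feat, offsets in ((vis_feat, offsets_vis), (thr_feat, offsets_thr)):
        # Each modality samples around the shared reference point with its
        # OWN offsets, so the two modalities need not look at the same pixels.
        pts = np.clip(np.round(ref_xy + offsets).astype(int), 0, [W - 1, H - 1])
        samples.append(feat[pts[:, 1], pts[:, 0]])   # (K, C) nearest-neighbour
    samples = np.concatenate(samples, axis=0)        # (2K, C)
    w = softmax(logits)                              # adaptive weights, sum to 1
    return w @ samples                               # (C,) fused query feature
```

Because each modality samples its own keypoints, a spatial shift between the visible and thermal images does not force the fused feature to mix mismatched locations; the attention weights can also down-weight the less reliable modality per query.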
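The instance-aware modality-balanced optimization can likewise be illustrated with a toy sketch. The paper keeps separate visible and thermal decoder branches and aligns their predicted slots with an instance-wise dynamic loss; one plausible reading (an assumption for illustration, not the paper's exact formulation) weights each branch's per-instance loss by a confidence-derived coefficient:

```python
import numpy as np

def modality_balanced_loss(loss_vis, loss_thr, conf_vis, conf_thr,
                           temperature=1.0):
    """Toy instance-wise dynamic loss: for each pedestrian instance,
    weight the visible- and thermal-branch losses by softmax-normalized
    confidence scores, so the weighting adapts per instance rather than
    using one fixed trade-off for the whole dataset.

    loss_vis, loss_thr : (N,) per-instance losses of the two branches
    conf_vis, conf_thr : (N,) per-instance confidence scores
    """
    logits = np.stack([conf_vis, conf_thr], axis=-1) / temperature
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)            # (N, 2), rows sum to 1
    per_instance = w[:, 0] * loss_vis + w[:, 1] * loss_thr
    return per_instance.mean()
```

The point of the sketch is only the *instance-wise* weighting: two instances in the same image can balance the modalities differently, which a single global loss weight cannot express.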
Related papers
- MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection [44.35734602609513]
We investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models.
We propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection.
arXiv Detail & Related papers (2024-03-22T13:50:27Z)
- DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, around-the-clock object detection by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Multimodal Transformer Using Cross-Channel Attention for Object Detection in Remote Sensing Images [1.662438436885552]
Multi-modal fusion has been shown to enhance detection accuracy by combining data from multiple modalities.
We propose a novel multi-modal fusion strategy for mapping relationships between different channels at the early stage.
By performing fusion at the early stage, rather than at the mid or late stage, our method achieves competitive and even superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-10-21T00:56:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion [59.16333340582885]
We propose a novel multimodal anomaly detection method with hybrid fusion scheme.
Our model outperforms the state-of-the-art (SOTA) methods in both detection and segmentation precision on the MVTec 3D-AD dataset.
arXiv Detail & Related papers (2023-03-01T15:48:27Z)
- Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding [27.568879624013576]
Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding.
Existing encoder-only grounding framework suffers from heavy computation due to the self-attention operation with quadratic time complexity.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z)
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs provide combined, complementary information, making object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.