MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely
Coupled Fusion and Modality-Balanced Optimization
- URL: http://arxiv.org/abs/2302.00290v3
- Date: Sat, 11 Nov 2023 12:27:50 GMT
- Title: MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely
Coupled Fusion and Modality-Balanced Optimization
- Authors: Yinghui Xing, Song Wang, Shizhou Zhang, Guoqiang Liang, Xiuwei Zhang,
Yanning Zhang
- Abstract summary: MultiSpectral pedestrian DEtection TRansformer (MS-DETR) is an end-to-end multispectral pedestrian detector.
MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder.
Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets.
- Score: 43.958268661078925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multispectral pedestrian detection is an important task for many
around-the-clock applications, since the visible and thermal modalities can
provide complementary information, especially under low-light conditions. Most
existing multispectral pedestrian detectors are not end-to-end; in this paper,
we propose the MultiSpectral pedestrian DEtection TRansformer (MS-DETR), an
end-to-end multispectral pedestrian detector, which
extends DETR into the field of multi-modal detection. MS-DETR consists of two
modality-specific backbones and Transformer encoders, followed by a multi-modal
Transformer decoder, and the visible and thermal features are fused in the
multi-modal Transformer decoder. To resist the misalignment between
multi-modal images, we design a loosely coupled fusion strategy that sparsely
samples keypoints from the multi-modal features independently and fuses them
with adaptively learned attention weights. Moreover, based on the insight that
not only different modalities but also different pedestrian instances tend to
contribute differently to the final detection confidence, we further propose an
instance-aware modality-balanced optimization strategy, which preserves visible
and thermal decoder branches and aligns their predicted slots through an
instance-wise dynamic loss. Our end-to-end MS-DETR shows superior performance
on the challenging KAIST, CVC-14 and LLVIP benchmark datasets. The source code
is available at https://github.com/YinghuiXing/MS-DETR .
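The loosely coupled fusion idea above (sampling keypoints from each modality independently, then fusing the samples with adaptively learned attention weights) can be sketched in a few lines. This is an illustrative NumPy toy, not the paper's implementation; the function name, the nearest-neighbour sampling, and parameters such as `offsets_vis` and `logits` are all assumptions for the sketch:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def loosely_coupled_fusion(vis_feat, thr_feat, ref_xy,
                           offsets_vis, offsets_thr, logits):
    """Fuse one query's feature by sampling K keypoints per modality
    independently and mixing all 2K samples with softmax attention weights.

    vis_feat, thr_feat : (H, W, C) feature maps of the two modalities
    ref_xy             : (2,) reference point (x, y) of the query
    offsets_vis/thr    : (K, 2) learned sampling offsets, one set per modality
    logits             : (2K,) learned attention logits over all samples
    """
    H, W, C = vis_feat.shape
    samples = []
    for feat, offsets in ((vis_feat, offsets_vis), (thr_feat, offsets_thr)):
        # Each modality samples around the shared reference point with its
        # OWN offsets, so the two modalities need not look at the same pixels.
        pts = np.clip(np.round(ref_xy + offsets).astype(int), 0, [W - 1, H - 1])
        samples.append(feat[pts[:, 1], pts[:, 0]])   # (K, C) nearest-neighbour
    samples = np.concatenate(samples, axis=0)        # (2K, C)
    w = softmax(logits)                              # adaptive weights, sum to 1
    return w @ samples                               # (C,) fused query feature
```

Because each modality samples its own keypoints, a spatial shift between the visible and thermal images does not force the fused feature to mix mismatched locations; the attention weights can also down-weight the less reliable modality per query.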
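The instance-aware modality-balanced optimization can likewise be illustrated with a toy sketch. The paper keeps separate visible and thermal decoder branches and aligns their predicted slots with an instance-wise dynamic loss; one plausible reading (an assumption for illustration, not the paper's exact formulation) weights each branch's per-instance loss by a confidence-derived coefficient:

```python
import numpy as np

def modality_balanced_loss(loss_vis, loss_thr, conf_vis, conf_thr,
                           temperature=1.0):
    """Toy instance-wise dynamic loss: for each pedestrian instance,
    weight the visible- and thermal-branch losses by softmax-normalized
    confidence scores, so the weighting adapts per instance rather than
    using one fixed trade-off for the whole dataset.

    loss_vis, loss_thr : (N,) per-instance losses of the two branches
    conf_vis, conf_thr : (N,) per-instance confidence scores
    """
    logits = np.stack([conf_vis, conf_thr], axis=-1) / temperature
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    w = e / e.sum(axis=-1, keepdims=True)            # (N, 2), rows sum to 1
    per_instance = w[:, 0] * loss_vis + w[:, 1] * loss_thr
    return per_instance.mean()
```

The point of the sketch is only the *instance-wise* weighting: two instances in the same image can balance the modalities differently, which a single global loss weight cannot express.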
Related papers
- MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection [44.35734602609513]
We investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models.
We propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection.
arXiv Detail & Related papers (2024-03-22T13:50:27Z)
- DAMSDet: Dynamic Adaptive Multispectral Detection Transformer with Competitive Query Selection and Adaptive Feature Fusion [82.2425759608975]
Infrared-visible object detection aims to achieve robust, around-the-clock object detection by fusing the complementary information of infrared and visible images.
We propose a Dynamic Adaptive Multispectral Detection Transformer (DAMSDet) to address these two challenges.
Experiments on four public datasets demonstrate significant improvements compared to other state-of-the-art methods.
arXiv Detail & Related papers (2024-03-01T07:03:27Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Multimodal Transformer Using Cross-Channel Attention for Object Detection in Remote Sensing Images [1.662438436885552]
Multi-modal fusion has been shown to enhance detection accuracy by combining data from multiple modalities.
We propose a novel multi-modal fusion strategy for mapping relationships between different channels at the early stage.
By performing fusion at the early stage, rather than at the mid or late stage, our method achieves competitive and even superior performance compared to existing techniques.
arXiv Detail & Related papers (2023-10-21T00:56:11Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation [113.35352122662752]
We present an efficient multi-modal backbone for outdoor 3D perception named UniTR.
UniTR processes a variety of modalities with unified modeling and shared parameters.
UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks.
arXiv Detail & Related papers (2023-08-15T12:13:44Z)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion [59.16333340582885]
We propose a novel multimodal anomaly detection method with hybrid fusion scheme.
Our model outperforms the state-of-the-art (SOTA) methods in both detection and segmentation precision on the MVTec 3D-AD dataset.
arXiv Detail & Related papers (2023-03-01T15:48:27Z)
- Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding [27.568879624013576]
Multimodal transformer exhibits high capacity and flexibility to align image and text for visual grounding.
Existing encoder-only grounding framework suffers from heavy computation due to the self-attention operation with quadratic time complexity.
We present Dynamic Multimodal DETR (Dynamic MDETR), which decouples the whole grounding process into encoding and decoding phases.
arXiv Detail & Related papers (2022-09-28T09:43:02Z)
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs provide combined, complementary information, making object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach, named Cross-Modality Fusion Transformer (CFT) in this paper.
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed papers (including all information) and is not responsible for any consequences of their use.