MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
- URL: http://arxiv.org/abs/2403.15209v2
- Date: Wed, 29 May 2024 12:53:17 GMT
- Title: MSCoTDet: Language-driven Multi-modal Fusion for Improved Multispectral Pedestrian Detection
- Authors: Taeheon Kim, Sangyun Chung, Damin Yeom, Youngjoon Yu, Hak Gu Kim, Yong Man Ro
- Abstract summary: We investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models.
We propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection.
- Score: 44.35734602609513
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multispectral pedestrian detection is attractive for around-the-clock applications due to the complementary information between RGB and thermal modalities. However, current models often fail to detect pedestrians in certain cases (e.g., thermal-obscured pedestrians), particularly due to the modality bias learned from statistically biased datasets. In this paper, we investigate how to mitigate modality bias in multispectral pedestrian detection using Large Language Models (LLMs). Accordingly, we design a Multispectral Chain-of-Thought (MSCoT) prompting strategy, which prompts the LLM to perform multispectral pedestrian detection. Moreover, we propose a novel Multispectral Chain-of-Thought Detection (MSCoTDet) framework that integrates MSCoT prompting into multispectral pedestrian detection. To this end, we design a Language-driven Multi-modal Fusion (LMF) strategy that enables fusing the outputs of MSCoT prompting with the detection results of vision-based multispectral pedestrian detection models. Extensive experiments validate that MSCoTDet effectively mitigates modality biases and improves multispectral pedestrian detection.
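The abstract describes MSCoT prompting and the LMF strategy only at a high level. As a rough illustration, here is a minimal, hypothetical Python sketch of score-level fusion in that spirit: an LLM is prompted to reason step by step over per-modality region descriptions, and its pedestrian probability is combined with the vision detector's confidence. Every name here (Detection, mscot_prompt, fuse_scores, lmf_detect, query_llm, caption) and the convex-combination fusion rule are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    box: tuple           # (x1, y1, x2, y2) candidate region
    vision_score: float  # confidence from the vision-based multispectral detector

def mscot_prompt(rgb_caption: str, thermal_caption: str) -> str:
    # Chain-of-thought style prompt: describe each modality, then decide.
    return (
        "Step 1: The RGB region shows: " + rgb_caption + "\n"
        "Step 2: The thermal region shows: " + thermal_caption + "\n"
        "Step 3: Reasoning over both modalities, is this a pedestrian? "
        "Answer with a probability between 0 and 1."
    )

def fuse_scores(vision_score: float, llm_score: float, alpha: float = 0.5) -> float:
    # Hypothetical fusion rule: convex combination of the two confidences.
    return alpha * vision_score + (1.0 - alpha) * llm_score

def lmf_detect(detections, query_llm, caption):
    # query_llm: callable, prompt -> probability string.
    # caption: callable, (box, modality) -> text description of that crop.
    fused = []
    for det in detections:
        prompt = mscot_prompt(caption(det.box, "rgb"), caption(det.box, "thermal"))
        llm_score = float(query_llm(prompt))
        fused.append((det.box, fuse_scores(det.vision_score, llm_score)))
    return fused

# Stub usage: an "LLM" that always answers 0.9 rescues a low-confidence detection.
dets = [Detection(box=(10, 20, 50, 120), vision_score=0.4)]
print(lmf_detect(dets, query_llm=lambda p: "0.9",
                 caption=lambda box, mod: f"{mod} crop at {box}"))
```

The point of the sketch is the decision flow (per-modality description, joint reasoning, score-level fusion) that the abstract attributes to LMF; the real framework operates on model outputs rather than toy strings.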
Related papers
- MultiRC: Joint Learning for Time Series Anomaly Prediction and Detection with Multi-scale Reconstructive Contrast [20.857498201188566]
We propose MultiRC to integrate reconstructive and contrastive learning for joint learning of anomaly prediction and detection.
For both anomaly prediction and detection tasks, MultiRC outperforms existing state-of-the-art methods.
arXiv Detail & Related papers (2024-10-21T13:28:28Z)
- Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection [47.00174564102467]
Multispectral pedestrian detectors show poor generalization on examples that fall outside the statistical correlations of their training data.
We propose a novel Causal Mode Multiplexer framework that effectively learns the causalities between multispectral inputs and predictions.
We construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection.
arXiv Detail & Related papers (2024-03-02T19:54:53Z)
- Improving Vision Anomaly Detection with the Guidance of Language Modality [64.53005837237754]
This paper tackles the challenges of the vision modality from a multimodal point of view.
We propose Cross-modal Guidance (CMG) to tackle the issues of redundant information and sparse latent space.
To learn a more compact latent space for the vision anomaly detector, CMLE learns a correlation structure matrix from the language modality.
arXiv Detail & Related papers (2023-10-04T13:44:56Z)
- Multimodal Industrial Anomaly Detection via Hybrid Fusion [59.16333340582885]
We propose a novel multimodal anomaly detection method with a hybrid fusion scheme.
Our model outperforms state-of-the-art (SOTA) methods in both detection and segmentation precision on the MVTec 3D-AD dataset.
arXiv Detail & Related papers (2023-03-01T15:48:27Z)
- MS-DETR: Multispectral Pedestrian Detection Transformer with Loosely Coupled Fusion and Modality-Balanced Optimization [43.04788370184486]
Misalignment and modality imbalance are the most significant issues in multispectral pedestrian detection.
MS-DETR consists of two modality-specific backbones and Transformer encoders, followed by a multi-modal Transformer decoder (see the structural sketch after this list).
Our end-to-end MS-DETR shows superior performance on the challenging KAIST, CVC-14 and LLVIP benchmark datasets.
arXiv Detail & Related papers (2023-02-01T07:45:10Z)
- Prompting for Multi-Modal Tracking [70.0522146292258]
We propose a novel multi-modal prompt tracker (ProTrack) for multi-modal tracking.
ProTrack transfers multi-modal inputs to a single modality via the prompt paradigm.
Our ProTrack achieves high-performance multi-modal tracking by altering only the inputs, even without extra training on multi-modal data.
arXiv Detail & Related papers (2022-07-29T09:35:02Z)
- Cross-Modality Fusion Transformer for Multispectral Object Detection [0.0]
Multispectral image pairs provide complementary information, making object detection applications more reliable and robust.
We present a simple yet effective cross-modality feature fusion approach named Cross-Modality Fusion Transformer (CFT).
arXiv Detail & Related papers (2021-10-30T15:34:12Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim to learn pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
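For the MS-DETR entry above, the following is a minimal structural sketch in PyTorch of the described layout: two modality-specific backbones and Transformer encoders feeding a shared multi-modal Transformer decoder. The patchify-style convolutional backbones, layer counts, query count, and head shapes are illustrative assumptions, not MS-DETR's actual architecture.

```python
import torch
import torch.nn as nn

class TwoStreamDETRSketch(nn.Module):
    """Hypothetical two-stream layout: modality-specific backbones and
    Transformer encoders, followed by a shared multi-modal decoder."""

    def __init__(self, d_model=256, nhead=8, num_queries=100, num_classes=2):
        super().__init__()
        # Stand-in "backbones": patchify convolutions, one per modality.
        self.rgb_backbone = nn.Conv2d(3, d_model, kernel_size=16, stride=16)
        self.thermal_backbone = nn.Conv2d(1, d_model, kernel_size=16, stride=16)

        def encoder():
            layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=2)

        self.rgb_encoder = encoder()
        self.thermal_encoder = encoder()
        # Shared multi-modal decoder attends over both token streams.
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead, batch_first=True),
            num_layers=2)
        self.queries = nn.Embedding(num_queries, d_model)
        self.class_head = nn.Linear(d_model, num_classes)
        self.box_head = nn.Linear(d_model, 4)

    def forward(self, rgb, thermal):
        # Per-modality features, flattened into token sequences.
        r = self.rgb_encoder(self.rgb_backbone(rgb).flatten(2).transpose(1, 2))
        t = self.thermal_encoder(self.thermal_backbone(thermal).flatten(2).transpose(1, 2))
        # Loosely coupled fusion: concatenate tokens as decoder memory.
        memory = torch.cat([r, t], dim=1)
        q = self.queries.weight.unsqueeze(0).expand(rgb.size(0), -1, -1)
        hs = self.decoder(q, memory)
        return self.class_head(hs), self.box_head(hs).sigmoid()

# Dummy 640x512 RGB-thermal pair.
model = TwoStreamDETRSketch()
logits, boxes = model(torch.randn(1, 3, 640, 512), torch.randn(1, 1, 640, 512))
```

Keeping the two encoders separate and fusing only in the decoder is what makes this kind of design "loosely coupled": each modality can be optimized on its own tokens before any cross-modal interaction.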