Multi-Modal Pedestrian Detection with Large Misalignment Based on
Modal-Wise Regression and Multi-Modal IoU
- URL: http://arxiv.org/abs/2107.11196v1
- Date: Fri, 23 Jul 2021 12:58:41 GMT
- Title: Multi-Modal Pedestrian Detection with Large Misalignment Based on
Modal-Wise Regression and Multi-Modal IoU
- Authors: Napat Wanchaitanawong, Masayuki Tanaka, Takashi Shibata, Masatoshi
Okutomi
- Abstract summary: The combined use of multiple modalities enables accurate pedestrian detection under poor lighting conditions.
The vital assumption behind this combined use is that there is at most a weak misalignment between the two modalities.
In this paper, we propose a multi-modal Faster-RCNN that is robust against large misalignment.
- Score: 15.59089347915245
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The combined use of multiple modalities enables accurate pedestrian
detection under poor lighting conditions by exploiting the high-visibility
regions of each modality together. The vital assumption behind this combined
use is that there is at most a weak misalignment between the two modalities.
In practice, however, this assumption often breaks down. When it does, the
positions of the bounding boxes no longer match between the two modalities,
resulting in a significant decrease in detection accuracy, especially in
regions where the misalignment is large. In this paper, we propose a
multi-modal Faster-RCNN that is robust against large misalignment. The keys
are 1) modal-wise regression and 2) multi-modal IoU for mini-batch sampling.
To deal with large misalignment, we perform bounding-box regression in both
the RPN and the detection head for each modality. We also propose a new
sampling strategy, called "multi-modal mini-batch sampling", that integrates
the IoU for both modalities. Experiments on real images demonstrate that the
proposed method substantially outperforms state-of-the-art methods on data
with large misalignment.
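To make the two keys concrete, here is a minimal PyTorch-style sketch of
modal-wise regression: rather than one set of box deltas, the head predicts
separate deltas for the RGB and thermal ground-truth boxes, so each branch can
absorb the modality-specific offset. The layer sizes and class count are
illustrative assumptions, not the paper's exact architecture.

```python
import torch.nn as nn

class ModalWiseBoxHead(nn.Module):
    """Detection head with one box-regression branch per modality (sketch)."""

    def __init__(self, in_features=1024, num_classes=2):
        super().__init__()
        self.cls_score = nn.Linear(in_features, num_classes)
        # Four deltas (dx, dy, dw, dh) per class, for each modality.
        self.bbox_rgb = nn.Linear(in_features, 4 * num_classes)
        self.bbox_thermal = nn.Linear(in_features, 4 * num_classes)

    def forward(self, fused_features):
        scores = self.cls_score(fused_features)
        # Each branch regresses toward its own modality's ground-truth box,
        # so a large RGB-thermal offset no longer pulls a single regressor
        # in two directions at once.
        return scores, self.bbox_rgb(fused_features), self.bbox_thermal(fused_features)
```

The second key integrates each anchor's IoU against the ground-truth box of
both modalities when labeling positives and negatives for the mini-batch. The
sketch below assumes the integration rule is the minimum of the two IoUs (an
anchor counts as positive only if it overlaps the object in both modalities);
the exact rule and the thresholds are assumptions for illustration.

```python
def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def sample_anchor_labels(anchors, gt_rgb, gt_thermal,
                         pos_thresh=0.5, neg_thresh=0.3):
    """Label anchors for mini-batch sampling using a multi-modal IoU.

    gt_rgb[i] and gt_thermal[i] are the (possibly misaligned) boxes of the
    same pedestrian in the two modalities. Returns 1 (positive),
    0 (negative), or -1 (ignored) per anchor.
    """
    labels = []
    for a in anchors:
        # Integrate both modalities' IoUs; min(...) is one plausible rule.
        best = max((min(iou(a, r), iou(a, t))
                    for r, t in zip(gt_rgb, gt_thermal)), default=0.0)
        if best >= pos_thresh:
            labels.append(1)
        elif best < neg_thresh:
            labels.append(0)
        else:
            labels.append(-1)
    return labels
```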
Related papers
- FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z)
- Centering the Value of Every Modality: Towards Efficient and Resilient Modality-agnostic Semantic Segmentation [7.797154022794006]
Recent endeavors regard RGB modality as the center and the others as the auxiliary, yielding an asymmetric architecture with two branches.
We propose a novel method, named MAGIC, that can be flexibly paired with various backbones, ranging from compact to high-performance models.
Our method achieves state-of-the-art performance while reducing the model parameters by 60%.
arXiv Detail & Related papers (2024-07-16T03:19:59Z)
- AMFD: Distillation via Adaptive Multimodal Fusion for Multispectral Pedestrian Detection [23.91870504363899]
Double-stream networks in multispectral detection employ two separate feature extraction branches for multi-modal data.
This has hindered the widespread employment of multispectral pedestrian detection in embedded devices for autonomous systems.
We introduce the Adaptive Modal Fusion Distillation (AMFD) framework, which can fully utilize the original modal features of the teacher network.
arXiv Detail & Related papers (2024-05-21T17:17:17Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Weakly Aligned Feature Fusion for Multimodal Object Detection [52.15436349488198]
Multimodal data often suffer from the position shift problem, i.e., the image pair is not strictly aligned.
This problem makes it difficult to fuse multimodal features and complicates convolutional neural network (CNN) training.
In this article, we propose a general multimodal detector named aligned region CNN (AR-CNN) to tackle the position shift problem.
arXiv Detail & Related papers (2022-04-21T02:35:23Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
- Exploring Data Augmentation for Multi-Modality 3D Object Detection [82.9988604088494]
It is counter-intuitive that multi-modality methods based on point clouds and images perform only marginally better, or sometimes worse, than approaches that use point clouds alone.
We propose a pipeline, named transformation flow, to bridge the gap between single and multi-modality data augmentation with transformation reversing and replaying.
Our method also wins the best PKL award in the 3rd nuScenes detection challenge.
arXiv Detail & Related papers (2020-12-23T15:23:16Z)
- Improving Multispectral Pedestrian Detection by Addressing Modality Imbalance Problems [12.806496583571858]
Multispectral pedestrian detection can adapt to insufficient illumination conditions by leveraging color-thermal modalities.
Compared with traditional pedestrian detection, we find multispectral pedestrian detection suffers from modality imbalance problems.
We propose Modality Balance Network (MBNet) which facilitates the optimization process in a much more flexible and balanced manner.
arXiv Detail & Related papers (2020-08-07T08:58:46Z)
- MuCAN: Multi-Correspondence Aggregation Network for Video Super-Resolution [63.02785017714131]
Video super-resolution (VSR) aims to utilize multiple low-resolution frames to generate a high-resolution prediction for each frame.
Inter- and intra-frames are the key sources for exploiting temporal and spatial information.
We build an effective multi-correspondence aggregation network (MuCAN) for VSR.
arXiv Detail & Related papers (2020-07-23T05:41:27Z)