When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
- URL: http://arxiv.org/abs/2407.10125v1
- Date: Sun, 14 Jul 2024 09:16:49 GMT
- Title: When Pedestrian Detection Meets Multi-Modal Learning: Generalist Model and Benchmark Dataset
- Authors: Yi Zhang, Wang Zeng, Sheng Jin, Chen Qian, Ping Luo, Wentao Liu
- Abstract summary: This paper introduces MMPedestron, a novel generalist model for multimodal perception.
The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection.
With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks.
- Score: 40.24765100535353
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent years have witnessed increasing research attention towards pedestrian detection by taking advantage of different sensor modalities (e.g. RGB, IR, Depth, LiDAR and Event). However, designing a unified generalist model that can effectively process diverse sensor modalities remains a challenge. This paper introduces MMPedestron, a novel generalist model for multimodal perception. Unlike previous specialist models that only process one or a pair of specific modality inputs, MMPedestron is able to process multiple modal inputs and their dynamic combinations. The proposed approach comprises a unified encoder for modal representation and fusion and a general head for pedestrian detection. We introduce two extra learnable tokens, i.e. MAA and MAF, for adaptive multi-modal feature fusion. In addition, we construct the MMPD dataset, the first large-scale benchmark for multi-modal pedestrian detection. This benchmark incorporates existing public datasets and a newly collected dataset called EventPed, covering a wide range of sensor modalities including RGB, IR, Depth, LiDAR, and Event data. With multi-modal joint training, our model achieves state-of-the-art performance on a wide range of pedestrian detection benchmarks, surpassing leading models tailored for a specific sensor modality. For example, it achieves 71.1 AP on COCO-Persons and 72.6 AP on LLVIP. Notably, our model achieves comparable performance to the InternImage-H model on CrowdHuman with 30x fewer parameters. Code and data are available at https://github.com/BubblyYi/MMPedestron.
Related papers
- MANet: Fine-Tuning Segment Anything Model for Multimodal Remote Sensing Semantic Segmentation [8.443065903814821]
This study introduces a novel Multimodal Adapter-based Network (MANet) for multimodal remote sensing semantic segmentation.
At the core of this approach is the development of a Multimodal Adapter (MMAdapter), which fine-tunes SAM's image encoder to effectively leverage the model's general knowledge for multimodal data.
This work not only introduces a novel network for multimodal fusion, but also demonstrates, for the first time, SAM's powerful generalization capabilities with Digital Surface Model (DSM) data.
arXiv Detail & Related papers (2024-10-15T00:52:16Z)
- FoRA: Low-Rank Adaptation Model beyond Multimodal Siamese Network [19.466279425330857]
We propose a novel multimodal object detector, named Low-rank Modal Adaptors (LMA) with a shared backbone.
Our work was submitted to ACM MM in April 2024, but was rejected.
arXiv Detail & Related papers (2024-07-23T02:27:52Z)
- Towards a Generalist and Blind RGB-X Tracker [91.36268768952755]
We develop a single model tracker that can remain blind to any modality X during inference time.
Our training process is extremely simple, integrating multi-label classification loss with a routing function.
Our generalist and blind tracker can achieve competitive performance compared to well-established modal-specific models.
arXiv Detail & Related papers (2024-05-28T03:00:58Z)
- Bi-directional Adapter for Multi-modal Tracking [67.01179868400229]
We propose a novel multi-modal visual prompt tracking model based on a universal bi-directional adapter.
We develop a simple but effective light feature adapter to transfer modality-specific information from one modality to another.
Our model achieves superior tracking performance in comparison with both the full fine-tuning methods and the prompt learning-based methods.
arXiv Detail & Related papers (2023-12-17T05:27:31Z)
- Single-Model and Any-Modality for Video Object Tracking [85.83753760853142]
We introduce Un-Track, a Unified Tracker of a single set of parameters for any modality.
To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques.
Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters.
arXiv Detail & Related papers (2023-11-27T14:17:41Z)
- Multi-Metric AutoRec for High Dimensional and Sparse User Behavior Data Prediction [10.351592131677018]
We propose a multi-metric AutoRec (MMA) based on the representative AutoRec.
MMA enjoys the multi-metric orientation from a set of dispersed metric spaces, achieving a comprehensive representation of user data.
MMA can outperform seven other state-of-the-art models in predicting unobserved user behavior data.
arXiv Detail & Related papers (2022-12-20T12:28:07Z)
- Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection [67.33924278729903]
In this work, we propose Dual Swin-Transformer based Mutual Interactive Network.
We adopt Swin-Transformer as the feature extractor for both RGB and depth modality to model the long-range dependencies in visual inputs.
Comprehensive experiments on five standard RGB-D SOD benchmark datasets demonstrate the superiority of the proposed DTMINet method.
arXiv Detail & Related papers (2022-06-07T08:35:41Z)
- Flexible-Modal Face Anti-Spoofing: A Benchmark [66.18359076810549]
Face anti-spoofing (FAS) plays a vital role in securing face recognition systems from presentation attacks.
We establish the first flexible-modal FAS benchmark with the principle 'train one for all'.
We also investigate prevalent deep models and feature fusion strategies for flexible-modal FAS.
arXiv Detail & Related papers (2022-02-16T16:55:39Z)
This list is automatically generated from the titles and abstracts of the papers on this site.