Multi-modal Crowd Counting via Modal Emulation
- URL: http://arxiv.org/abs/2407.19491v1
- Date: Sun, 28 Jul 2024 13:14:57 GMT
- Title: Multi-modal Crowd Counting via Modal Emulation
- Authors: Chenhao Wang, Xiaopeng Hong, Zhiheng Ma, Yupeng Wei, Yabin Wang, Xiaopeng Fan
- Abstract summary: We propose a modal emulation-based two-pass multi-modal crowd-counting framework.
The framework consists of two key components: a \emph{multi-modal inference} pass and a \emph{cross-modal emulation} pass.
Experiments on both RGB-Thermal and RGB-Depth counting datasets demonstrate its superior performance compared to previous methods.
- Score: 41.959740205234446
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multi-modal crowd counting is a crucial task that uses multi-modal cues to estimate the number of people in crowded scenes. To overcome the gap between different modalities, we propose a modal emulation-based two-pass multi-modal crowd-counting framework that enables efficient modal emulation, alignment, and fusion. The framework consists of two key components: a \emph{multi-modal inference} pass and a \emph{cross-modal emulation} pass. The former utilizes a hybrid cross-modal attention module to extract global and local information and achieve efficient multi-modal fusion. The latter uses attention prompting to coordinate different modalities and enhance multi-modal alignment. We also introduce a modality alignment module that uses an efficient modal consistency loss to align the outputs of the two passes and bridge the semantic gap between modalities. Extensive experiments on both RGB-Thermal and RGB-Depth counting datasets demonstrate its superior performance compared to previous methods. Code available at https://github.com/Mr-Monday/Multi-modal-Crowd-Counting-via-Modal-Emulation.
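As a rough illustration of the two-pass design described in the abstract, the sketch below pairs a fused multi-modal inference pass with a prompt-driven cross-modal emulation pass and aligns their outputs with a consistency loss. All module names, tensor shapes, and the loss weighting are assumptions made for illustration; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch of a two-pass multi-modal counting setup (assumptions only;
# the paper's real implementation lives in the linked GitHub repository).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAttention(nn.Module):
    """Simplified stand-in for a hybrid cross-modal attention module."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # Query one modality with the other and fuse via a residual connection.
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)


class TwoPassCounter(nn.Module):
    """Pass 1: multi-modal inference (RGB + thermal fusion).
    Pass 2: cross-modal emulation (RGB only, steered by learnable prompts)."""

    def __init__(self, dim: int = 64, tokens: int = 196):
        super().__init__()
        self.rgb_proj = nn.Linear(3, dim)      # toy per-token encoders
        self.thermal_proj = nn.Linear(1, dim)
        self.fusion = CrossModalAttention(dim)
        self.prompts = nn.Parameter(torch.randn(1, tokens, dim) * 0.02)
        self.head = nn.Linear(dim, 1)          # per-token density prediction

    def forward(self, rgb_tokens, thermal_tokens):
        f_rgb = self.rgb_proj(rgb_tokens)
        f_t = self.thermal_proj(thermal_tokens)
        fused = self.fusion(f_rgb, f_t)                                   # pass 1
        emulated = self.fusion(f_rgb,
                               self.prompts.expand(f_rgb.size(0), -1, -1))  # pass 2
        return self.head(fused).relu(), self.head(emulated).relu()


def training_step(model, rgb_tokens, thermal_tokens, gt_density, lam=0.1):
    pred_mm, pred_emu = model(rgb_tokens, thermal_tokens)
    count_loss = F.mse_loss(pred_mm, gt_density)
    # Modal consistency term: align the emulation pass with the inference pass.
    consistency = F.l1_loss(pred_emu, pred_mm.detach())
    return count_loss + lam * consistency


if __name__ == "__main__":
    model = TwoPassCounter()
    rgb = torch.rand(2, 196, 3)        # toy tokenized RGB features
    thermal = torch.rand(2, 196, 1)    # toy tokenized thermal features
    gt = torch.rand(2, 196, 1)
    loss = training_step(model, rgb, thermal, gt)
    loss.backward()
    print(float(loss))
```

In this toy setup the emulation pass shares the fusion module with the inference pass and is supervised only through the consistency term, which is one simple way to read the alignment idea in the abstract.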
Related papers
- Part-Whole Relational Fusion Towards Multi-Modal Scene Understanding [51.96911650437978]
Multi-modal fusion has played a vital role in multi-modal scene understanding.
Most existing methods focus on cross-modal fusion involving two modalities, often overlooking more complex multi-modal fusion.
We propose a relational Part-Whole Fusion (PWRF) framework for multi-modal scene understanding.
arXiv Detail & Related papers (2024-10-19T02:27:30Z)
- DepMamba: Progressive Fusion Mamba for Multimodal Depression Detection [37.701518424351505]
Depression is a common mental disorder that affects millions of people worldwide.
We propose an audio-visual progressive fusion Mamba for multimodal depression detection, termed DepMamba.
arXiv Detail & Related papers (2024-09-24T09:58:07Z)
- Turbo your multi-modal classification with contrastive learning [17.983460380784337]
In this paper, we propose a novel contrastive learning strategy, called $Turbo$, to promote multi-modal understanding.
Specifically, multi-modal data pairs are sent through the forward pass twice with different hidden dropout masks to get two different representations for each modality.
With these representations, we obtain multiple in-modal and cross-modal contrastive objectives for training (a minimal sketch of this twin-pass idea appears after this list).
arXiv Detail & Related papers (2024-09-14T03:15:34Z)
- Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images.
We propose a novel approach by introducing an auxiliary broker modality and frame the task as a triple-modal learning problem.
We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: An Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Align and Attend: Multimodal Summarization with Dual Contrastive Losses [57.83012574678091]
The goal of multimodal summarization is to extract the most important information from different modalities to form output summaries.
Existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples.
We introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input.
arXiv Detail & Related papers (2023-03-13T17:01:42Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
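The Turbo entry above describes running each modality through the encoder twice with different dropout masks and then combining in-modal and cross-modal contrastive objectives. The sketch below is a minimal, assumption-heavy illustration of that general idea; the encoder layout, the InfoNCE-style loss, and all names are illustrative and not taken from that paper.

```python
# Minimal sketch of a dropout-based twin-pass contrastive setup (illustrative
# assumptions only; the actual Turbo method is defined in the cited paper).
import torch
import torch.nn as nn
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Standard InfoNCE loss: matching rows of a and b are positives."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)


class Encoder(nn.Module):
    def __init__(self, in_dim, out_dim=128, p=0.1):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(), nn.Dropout(p), nn.Linear(256, out_dim)
        )

    def forward(self, x):
        return self.net(x)


def twin_pass_contrastive_losses(enc_a, enc_b, x_a, x_b):
    # Two forward passes in train mode -> different dropout masks -> two views.
    za1, za2 = enc_a(x_a), enc_a(x_a)
    zb1, zb2 = enc_b(x_b), enc_b(x_b)
    in_modal = info_nce(za1, za2) + info_nce(zb1, zb2)
    cross_modal = info_nce(za1, zb1) + info_nce(za2, zb2)
    return in_modal + cross_modal


if __name__ == "__main__":
    enc_a, enc_b = Encoder(512), Encoder(256)   # toy encoders for two modalities
    loss = twin_pass_contrastive_losses(enc_a, enc_b,
                                        torch.rand(8, 512), torch.rand(8, 256))
    loss.backward()
    print(float(loss))
```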
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences arising from its use.