Why Existing Multimodal Crowd Counting Datasets Can Lead to Unfulfilled
Expectations in Real-World Applications
- URL: http://arxiv.org/abs/2304.06401v1
- Date: Thu, 13 Apr 2023 11:09:28 GMT
- Title: Why Existing Multimodal Crowd Counting Datasets Can Lead to Unfulfilled
Expectations in Real-World Applications
- Authors: Martin Thißen and Elke Hergenröther
- Abstract summary: All available multimodal datasets for crowd counting are used to investigate the differences between monomodal and multimodal models.
No general answer to whether multimodal models perform better can be derived from the existing datasets.
This paper establishes criteria for a potential dataset suitable for answering whether multimodal models perform better in crowd counting in general.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: More information leads to better decisions and predictions, right? Confirming
this hypothesis, several studies concluded that the simultaneous use of optical
and thermal images leads to better predictions in crowd counting. However, the
way multimodal models extract enriched features from both modalities is not yet
fully understood. Since the use of multimodal data usually increases the
complexity, inference time, and memory requirements of the models, it is
relevant to examine the differences and advantages of multimodal compared to
monomodal models. In this work, all available multimodal datasets for crowd
counting are used to investigate the differences between monomodal and
multimodal models. To do so, we designed a monomodal architecture that
considers the current state of research on monomodal crowd counting. In
addition, several multimodal architectures have been developed using different
multimodal learning strategies. The key components of the monomodal
architecture are also used in the multimodal architectures to be able to answer
whether multimodal models perform better in crowd counting in general.
Surprisingly, no general answer to this question can be derived from the
existing datasets. We found that the existing datasets hold a bias toward
thermal images. This was determined by analyzing the relationship between the
brightness of optical images and crowd count as well as examining the
annotations made for each dataset. Since answering this question is important
for future real-world applications of crowd counting, this paper establishes
criteria for a potential dataset suitable for answering whether multimodal
models perform better in crowd counting in general.
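A minimal sketch of the kind of bias analysis described above (correlating the mean brightness of each optical image with its annotated crowd count) could look as follows. The directory layout, file extensions, and annotation format (one head coordinate per line) are illustrative assumptions, not the authors' released code or the datasets' actual structure:

```python
# Hypothetical sketch: correlate optical-image brightness with annotated crowd count.
# Paths and annotation format are assumptions for illustration only.
import glob
import numpy as np
from PIL import Image
from scipy.stats import pearsonr

def mean_brightness(image_path: str) -> float:
    """Average pixel intensity of the image after grayscale conversion."""
    with Image.open(image_path) as img:
        return float(np.asarray(img.convert("L"), dtype=np.float32).mean())

def crowd_count(annotation_path: str) -> int:
    """Number of annotated head points (assumed: one 'x y' pair per line)."""
    with open(annotation_path) as f:
        return sum(1 for line in f if line.strip())

brightness, counts = [], []
for img_path in sorted(glob.glob("rgb/*.jpg")):          # hypothetical optical-image folder
    ann_path = img_path.replace("rgb/", "annotations/").replace(".jpg", ".txt")
    brightness.append(mean_brightness(img_path))
    counts.append(crowd_count(ann_path))

r, p = pearsonr(brightness, counts)
print(f"Pearson r between optical brightness and crowd count: {r:.3f} (p={p:.3g})")
```

A strongly negative correlation on such an analysis would suggest that the darkest optical images are also the most crowded ones, which is one way a dataset could favor the thermal modality.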
Related papers
- U3M: Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation [63.31007867379312]
We introduce U3M: an Unbiased Multiscale Modal Fusion Model for Multimodal Semantic Segmentation.
We employ feature fusion at multiple scales to ensure the effective extraction and integration of both global and local features.
Experimental results demonstrate that our approach achieves superior performance across multiple datasets.
arXiv Detail & Related papers (2024-05-24T08:58:48Z)
- Sequential Compositional Generalization in Multimodal Models [23.52949473093583]
We conduct a comprehensive assessment of several unimodal and multimodal models.
Our findings reveal that bi-modal and tri-modal models exhibit a clear edge over their text-only counterparts.
arXiv Detail & Related papers (2024-04-18T09:04:15Z)
- Multimodal Learning Without Labeled Multimodal Data: Guarantees and Applications [90.6849884683226]
We study the challenge of interaction quantification in a semi-supervised setting with only labeled unimodal data.
Using a precise information-theoretic definition of interactions, our key contribution is the derivation of lower and upper bounds.
We show how these theoretical results can be used to estimate multimodal model performance, guide data collection, and select appropriate multimodal models for various tasks.
arXiv Detail & Related papers (2023-06-07T15:44:53Z)
- Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos [58.93586436289648]
We propose a multi-scale cooperative multimodal transformer (MCMulT) architecture for multimodal sentiment analysis.
Our model outperforms existing approaches on unaligned multimodal sequences and has strong performance on aligned multimodal sequences.
arXiv Detail & Related papers (2022-06-16T07:47:57Z)
- Does a Technique for Building Multimodal Representation Matter? -- Comparative Analysis [0.0]
We show that the choice of technique for building the multimodal representation is crucial to obtaining the highest possible model performance.
Experiments are conducted on three datasets: Amazon Reviews, MovieLens25M, and MovieLens1M.
arXiv Detail & Related papers (2022-06-09T21:30:10Z)
- Perceptual Score: What Data Modalities Does Your Model Perceive? [73.75255606437808]
We introduce the perceptual score, a metric that assesses the degree to which a model relies on the different subsets of the input features.
We find that recent, more accurate multi-modal models for visual question-answering tend to perceive the visual data less than their predecessors.
Using the perceptual score also helps to analyze model biases by decomposing the score into data subset contributions.
arXiv Detail & Related papers (2021-10-27T12:19:56Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers.
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
- What Makes Multimodal Learning Better than Single (Provably) [28.793128982222438]
We show that learning with multiple modalities achieves a smaller population risk than only using a subset of the modalities.
This is the first theoretical treatment to capture important qualitative phenomena observed in real multimodal applications.
arXiv Detail & Related papers (2021-06-08T17:20:02Z)
- Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data.
Under our proposed framework, the generative model can accurately distinguish related samples from unrelated ones, making it possible to exploit the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)
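As a generic illustration of the contrastive idea in the last entry above, the sketch below pulls embeddings of related (paired) multimodal samples together and pushes apart unrelated ones from the same batch. It is a standard InfoNCE-style objective under assumed encoder outputs, not the cited paper's actual formulation:

```python
# Generic contrastive objective over paired multimodal embeddings: matched
# (related) pairs sit on the diagonal; all other in-batch pairs act as
# unrelated negatives. Illustrative only, not the cited paper's exact loss.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(z_a: torch.Tensor, z_b: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """z_a, z_b: (batch, dim) embeddings of two modalities; row i of each is a related pair."""
    z_a = F.normalize(z_a, dim=1)                            # unit-length embeddings
    z_b = F.normalize(z_b, dim=1)
    logits = z_a @ z_b.t() / temperature                     # cosine similarity of every cross-modal pair
    targets = torch.arange(z_a.size(0), device=z_a.device)   # diagonal entries are the related pairs
    # symmetric cross-entropy: modality A retrieves B and vice versa
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example usage with hypothetical encoders for optical and thermal crowd images:
# loss = multimodal_contrastive_loss(encoder_rgb(x_rgb), encoder_thermal(x_thermal))
```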
This list is automatically generated from the titles and abstracts of the papers on this site.