A Study of Multimodal Person Verification Using Audio-Visual-Thermal
Data
- URL: http://arxiv.org/abs/2110.12136v1
- Date: Sat, 23 Oct 2021 04:41:03 GMT
- Title: A Study of Multimodal Person Verification Using Audio-Visual-Thermal
Data
- Authors: Madina Abdrakhmanova, Saniya Abushakimova, Yerbolat Khassanov, and
Huseyin Atakan Varol
- Abstract summary: We study an approach to multimodal person verification using audio, visual, and thermal modalities.
We implement unimodal, bimodal, and trimodal verification systems using state-of-the-art deep learning architectures.
- Score: 4.149096351426994
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we study an approach to multimodal person verification using
audio, visual, and thermal modalities. The combination of audio and visual
modalities has already been shown to be effective for robust person
verification. From this perspective, we investigate the impact of further
increasing the number of modalities by supplementing thermal images. In
particular, we implemented unimodal, bimodal, and trimodal verification systems
using state-of-the-art deep learning architectures and compared their
performance under clean and noisy conditions. We also compared two popular
fusion approaches based on simple score averaging and a soft attention mechanism.
The experiment conducted on the SpeakingFaces dataset demonstrates the
superiority of the trimodal verification system over both unimodal and bimodal
systems. To enable the reproducibility of the experiment and facilitate
research into multimodal person verification, we make our code, pretrained
models and preprocessed dataset freely available in our GitHub repository.
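The abstract contrasts two fusion strategies: score-level fusion by simple averaging of the unimodal verification scores, and embedding-level fusion through a soft attention mechanism. The sketch below is only an illustration of these two ideas, not the authors' released code; the module names, tensor shapes, and PyTorch framing are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttentionFusion(nn.Module):
    """Fuse per-modality embeddings with learned soft attention weights."""
    def __init__(self, embed_dim: int):
        super().__init__()
        # One scalar attention score per modality embedding.
        self.score = nn.Linear(embed_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, num_modalities, embed_dim)
        weights = F.softmax(self.score(embeddings), dim=1)  # (batch, M, 1)
        return (weights * embeddings).sum(dim=1)            # (batch, embed_dim)

def score_average(scores_per_modality: list) -> torch.Tensor:
    """Score-level fusion: average the unimodal verification scores."""
    return torch.stack(scores_per_modality, dim=0).mean(dim=0)

if __name__ == "__main__":
    batch, dim = 4, 512
    audio, visual, thermal = (torch.randn(batch, dim) for _ in range(3))
    fused = SoftAttentionFusion(dim)(torch.stack([audio, visual, thermal], dim=1))
    print(fused.shape)  # torch.Size([4, 512])
    # Score averaging over per-modality cosine similarities of trial pairs:
    sims = [F.cosine_similarity(torch.randn(batch, dim), torch.randn(batch, dim))
            for _ in range(3)]
    print(score_average(sims).shape)  # torch.Size([4])
```

Score averaging requires no extra training, while learned attention weights can down-weight a corrupted modality (for example, noisy audio) on a per-sample basis.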
Related papers
- Diff-2-in-1: Bridging Generation and Dense Perception with Diffusion Models [39.127620891450526]
We introduce a unified, versatile, diffusion-based framework, Diff-2-in-1, to handle both multi-modal data generation and dense visual perception.
We further enhance discriminative visual perception via multi-modal generation, by utilizing the denoising network to create multi-modal data that mirror the distribution of the original training set.
arXiv Detail & Related papers (2024-11-07T18:59:53Z)
- Unveiling and Mitigating Bias in Audio Visual Segmentation [9.427676046134374]
Community researchers have developed a range of advanced audio-visual segmentation models to improve the quality of sounding objects' masks.
While masks created by these models may initially appear plausible, they occasionally exhibit anomalies with incorrect grounding logic.
We attribute this to inherent real-world preferences and distributions acting as a simpler learning signal than the complex audio-visual grounding.
arXiv Detail & Related papers (2024-07-23T16:55:04Z)
- Joint Multimodal Transformer for Emotion Recognition in the Wild [49.735299182004404]
Multimodal emotion recognition (MMER) systems typically outperform unimodal systems.
This paper proposes an MMER method that relies on a joint multimodal transformer (JMT) for fusion with key-based cross-attention.
arXiv Detail & Related papers (2024-03-15T17:23:38Z)
- Exploiting Modality-Specific Features For Multi-Modal Manipulation Detection And Grounding [54.49214267905562]
We construct a transformer-based framework for multi-modal manipulation detection and grounding tasks.
Our framework simultaneously explores modality-specific features while preserving the capability for multi-modal alignment.
We propose an implicit manipulation query (IMQ) that adaptively aggregates global contextual cues within each modality.
arXiv Detail & Related papers (2023-09-22T06:55:41Z)
- StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning.
This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models.
Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
- Learning Audio-Visual embedding for Wild Person Verification [18.488385598522125]
We propose an audio-visual network that considers the aggregator from a fusion perspective.
We introduce improved attentive statistics pooling for the first time in face verification.
Finally, we fuse the modalities with a gated attention mechanism.
arXiv Detail & Related papers (2022-09-09T02:29:47Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Self-attention fusion for audiovisual emotion recognition with incomplete data [103.70855797025689]
We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
arXiv Detail & Related papers (2022-01-26T18:04:29Z)
- Multi-Modal Perception Attention Network with Self-Supervised Learning for Audio-Visual Speaker Tracking [18.225204270240734]
We propose a novel Multi-modal Perception Tracker (MPT) for speaker tracking using both audio and visual modalities.
MPT achieves 98.6% and 78.3% tracking accuracy on the standard and occluded datasets, respectively.
arXiv Detail & Related papers (2021-12-14T14:14:17Z)
- Attention Bottlenecks for Multimodal Fusion [90.75885715478054]
Machine perception models are typically modality-specific and optimised for unimodal benchmarks.
We introduce a novel transformer-based architecture that uses 'fusion bottlenecks' for modality fusion at multiple layers (a rough sketch of this idea follows the list).
We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks.
arXiv Detail & Related papers (2021-06-30T22:44:12Z)
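The bottleneck-fusion idea referenced above can be illustrated with a minimal sketch. This is not the implementation from the cited paper; the layer structure, token counts, and PyTorch API choices below are assumptions made only to show how a few shared bottleneck tokens can mediate cross-modal exchange.

```python
import torch
import torch.nn as nn

class BottleneckFusionLayer(nn.Module):
    """Minimal sketch: modalities exchange information only through a small
    set of shared bottleneck tokens instead of full pairwise cross-attention."""
    def __init__(self, dim: int, num_heads: int = 4, num_bottlenecks: int = 4):
        super().__init__()
        self.bottlenecks = nn.Parameter(torch.randn(num_bottlenecks, dim))
        self.write_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.read_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, modality_tokens):
        batch = modality_tokens[0].shape[0]
        bn = self.bottlenecks.unsqueeze(0).expand(batch, -1, -1)
        # Step 1: each modality writes into its own copy of the bottlenecks.
        written = [self.write_attn(bn, tokens, tokens)[0] for tokens in modality_tokens]
        shared = torch.stack(written, dim=0).mean(dim=0)  # merge across modalities
        # Step 2: each modality reads its own tokens plus the shared bottlenecks.
        fused = []
        for tokens in modality_tokens:
            context = torch.cat([tokens, shared], dim=1)
            out, _ = self.read_attn(tokens, context, context)
            fused.append(out)
        return fused

if __name__ == "__main__":
    layer = BottleneckFusionLayer(dim=256)
    audio = torch.randn(2, 50, 256)   # (batch, tokens, dim)
    video = torch.randn(2, 196, 256)
    a_out, v_out = layer([audio, video])
    print(a_out.shape, v_out.shape)   # torch.Size([2, 50, 256]) torch.Size([2, 196, 256])
```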
This list is automatically generated from the titles and abstracts of the papers on this site.