Self-attention fusion for audiovisual emotion recognition with
incomplete data
- URL: http://arxiv.org/abs/2201.11095v1
- Date: Wed, 26 Jan 2022 18:04:29 GMT
- Title: Self-attention fusion for audiovisual emotion recognition with
incomplete data
- Authors: Kateryna Chumachenko, Alexandros Iosifidis, Moncef Gabbouj
- Abstract summary: We consider the problem of multimodal data analysis with a use case of audiovisual emotion recognition.
We propose an architecture capable of learning from raw data and describe three variants of it with distinct modality fusion mechanisms.
- Score: 103.70855797025689
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we consider the problem of multimodal data analysis with a use
case of audiovisual emotion recognition. We propose an architecture capable of
learning from raw data and describe three variants of it with distinct modality
fusion mechanisms. While most previous works consider the ideal scenario in which
both modalities are present at all times during inference, we evaluate the
robustness of the model in unconstrained settings where one modality is absent
or noisy, and propose a method to mitigate these limitations in the form of
modality dropout. Most importantly, we find that this approach not only improves
performance drastically when one modality is absent or noisy, but also improves
performance in the standard ideal setting, outperforming competing methods.
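The modality dropout mentioned in the abstract can be pictured as randomly suppressing one modality's representation during training so the fusion layers learn not to over-rely on either stream. Below is a minimal sketch in PyTorch, assuming pooled (batch, dim) embeddings per modality; the module name, drop probability, and the at-most-one-modality-dropped rule are illustrative assumptions, not the authors' exact procedure.

```python
import torch
import torch.nn as nn


class ModalityDropout(nn.Module):
    """Randomly zero out one modality's embedding during training (sketch)."""

    def __init__(self, p: float = 0.25):
        super().__init__()
        self.p = p  # per-sample probability of dropping a modality

    def forward(self, audio: torch.Tensor, video: torch.Tensor):
        # audio, video: (batch, dim) pooled embeddings from the two encoders.
        if self.training:
            drop_audio = torch.rand(audio.size(0), device=audio.device) < self.p
            # Never drop both modalities for the same sample.
            drop_video = (torch.rand(video.size(0), device=video.device) < self.p) & ~drop_audio
            audio = audio * (~drop_audio).float().unsqueeze(1)
            video = video * (~drop_video).float().unsqueeze(1)
        return audio, video
```

One way to handle a modality that is missing at inference is to pass a zero tensor in its place, matching what the fusion layers saw under dropout during training.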
Related papers
- Robust Latent Representation Tuning for Image-text Classification [9.789498730131607]
We propose a robust latent representation tuning method for large models.
Our approach introduces a modality latent translation module to maximize the correlation between modalities, resulting in a robust representation.
Within this framework, common semantics are refined during training, and robust performance is achieved even in the absence of one modality.
arXiv Detail & Related papers (2024-06-10T06:29:00Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Diffusion Action Segmentation [63.061058214427085]
We propose a novel framework via denoising diffusion models, which shares the same inherent spirit of such iterative refinement.
In this framework, action predictions are iteratively generated from random noise with input video features as conditions.
arXiv Detail & Related papers (2023-03-31T10:53:24Z)
- Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning [53.68371566336254]
We argue that the key to better performance lies in meaningful latent modality structures instead of perfect modality alignment.
Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization.
arXiv Detail & Related papers (2023-03-10T14:38:49Z)
- Flexible-modal Deception Detection with Audio-Visual Adapter [20.6514221670249]
We propose a novel framework to fuse temporal features across two modalities efficiently.
Experiments conducted on two benchmark datasets demonstrate that the proposed method can achieve superior performance.
arXiv Detail & Related papers (2023-02-11T15:47:20Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN)
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Discriminative Multimodal Learning via Conditional Priors in Generative Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training.
We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z)
- Robust Latent Representations via Cross-Modal Translation and Alignment [36.67937514793215]
Most multi-modal machine learning methods require that all the modalities used for training are also available for testing.
To address this limitation, we aim to improve the testing performance of uni-modal systems using multiple modalities during training only.
The proposed multi-modal training framework uses cross-modal translation and correlation-based latent space alignment (a minimal sketch of such an alignment term appears after this list).
arXiv Detail & Related papers (2020-11-03T11:18:04Z)
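As a companion to the last entry above ("Robust Latent Representations via Cross-Modal Translation and Alignment"), here is a hedged sketch of a correlation-style latent alignment term: paired audio and video embeddings are pushed toward agreement so that a uni-modal branch can benefit from the other modality at training time only. The function name and the cosine-similarity form are illustrative assumptions, not the exact objective from that paper.

```python
import torch
import torch.nn.functional as F


def latent_alignment_loss(audio_z: torch.Tensor, video_z: torch.Tensor) -> torch.Tensor:
    # audio_z, video_z: (batch, dim) latent vectors from the two modality encoders.
    # Returns 1 - mean cosine similarity over the batch; lower means better aligned.
    audio_z = F.normalize(audio_z, dim=-1)
    video_z = F.normalize(video_z, dim=-1)
    return 1.0 - (audio_z * video_z).sum(dim=-1).mean()
```

Such a term would typically be added to the task loss with a weighting coefficient during multimodal training.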
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.