Multi-modal Residual Perceptron Network for Audio-Video Emotion
Recognition
- URL: http://arxiv.org/abs/2107.10742v1
- Date: Wed, 21 Jul 2021 13:11:37 GMT
- Title: Multi-modal Residual Perceptron Network for Audio-Video Emotion
Recognition
- Authors: Xin Chang and W{\l}adys{\l}aw Skarbek
- Abstract summary: We propose a multi-modal Residual Perceptron Network (MRPN) which learns from multi-modal network branches, creating a deep feature representation with reduced noise.
For the proposed MRPN model and the novel time augmentation for streamed digital movies, the state-of-the-art average recognition rate was improved to 91.4%.
- Score: 0.22843885788439797
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Emotion recognition is an important research field for Human-Computer
Interaction (HCI). Audio-Video Emotion Recognition (AVER) is now commonly tackled
with Deep Neural Network (DNN) modeling tools. In published papers, authors as a
rule show only cases in which multiple modalities are superior to audio-only or
video-only modalities. However, cases where a single modality is superior can also
be found. In our research, we hypothesize that for fuzzy categories of emotional
events, the higher noise of one modality can amplify the lower noise of the second
modality, as represented indirectly in the parameters of the modeling neural
network. To avoid such cross-modal information interference, we define a
multi-modal Residual Perceptron Network (MRPN) which learns from multi-modal
network branches, creating a deep feature representation with reduced noise. With
the proposed MRPN model and a novel time augmentation for streamed digital movies,
the state-of-the-art average recognition rate was improved to 91.4% for the
Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset and
to 83.15% for the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA-D).
Moreover, the MRPN concept shows its potential for multi-modal classifiers dealing
with signal sources beyond the optical and acoustical types.
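The paper itself is not reproduced here, but the residual fusion idea described in the abstract can be pictured with a minimal PyTorch-style sketch. All layer sizes, the branch projections, and the classifier head below are illustrative assumptions, not the authors' exact MRPN architecture.

```python
import torch
import torch.nn as nn

class ResidualFusionBlock(nn.Module):
    """Illustrative residual multi-modal fusion: each branch keeps its own
    representation, and a fused correction term is added residually, so a
    noisy modality cannot silently overwrite the cleaner one."""
    def __init__(self, audio_dim=128, video_dim=256, fused_dim=128, num_classes=8):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, fused_dim)
        self.video_proj = nn.Linear(video_dim, fused_dim)
        self.fusion = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim), nn.ReLU(),
            nn.Linear(fused_dim, fused_dim),
        )
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, audio_feat, video_feat):
        a = torch.relu(self.audio_proj(audio_feat))
        v = torch.relu(self.video_proj(video_feat))
        residual = self.fusion(torch.cat([a, v], dim=-1))
        fused = a + v + residual  # residual correction on top of the branch sum
        return self.classifier(fused)

logits = ResidualFusionBlock()(torch.randn(4, 128), torch.randn(4, 256))
print(logits.shape)  # torch.Size([4, 8])
```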
Related papers
- Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment [11.063156506583562]
This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions.
Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis.
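As a rough illustration of this kind of two-backbone fusion: the placeholder encoders, embedding sizes, and concatenation-based head below are assumptions; the actual HTS-AT and R(2+1)D models are not reproduced.

```python
import torch
import torch.nn as nn

# Placeholder encoders standing in for the audio transformer and video CNN backbones.
audio_encoder = nn.Sequential(nn.Linear(64 * 100, 512), nn.ReLU())           # flattened multi-channel spectrogram
video_encoder = nn.Sequential(nn.Linear(3 * 16 * 32 * 32, 512), nn.ReLU())   # flattened video clip

classifier = nn.Linear(512 + 512, 7)  # 7 emotion classes, assumed

audio_emb = audio_encoder(torch.randn(2, 64 * 100))
video_emb = video_encoder(torch.randn(2, 3 * 16 * 32 * 32))
logits = classifier(torch.cat([audio_emb, video_emb], dim=-1))
print(logits.shape)  # torch.Size([2, 7])
```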
arXiv Detail & Related papers (2024-09-14T21:58:39Z)
- A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition [53.800937914403654]
Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames.
While applying the dropout technique to the video modality enhances robustness to missing frames, it simultaneously results in a performance loss when dealing with complete data input.
We propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality.
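A minimal sketch of the modality-dropout idea mentioned above; the MDA-KD distillation itself is not reproduced, and the drop probability and tensor shapes are illustrative.

```python
import torch

def drop_video_modality(video_frames: torch.Tensor, p: float = 0.3) -> torch.Tensor:
    """Randomly zero out the whole video stream for a fraction of training
    samples, simulating missing frames so the model cannot over-rely on video."""
    if torch.rand(()) < p:
        return torch.zeros_like(video_frames)
    return video_frames

clip = torch.randn(16, 3, 96, 96)  # 16 frames at an illustrative resolution
maybe_dropped = drop_video_modality(clip)
```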
arXiv Detail & Related papers (2024-03-07T06:06:55Z)
- Hypernetworks build Implicit Neural Representations of Sounds [18.28957270390735]
Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals across various real-life applications, including image super-resolution, image compression, or 3D rendering.
Existing methods that leverage INRs are predominantly focused on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases present in architectural attributes of image-based INR models.
We introduce HyperSound, the first meta-learning approach to produce INRs for audio samples that leverages hypernetworks to generalize beyond samples observed in training.
Our approach reconstructs audio samples with quality comparable to other state-of-the-art models.
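A toy version of the hypernetwork idea: an audio-sample embedding is mapped to the weights of a tiny INR that turns a time coordinate into an amplitude. The layer sizes, sine activation, and embedding input are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TinyHyperNetwork(nn.Module):
    """Map an audio embedding to the weights of a small 1 -> hidden -> 1 INR."""
    def __init__(self, emb_dim=64, hidden=32):
        super().__init__()
        self.hidden = hidden
        n_params = (1 * hidden + hidden) + (hidden * 1 + 1)  # weights + biases of both layers
        self.generator = nn.Linear(emb_dim, n_params)

    def forward(self, emb, t):
        h = self.hidden
        p = self.generator(emb)
        w1, b1 = p[:h].view(h, 1), p[h:2 * h]
        w2, b2 = p[2 * h:3 * h].view(1, h), p[3 * h:]
        x = torch.sin(t @ w1.T + b1)  # sine activation, common in INRs
        return x @ w2.T + b2

amp = TinyHyperNetwork()(torch.randn(64), torch.linspace(0, 1, 100).unsqueeze(1))
print(amp.shape)  # torch.Size([100, 1])
```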
arXiv Detail & Related papers (2023-02-09T22:24:26Z)
- Modality-Agnostic Variational Compression of Implicit Neural Representations [96.35492043867104]
We introduce a modality-agnostic neural compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR).
Bridging the gap between latent coding and sparsity, we obtain compact latent representations non-linearly mapped to a soft gating mechanism.
After obtaining a dataset of such latent representations, we directly optimise the rate/distortion trade-off in a modality-agnostic space using neural compression.
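Roughly, the soft gating mentioned above can be pictured as follows; the latent size, the sigmoid gate, and the sparsity penalty are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

latent = torch.randn(8, 16, requires_grad=True)             # per-signal latent code (illustrative size)
gate_net = nn.Sequential(nn.Linear(16, 16), nn.Sigmoid())   # maps the latent to soft gates in (0, 1)

gates = gate_net(latent)
sparsity_penalty = gates.abs().mean()  # encourages the gated code to stay compact
print(gates.shape, float(sparsity_penalty))
```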
arXiv Detail & Related papers (2023-01-23T15:22:42Z)
- HyperSound: Generating Implicit Neural Representations of Audio Signals with Hypernetworks [23.390919506056502]
Implicit neural representations (INRs) are a rapidly growing research field, which provides alternative ways to represent multimedia signals.
We propose HyperSound, a meta-learning method leveraging hypernetworks to produce INRs for audio signals unseen at training time.
We show that our approach can reconstruct sound waves with quality comparable to other state-of-the-art models.
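For context, once such an INR exists it is simply a function of time; a toy evaluation might look like the following. The small tanh network stands in for the real INR and is an assumption, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Toy INR: time coordinate in, amplitude out; in HyperSound its weights would come from the hypernetwork.
inr = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))

t = torch.linspace(0.0, 1.0, 16_000).unsqueeze(1)  # one second at 16 kHz
waveform = inr(t).squeeze(1)                        # reconstructed audio samples
print(waveform.shape)  # torch.Size([16000])
```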
arXiv Detail & Related papers (2022-11-03T14:20:32Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
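As a sketch of a margin-based triplet loss of the kind mentioned: the fixed margin below stands in for M2FNet's adaptive margin, whose exact form is not given here.

```python
import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Pull same-emotion embeddings together and push different-emotion ones apart."""
    d_pos = F.pairwise_distance(anchor, positive)
    d_neg = F.pairwise_distance(anchor, negative)
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()

loss = triplet_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 128))
```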
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
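A very rough sketch of how visual features can be injected into a multi-channel enhancement front end: the mask-based formulation, feature shapes, and single fully-connected enhancer below are assumptions and do not reproduce the paper's system.

```python
import torch
import torch.nn as nn

# Illustrative shapes: 4-microphone log-spectrogram frames plus lip-region visual features.
audio_feat = torch.randn(2, 100, 4 * 257)   # (batch, frames, channels * freq bins)
visual_feat = torch.randn(2, 100, 512)      # (batch, frames, lip embedding)

enhancer = nn.Sequential(
    nn.Linear(4 * 257 + 512, 512), nn.ReLU(),
    nn.Linear(512, 257), nn.Sigmoid(),      # per-bin mask for one enhanced channel
)
mask = enhancer(torch.cat([audio_feat, visual_feat], dim=-1))
print(mask.shape)  # torch.Size([2, 100, 257])
```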
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
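Late fusion of the two fine-tuned models can be pictured roughly as below; averaging class probabilities is one common choice, and the number of classes is assumed rather than taken from the paper.

```python
import torch

speech_logits = torch.randn(4, 4)  # from the fine-tuned speaker-recognition model (4 emotion classes, assumed)
text_logits = torch.randn(4, 4)    # from the fine-tuned BERT-based model

# Simple late fusion: average the per-class probabilities of both models.
probs = (speech_logits.softmax(-1) + text_logits.softmax(-1)) / 2
prediction = probs.argmax(dim=-1)
```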
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- DeepMSRF: A novel Deep Multimodal Speaker Recognition framework with Feature selection [2.495606047371841]
We propose DeepMSRF, Deep Multimodal Speaker Recognition with Feature selection.
We execute DeepMSRF by feeding it features of the two modalities, namely speakers' audio and face images.
The goal of DeepMSRF is to identify the gender of the speaker first, and further to recognize his or her name for any given video stream.
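The two-stage idea (gender first, then identity) can be sketched as follows; the feature size, class counts, and per-gender classifier split are assumptions for illustration.

```python
import torch
import torch.nn as nn

fused = torch.randn(1, 1024)                                  # concatenated audio + face features (illustrative)
gender_head = nn.Linear(1024, 2)                              # stage 1: predict gender
id_heads = {0: nn.Linear(1024, 50), 1: nn.Linear(1024, 50)}   # stage 2: per-gender identity classifiers

gender = gender_head(fused).argmax(dim=-1).item()
speaker_id = id_heads[gender](fused).argmax(dim=-1).item()
```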
arXiv Detail & Related papers (2020-07-14T04:28:12Z)
- Self-Supervised MultiModal Versatile Networks [76.19886740072808]
We learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams.
We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks.
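Self-supervision across modality pairs is typically expressed with a contrastive objective; the InfoNCE-style loss below is a generic illustration applied to one modality pair, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def info_nce(x, y, temperature=0.07):
    """Match each video embedding with its own audio (or text) embedding,
    against the other samples in the batch."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.T / temperature
    targets = torch.arange(x.size(0))
    return F.cross_entropy(logits, targets)

loss = info_nce(torch.randn(16, 256), torch.randn(16, 256))  # e.g. video vs. audio embeddings
```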
arXiv Detail & Related papers (2020-06-29T17:50:23Z)
- Modality Compensation Network: Cross-Modal Adaptation for Action Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
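The modality adaptation idea can be pictured as projecting source-modality features toward the auxiliary modality's space with an alignment loss; the feature sizes and the MSE criterion below are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

source_feat = torch.randn(8, 2048)     # e.g. RGB features available at test time
auxiliary_feat = torch.randn(8, 256)   # e.g. skeleton features available only during training

adapt_block = nn.Sequential(nn.Linear(2048, 256), nn.ReLU(), nn.Linear(256, 256))
adapted = adapt_block(source_feat)
alignment_loss = nn.functional.mse_loss(adapted, auxiliary_feat)  # pull the modalities together
```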
arXiv Detail & Related papers (2020-01-31T04:51:55Z)