Versatile Audio-Visual Learning for Handling Single and Multi Modalities
in Emotion Regression and Classification Tasks
- URL: http://arxiv.org/abs/2305.07216v1
- Date: Fri, 12 May 2023 03:13:37 GMT
- Title: Versatile Audio-Visual Learning for Handling Single and Multi Modalities
in Emotion Regression and Classification Tasks
- Authors: Lucas Goncalves, Seong-Gyun Leem, Wei-Cheng Lin, Berrak Sisman, Carlos
Busso
- Abstract summary: This study proposes a versatile audio-visual learning (VAVL) framework for handling unimodal and multimodal systems.
We implement an audio-visual framework that can be trained even when paired audio-visual data are not available.
VAVL attains a new state-of-the-art performance in the emotional attribute prediction task on the MSP-IMPROV corpus.
- Score: 28.03046198108713
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Most current audio-visual emotion recognition models lack the flexibility
needed for deployment in practical applications. We envision a multimodal
system that works even when only one modality is available and can be
implemented interchangeably for either predicting emotional attributes or
recognizing categorical emotions. Achieving such flexibility in a multimodal
emotion recognition system is difficult due to the inherent challenges in
accurately interpreting and integrating varied data sources. It is also a
challenge to robustly handle missing or partial information while allowing a
direct switch between regression and classification tasks. This study proposes
a \emph{versatile audio-visual learning} (VAVL) framework for handling unimodal
and multimodal systems for emotion regression and emotion classification tasks.
We implement an audio-visual framework that can be trained even when paired
audio-visual data are not available for part of the training set (i.e., only
audio or only video is present). We achieve this effective representation
learning with audio-visual shared layers, residual connections over shared
layers, and a unimodal reconstruction task. Our experimental results reveal
that our architecture significantly outperforms strong baselines on both the
CREMA-D and MSP-IMPROV corpora. Notably, VAVL attains a new state-of-the-art
performance in the emotional attribute prediction task on the MSP-IMPROV
corpus. Code available at: https://github.com/ilucasgoncalves/VAVL
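The flexibility described above can be illustrated with a toy sketch. This is a hypothetical simplification, not the authors' implementation: stand-in unimodal encoders feed a shared layer, a residual connection adds the shared-layer output back to the encoder output, and the forward pass accepts audio, video, or both.

```python
# Toy sketch (hypothetical, not the VAVL code): modality-specific encoders,
# an audio-visual shared layer, residual connections over the shared layer,
# and a forward pass that works with either modality alone or with both.

def encode(x, weight):
    """Toy unimodal encoder: scale each feature (stand-in for a real network)."""
    return [weight * v for v in x]

def shared_layer(h, weight=0.5):
    """Toy audio-visual shared layer applied to either modality's features."""
    return [weight * v for v in h]

def forward(audio=None, video=None):
    """Accepts audio, video, or both; returns a fused representation.
    Residual connection: shared-layer output is added back to the encoder output.
    """
    reps = []
    if audio is not None:
        h_a = encode(audio, weight=1.0)
        reps.append([h + s for h, s in zip(h_a, shared_layer(h_a))])
    if video is not None:
        h_v = encode(video, weight=2.0)
        reps.append([h + s for h, s in zip(h_v, shared_layer(h_v))])
    if not reps:
        raise ValueError("at least one modality is required")
    # Mean fusion over whichever modality representations are available.
    return [sum(vals) / len(reps) for vals in zip(*reps)]

# Works with both modalities, or with either one alone.
both = forward(audio=[1.0, 2.0], video=[1.0, 2.0])
audio_only = forward(audio=[1.0, 2.0])
video_only = forward(video=[1.0, 2.0])
```

The key design point the abstract describes is that the shared layers see features from whichever modality is present, so missing-modality samples still contribute gradients to the shared parameters.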
Related papers
- Adversarial Representation with Intra-Modal and Inter-Modal Graph
Contrastive Learning for Multimodal Emotion Recognition [15.4676247289299]
We propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method.
Firstly, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces.
Secondly, we build a generator and a discriminator for the three modal features through adversarial representation.
Thirdly, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information.
arXiv Detail & Related papers (2023-12-28T01:57:26Z)
- EMERSK -- Explainable Multimodal Emotion Recognition with Situational Knowledge [0.0]
We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK)
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
arXiv Detail & Related papers (2023-06-14T17:52:37Z)
- Multimodal Emotion Recognition with Modality-Pairwise Unsupervised Contrastive Loss [80.79641247882012]
We focus on unsupervised feature learning for Multimodal Emotion Recognition (MER)
We consider discrete emotions, and as modalities text, audio and vision are used.
Our method, being based on a contrastive loss between pairwise modalities, is the first such attempt in the MER literature.
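A contrastive loss between paired modality embeddings can be sketched as follows. This is an illustrative InfoNCE-style form, not the paper's code: matched embeddings from two modalities are pulled together while mismatched pairs in the batch are pushed apart.

```python
# Illustrative sketch (not the paper's implementation) of a contrastive loss
# between paired modality embeddings: mod_a[i] and mod_b[i] encode the same
# sample in two modalities and should be more similar than mismatched pairs.
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def pairwise_contrastive_loss(mod_a, mod_b, temperature=1.0):
    """InfoNCE-style loss: for each sample, the matched cross-modal pair
    should dominate the softmax over all candidates in the batch."""
    loss = 0.0
    for i, a in enumerate(mod_a):
        sims = [math.exp(dot(a, b) / temperature) for b in mod_b]
        loss += -math.log(sims[i] / sum(sims))  # matched pair in numerator
    return loss / len(mod_a)

# Aligned embeddings yield a lower loss than shuffled (mismatched) ones.
aligned = pairwise_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
shuffled = pairwise_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                     [[0.0, 1.0], [1.0, 0.0]])
```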
arXiv Detail & Related papers (2022-07-23T10:11:24Z)
- M2FNet: Multi-modal Fusion Network for Emotion Recognition in Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
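The margin-based triplet loss mentioned above has the following basic form. Note this sketch uses a fixed margin for clarity; M2FNet's adaptive-margin variant adjusts the margin per triplet and is more involved.

```python
# Hedged sketch of a margin-based triplet loss on embedding vectors
# (illustrative fixed-margin form, not M2FNet's adaptive-margin version):
# pull same-emotion pairs together, push different-emotion pairs apart.
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Zero loss once the negative is at least `margin` farther
    from the anchor than the positive is."""
    return max(0.0, euclidean(anchor, positive)
                    - euclidean(anchor, negative) + margin)

# An easy triplet (negative already far away) contributes zero loss.
loss = triplet_loss([0.0, 0.0], [0.1, 0.0], [5.0, 0.0])
```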
arXiv Detail & Related papers (2022-06-05T14:18:58Z)
- i-Code: An Integrative and Composable Multimodal Learning Framework [99.56065789066027]
i-Code is a self-supervised pretraining framework where users may flexibly combine the modalities of vision, speech, and language into unified and general-purpose vector representations.
The entire system is pretrained end-to-end with new objectives including masked modality unit modeling and cross-modality contrastive learning.
Experimental results demonstrate how i-Code can outperform state-of-the-art techniques on five video understanding tasks and the GLUE NLP benchmark, improving by as much as 11%.
arXiv Detail & Related papers (2022-05-03T23:38:50Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Audio-Visual Fusion Layers for Event Type Aware Video Recognition [86.22811405685681]
We propose a new model to address the multisensory integration problem with individual event-specific layers in a multi-task learning scheme.
We show that our network is formulated with single labels, but it can output additional true multi-labels to represent the given videos.
arXiv Detail & Related papers (2022-02-12T02:56:22Z)
- TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation [35.93516937521393]
We introduce TriBERT -- a transformer-based architecture inspired by ViLBERT.
TriBERT enables contextual feature learning across three modalities: vision, pose, and audio.
We show that the learned TriBERT representations are generic and significantly improve performance on other audio-visual tasks.
arXiv Detail & Related papers (2021-10-26T04:50:42Z)
- Distilling Audio-Visual Knowledge by Compositional Contrastive Learning [51.20935362463473]
We learn a compositional embedding that closes the cross-modal semantic gap.
We establish a new, comprehensive multi-modal distillation benchmark on three video datasets.
arXiv Detail & Related papers (2021-04-22T09:31:20Z)
- Self-Supervised learning with cross-modal transformers for emotion recognition [20.973999078271483]
Self-supervised learning has shown improvements on tasks with limited labeled datasets in domains like speech and natural language.
In this work, we extend self-supervised training to multi-modal applications.
arXiv Detail & Related papers (2020-11-20T21:38:34Z)
- Visually Guided Self Supervised Learning of Speech Representations [62.23736312957182]
We propose a framework for learning audio representations guided by the visual modality in the context of audiovisual speech.
We employ a generative audio-to-video training scheme in which we animate a still image corresponding to a given audio clip and optimize the generated video to be as close as possible to the real video of the speech segment.
We achieve state-of-the-art results for emotion recognition and competitive results for speech recognition.
arXiv Detail & Related papers (2020-01-13T14:53:22Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.