FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video
Emotion Recognition Inference
- URL: http://arxiv.org/abs/2209.10170v1
- Date: Wed, 21 Sep 2022 08:05:26 GMT
- Title: FV2ES: A Fully End2End Multimodal System for Fast Yet Effective Video
Emotion Recognition Inference
- Authors: Qinglan Wei, Xuling Huang, Yuan Zhang
- Abstract summary: In this paper, we design a fully end-to-end multimodal video-to-emotion system (FV2ES) for fast yet effective recognition inference.
The adoption of the hierarchical attention method upon the sound spectra breaks through the limited contribution of the acoustic modality.
The further integration of data pre-processing into the aligned multimodal learning model enables a significant reduction in computational cost and storage space.
- Score: 6.279057784373124
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the latest social networks, more and more people prefer to express their
emotions in videos through text, speech, and rich facial expressions.
Multimodal video emotion analysis techniques can help understand users' inner
world automatically based on human expressions and gestures in images, tones in
voices, and recognized natural language. However, in existing research, the
acoustic modality has long occupied a marginal position compared with the
visual and textual modalities; that is, it tends to be more difficult to
improve the contribution of the acoustic modality to the whole multimodal
emotion recognition task. In addition, although better performance can be
obtained by introducing common deep learning methods, the complex structures
of these models always lead to low inference efficiency, especially on
high-resolution, long videos. Moreover, the lack of a fully end-to-end
multimodal video emotion recognition system hinders practical application. In
this paper, we design a fully end-to-end multimodal video-to-emotion system
(named FV2ES) for fast yet effective recognition inference, whose
benefits are threefold: (1) The adoption of the hierarchical attention method
upon the sound spectra breaks through the limited contribution of the acoustic
modality and outperforms the existing models' performance on both IEMOCAP and
CMU-MOSEI datasets; (2) the multi-scale design for visual feature extraction,
paired with a single-branch structure for inference, brings higher efficiency
while maintaining prediction accuracy; and (3) the further integration of data
pre-processing into the aligned multimodal learning model enables a significant
reduction in computational cost and storage space.
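
As a rough illustration of point (1), the sketch below shows a two-level ("hierarchical") attention pooling over sound-spectrum patches in PyTorch: attention is applied first within each spectrogram frame, then across the resulting frame summaries for the whole clip. All module names, tensor shapes, and hyper-parameters here are illustrative assumptions, not the authors' exact FV2ES architecture.

```python
# Minimal sketch of hierarchical attention over sound spectra (assumed layout:
# patches within frames, frames within a clip); not the official FV2ES code.
import torch
import torch.nn as nn


class AttentionPool(nn.Module):
    """Soft-attention pooling over a sequence of feature vectors."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -> (batch, dim)
        weights = torch.softmax(self.score(x), dim=1)
        return (weights * x).sum(dim=1)


class HierarchicalSpectrumEncoder(nn.Module):
    """Attend over patches within each spectrogram frame,
    then over the frame summaries for the whole clip."""

    def __init__(self, patch_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.proj = nn.Linear(patch_dim, hidden)
        self.patch_pool = AttentionPool(hidden)  # low level: patches -> frame
        self.frame_pool = AttentionPool(hidden)  # high level: frames -> clip

    def forward(self, spectra: torch.Tensor) -> torch.Tensor:
        # spectra: (batch, n_frames, n_patches, patch_dim)
        b, f, p, _ = spectra.shape
        h = torch.relu(self.proj(spectra))              # (b, f, p, hidden)
        frames = self.patch_pool(h.view(b * f, p, -1))  # (b*f, hidden)
        frames = frames.view(b, f, -1)                  # (b, f, hidden)
        return self.frame_pool(frames)                  # (b, hidden) acoustic embedding


if __name__ == "__main__":
    dummy = torch.randn(2, 8, 16, 128)  # (batch, frames, patches, patch_dim)
    print(HierarchicalSpectrumEncoder()(dummy).shape)  # torch.Size([2, 256])
```

In a full system, the pooled acoustic embedding would be fused with the visual and textual branches; the dummy call at the bottom only checks tensor shapes.
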
Related papers
- Multi-Microphone and Multi-Modal Emotion Recognition in Reverberant Environment [11.063156506583562]
This paper presents a Multi-modal Emotion Recognition (MER) system designed to enhance emotion recognition accuracy in challenging acoustic conditions.
Our approach combines a modified and extended Hierarchical Token-semantic Audio Transformer (HTS-AT) for multi-channel audio processing with an R(2+1)D Convolutional Neural Network (CNN) model for video analysis.
arXiv Detail & Related papers (2024-09-14T21:58:39Z) - MicroEmo: Time-Sensitive Multimodal Emotion Recognition with Micro-Expression Dynamics in Video Dialogues [0.0]
We propose a time-sensitive Multimodal Large Language Model (MLLM) aimed at directing attention to the local facial micro-expression dynamics.
Our model incorporates two key architectural contributions: (1) a global-local attention visual encoder that integrates global frame-level, timestamp-bound image features with local facial features capturing the temporal dynamics of micro-expressions; and (2) an utterance-aware video Q-Former that captures multi-scale and contextual dependencies by generating visual token sequences for each utterance segment and for the entire video, and then combining them.
arXiv Detail & Related papers (2024-07-23T15:05:55Z) - CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion [58.15403987979496]
CREMA is a generalizable, highly efficient, and modular modality-fusion framework for video reasoning.
We propose a novel progressive multimodal fusion design supported by a lightweight fusion module and modality-sequential training strategy.
We validate our method on 7 video-language reasoning tasks assisted by diverse modalities, including VideoQA and Video-Audio/3D/Touch/Thermal QA.
arXiv Detail & Related papers (2024-02-08T18:27:22Z) - Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept -Missing Modality Token (MMT)-to maintain performance even when modalities are absent.
Our method mitigates the performance loss, reducing it from its original ~30% drop to only ~10% when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z) - Improving Audio-Visual Speech Recognition by Lip-Subword Correlation
Based Visual Pre-training and Cross-Modal Fusion Encoder [58.523884148942166]
We propose two novel techniques to improve audio-visual speech recognition (AVSR) under a pre-training and fine-tuning training framework.
First, we explore the correlation between lip shapes and syllable-level subword units in Mandarin to establish good frame-level syllable boundaries from lip shapes.
Next, we propose an audio-guided cross-modal fusion encoder (CMFE) neural network to utilize main training parameters for multiple cross-modal attention layers.
arXiv Detail & Related papers (2023-08-14T08:19:24Z) - Video-based Person Re-identification with Long Short-Term Representation
Learning [101.62570747820541]
Video-based person Re-Identification (V-ReID) aims to retrieve specific persons from raw videos captured by non-overlapping cameras.
We propose a novel deep learning framework named Long Short-Term Representation Learning (LSTRL) for effective V-ReID.
arXiv Detail & Related papers (2023-08-07T16:22:47Z) - Versatile audio-visual learning for emotion recognition [28.26077129002198]
This study proposes a versatile audio-visual learning framework for handling unimodal and multimodal systems.
We achieve this effective representation learning with audio-visual shared layers, residual connections over shared layers, and a unimodal reconstruction task.
Notably, VAVL attains new state-of-the-art performance on the emotion prediction task on the MSP-IMPROV corpus.
arXiv Detail & Related papers (2023-05-12T03:13:37Z) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in
Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from the visual, audio, and text modalities.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Deep Auto-Encoders with Sequential Learning for Multimodal Dimensional
Emotion Recognition [38.350188118975616]
We propose a novel deep neural network architecture consisting of a two-stream auto-encoder and a long short-term memory (LSTM) network for emotion recognition.
We carry out extensive experiments on the multimodal emotion-in-the-wild dataset RECOLA.
Experimental results show that the proposed method achieves state-of-the-art recognition performance and surpasses existing schemes by a significant margin.
arXiv Detail & Related papers (2020-04-28T01:25:00Z)