MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
- URL: http://arxiv.org/abs/2306.09361v3
- Date: Wed, 26 Jun 2024 04:54:14 GMT
- Title: MFSN: Multi-perspective Fusion Search Network For Pre-training Knowledge in Speech Emotion Recognition
- Authors: Haiyang Sun, Fulin Zhang, Yingying Gao, Zheng Lian, Shilei Zhang, Junlan Feng,
- Abstract summary: Speech Emotion Recognition (SER) is an important research topic in human-computer interaction.
We propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN)
Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC)
- Score: 18.38506185117551
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech Emotion Recognition (SER) is an important research topic in human-computer interaction. Many recent works focus on directly extracting emotional cues through pre-trained knowledge, frequently overlooking considerations of appropriateness and comprehensiveness. Therefore, we propose a novel framework for pre-training knowledge in SER, called Multi-perspective Fusion Search Network (MFSN). Considering comprehensiveness, we partition speech knowledge into Textual-related Emotional Content (TEC) and Speech-related Emotional Content (SEC), capturing cues from both semantic and acoustic perspectives, and we design a new architecture search space to fully leverage them. Considering appropriateness, we verify the efficacy of different modeling approaches in capturing SEC and fills the gap in current research. Experimental results on multiple datasets demonstrate the superiority of MFSN.
Related papers
- Layer-Wise Analysis of Self-Supervised Acoustic Word Embeddings: A Study
on Speech Emotion Recognition [54.952250732643115]
We study Acoustic Word Embeddings (AWEs), a fixed-length feature derived from continuous representations, to explore their advantages in specific tasks.
AWEs have previously shown utility in capturing acoustic discriminability.
Our findings underscore the acoustic context conveyed by AWEs and showcase the highly competitive Speech Emotion Recognition accuracies.
arXiv Detail & Related papers (2024-02-04T21:24:54Z) - TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation [0.78452977096722]
TelME incorporates cross-modal knowledge distillation to transfer information from a language model acting as the teacher to the non-verbal students.
We then combine multimodal features using a shifting fusion approach in which student networks support the teacher.
arXiv Detail & Related papers (2024-01-16T07:18:41Z) - MSAC: Multiple Speech Attribute Control Method for Reliable Speech Emotion Recognition [7.81011775615268]
We introduce MSAC-SERNet, a novel unified SER framework capable of simultaneously handling both single-corpus and cross-corpus SER.
Considering information overlap between various speech attributes, we propose a novel learning paradigm based on correlations of different speech attributes.
Experiments on both single-corpus and cross-corpus SER scenarios indicate that MSAC-SERNet achieves superior performance compared to state-of-the-art SER approaches.
arXiv Detail & Related papers (2023-08-08T03:43:24Z) - EMERSK -- Explainable Multimodal Emotion Recognition with Situational
Knowledge [0.0]
We present Explainable Multimodal Emotion Recognition with Situational Knowledge (EMERSK)
EMERSK is a general system for human emotion recognition and explanation using visual information.
Our system can handle multiple modalities, including facial expressions, posture, and gait in a flexible and modular manner.
arXiv Detail & Related papers (2023-06-14T17:52:37Z) - Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on
Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z) - M2FNet: Multi-modal Fusion Network for Emotion Recognition in
Conversation [1.3864478040954673]
We propose a Multi-modal Fusion Network (M2FNet) that extracts emotion-relevant features from visual, audio, and text modality.
It employs a multi-head attention-based fusion mechanism to combine emotion-rich latent representations of the input data.
The proposed feature extractor is trained with a novel adaptive margin-based triplet loss function to learn emotion-relevant features from the audio and visual data.
arXiv Detail & Related papers (2022-06-05T14:18:58Z) - MMER: Multimodal Multi-task Learning for Speech Emotion Recognition [48.32879363033598]
MMER is a novel Multimodal Multi-task learning approach for Speech Emotion Recognition.
In practice, MMER achieves all our baselines and state-of-the-art performance on the IEMOCAP benchmark.
arXiv Detail & Related papers (2022-03-31T04:51:32Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - MEmoBERT: Pre-training Model with Prompt-based Learning for Multimodal
Emotion Recognition [118.73025093045652]
We propose a pre-training model textbfMEmoBERT for multimodal emotion recognition.
Unlike the conventional "pre-train, finetune" paradigm, we propose a prompt-based method that reformulates the downstream emotion classification task as a masked text prediction.
Our proposed MEmoBERT significantly enhances emotion recognition performance.
arXiv Detail & Related papers (2021-10-27T09:57:00Z) - An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and
Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
Deep learning has been exploited to achieve strong performance.
arXiv Detail & Related papers (2020-08-21T17:24:09Z) - Temporal aggregation of audio-visual modalities for emotion recognition [0.5352699766206808]
We propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality.
Our proposed method outperforms other methods from the literature and human accuracy rating.
arXiv Detail & Related papers (2020-07-08T18:44:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.