GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion
Causality for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2210.15834v1
- Date: Fri, 28 Oct 2022 02:00:40 GMT
- Title: GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion
Causality for Speech Emotion Recognition
- Authors: Jia-Xin Ye, Xin-Cheng Wen, Xuan-Ze Wang, Yong Xu, Yan Luo, Chang-Li
Wu, Li-Yan Chen, Kun-Hong Liu
- Abstract summary: We propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) built around a novel emotional causality representation learning component that captures the dynamics of emotion across the time domain.
Our model maintains the highest performance in most cases compared to state-of-the-art techniques.
- Score: 14.700043991797537
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In human-computer interaction, Speech Emotion Recognition (SER) plays an
essential role in understanding the user's intent and improving the interactive
experience. Because speeches conveying similar emotions vary in speaker
characteristics yet share common antecedents and consequences, an essential
challenge for SER is how to produce robust and discriminative representations
from the causality between speech emotions. In this paper, we propose a Gated
Multi-scale Temporal Convolutional Network (GM-TCNet) built around a novel
emotional causality representation learning component with a multi-scale
receptive field. This component, constructed from dilated causal convolution
layers and a gating mechanism, captures the dynamics of emotion across the
time domain. In addition, GM-TCNet uses skip connections to fuse high-level
features from different gated convolution blocks, capturing abundant and
subtle emotion changes in human speech. GM-TCNet takes a single type of
feature, mel-frequency cepstral coefficients (MFCCs), as input and passes it
through the gated temporal convolutional module to generate high-level
features. Finally, these features are fed to the emotion classifier to
accomplish the SER task. The experimental results show that our model
maintains the highest performance in most cases compared to state-of-the-art
techniques.
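To make the pipeline concrete, below is a minimal PyTorch sketch of a gated multi-scale temporal convolutional stack of the kind the abstract describes: dilated causal convolutions with a gating mechanism, skip connections fusing the block outputs, and a classifier head over MFCC input. The dilation rates, channel width, kernel size, and pooling head are illustrative assumptions, not the authors' exact GM-TCNet configuration.

```python
# Illustrative sketch only: layer sizes, dilations, and the classifier head
# are assumptions, not the published GM-TCNet configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedCausalBlock(nn.Module):
    """Dilated causal convolution with a tanh/sigmoid gating mechanism."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        # Left-pad so each output frame depends only on past and current
        # input frames (causality), never on future ones.
        self.pad = (kernel_size - 1) * dilation
        self.filt = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = F.pad(x, (self.pad, 0))  # causal (left-only) padding
        return torch.tanh(self.filt(y)) * torch.sigmoid(self.gate(y))


class GatedTCN(nn.Module):
    """Stacks gated blocks at growing dilations; skip connections fuse them."""

    def __init__(self, n_mfcc: int = 39, channels: int = 64,
                 dilations: tuple = (1, 2, 4, 8), n_emotions: int = 4):
        super().__init__()
        self.proj = nn.Conv1d(n_mfcc, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            [GatedCausalBlock(channels, kernel_size=2, dilation=d)
             for d in dilations]
        )
        self.classifier = nn.Linear(channels, n_emotions)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, frames); in practice it could come from
        # librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=39).
        x = self.proj(mfcc)
        skip = torch.zeros_like(x)
        for block in self.blocks:
            x = block(x)
            skip = skip + x  # skip connections fuse every block's output
        return self.classifier(skip.mean(dim=-1))  # pool over time, classify


logits = GatedTCN()(torch.randn(8, 39, 300))  # 8 utterances, 300 MFCC frames
print(logits.shape)  # torch.Size([8, 4])
```

The tanh/sigmoid product is the WaveNet-style gating commonly used in gated TCNs, and summing the block outputs is one simple way to realize the skip-connection fusion the abstract mentions.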
Related papers
- Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving its non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Dynamic Causal Disentanglement Model for Dialogue Emotion Detection [77.96255121683011]
We propose a Dynamic Causal Disentanglement Model based on hidden variable separation.
This model effectively decomposes the content of dialogues and investigates the temporal accumulation of emotions.
Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables.
arXiv Detail & Related papers (2023-09-13T12:58:09Z)
- EmotionIC: Emotional Inertia and Contagion-driven Dependency Modeling for Emotion Recognition in Conversation [34.24557248359872]
We propose an emotional inertia and contagion-driven dependency modeling approach (EmotionIC) for ERC task.
Our EmotionIC consists of three main components, i.e., Identity Masked Multi-Head Attention (IMMHA), Dialogue-based Gated Recurrent Unit (DiaGRU), and Skip-chain Conditional Random Field (SkipCRF).
Experimental results show that our method can significantly outperform the state-of-the-art models on four benchmark datasets.
arXiv Detail & Related papers (2023-03-20T13:58:35Z)
- Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach for Speech Emotion Recognition [23.13759265661777]
Speech emotion recognition (SER) plays a vital role in improving interactions between humans and machines.
We introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net).
arXiv Detail & Related papers (2022-11-14T13:35:01Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the interactive emotional dyadic motion capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- Emotion Recognition from Multiple Modalities: Fundamentals and Methodologies [106.62835060095532]
We discuss several key aspects of multi-modal emotion recognition (MER).
We begin with a brief introduction on widely used emotion representation models and affective modalities.
We then summarize existing emotion annotation strategies and corresponding computational tasks.
Finally, we outline several real-world applications and discuss some future directions.
arXiv Detail & Related papers (2021-08-18T21:55:20Z)
- Seen and Unseen emotional style transfer for voice conversion with a new emotional speech dataset [84.53659233967225]
Emotional voice conversion aims to transform emotional prosody in speech while preserving the linguistic content and speaker identity.
We propose a novel framework based on a variational auto-encoding Wasserstein generative adversarial network (VAW-GAN).
We show that the proposed framework achieves remarkable performance by consistently outperforming the baseline framework.
arXiv Detail & Related papers (2020-10-28T07:16:18Z)
- Detecting Emotion Primitives from Speech and their use in discerning Categorical Emotions [16.886826928295203]
Emotion plays an essential role in human-to-human communication, enabling us to convey feelings such as happiness, frustration, and sincerity.
This work investigated how emotion primitives can be used to detect categorical emotions such as happiness, disgust, contempt, anger, and surprise from neutral speech.
Results indicated that arousal, followed by dominance, was a better detector of such emotions.
arXiv Detail & Related papers (2020-01-31T03:11:24Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.