Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach
for Speech Emotion Recognition
- URL: http://arxiv.org/abs/2211.08233v3
- Date: Mon, 14 Aug 2023 11:50:53 GMT
- Title: Temporal Modeling Matters: A Novel Temporal Emotional Modeling Approach
for Speech Emotion Recognition
- Authors: Jiaxin Ye, Xin-cheng Wen, Yujie Wei, Yong Xu, Kunhong Liu, Hongming
Shan
- Abstract summary: Speech emotion recognition (SER) plays a vital role in improving interactions between humans and machines.
We introduce a novel temporal emotional modeling approach for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net).
- Score: 23.13759265661777
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion recognition (SER) plays a vital role in improving the
interactions between humans and machines by inferring human emotion and
affective states from speech signals. Whereas recent works primarily focus on
mining spatiotemporal information from hand-crafted features, we explore how to
model the temporal patterns of speech emotions from dynamic temporal scales.
Towards that goal, we introduce a novel temporal emotional modeling approach
for SER, termed Temporal-aware bI-direction Multi-scale Network (TIM-Net),
which learns multi-scale contextual affective representations from various time
scales. Specifically, TIM-Net first employs temporal-aware blocks to learn
temporal affective representation, then integrates complementary information
from the past and the future to enrich contextual representations, and finally,
fuses multiple time scale features for better adaptation to the emotional
variation. Extensive experimental results on six benchmark SER datasets
demonstrate the superior performance of TIM-Net, which gains average UAR and WAR
improvements of 2.34% and 2.61%, respectively, over the second-best method on each corpus.
The source code is available at https://github.com/Jiaxin-Ye/TIM-Net_SER.
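To make the three stages described in the abstract concrete, the following is a minimal PyTorch-style sketch of the general idea: dilated temporal-aware convolution blocks, a pass over both the original and the time-reversed feature sequence, and learnable fusion of features from several time scales. The layer sizes, the sigmoid gating, and the softmax fusion weights are illustrative assumptions rather than the authors' released implementation, which is available at the GitHub link above.

# Minimal sketch of a TIM-Net-style model: temporal-aware blocks, a
# bi-directional pass, and learnable multi-scale fusion. Details are assumptions,
# not the authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalAwareBlock(nn.Module):
    """Dilated causal 1-D convolution with a sigmoid gate (assumed design)."""

    def __init__(self, channels: int, kernel_size: int, dilation: int):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left padding keeps the conv causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.gate = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                              # x: (batch, channels, time)
        xp = F.pad(x, (self.pad, 0))                   # causal left padding
        h = torch.tanh(self.conv(xp)) * torch.sigmoid(self.gate(xp))
        return x + h                                   # residual connection


class BiMultiScaleNet(nn.Module):
    """Stacks blocks with growing dilation, runs them on the sequence and its
    time-reversed copy, and fuses per-scale features with learned weights."""

    def __init__(self, in_dim=39, channels=64, num_scales=4, num_classes=6):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, kernel_size=1)
        self.blocks = nn.ModuleList(
            TemporalAwareBlock(channels, kernel_size=2, dilation=2 ** i)
            for i in range(num_scales)
        )
        self.scale_weights = nn.Parameter(torch.zeros(num_scales))
        self.classifier = nn.Linear(2 * channels, num_classes)

    def _encode(self, x):
        feats, h = [], self.proj(x)
        for block in self.blocks:
            h = block(h)
            feats.append(h.mean(dim=-1))               # temporal average at each scale
        w = torch.softmax(self.scale_weights, dim=0)   # learned fusion weights
        return sum(wi * fi for wi, fi in zip(w, feats))

    def forward(self, x):                              # x: (batch, in_dim, time), e.g. MFCCs
        fwd = self._encode(x)
        bwd = self._encode(torch.flip(x, dims=[2]))    # complementary past/future context
        return self.classifier(torch.cat([fwd, bwd], dim=-1))


if __name__ == "__main__":
    model = BiMultiScaleNet()
    mfcc = torch.randn(8, 39, 300)                     # 8 utterances, 39-dim MFCC, 300 frames
    print(model(mfcc).shape)                           # -> torch.Size([8, 6])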
Related papers
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z) - Attention-based Interactive Disentangling Network for Instance-level
Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotional components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z) - Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - DWFormer: Dynamic Window transFormer for Speech Emotion Recognition [16.07391331544217]
We propose Dynamic Window transFormer (DWFormer) to locate important regions at different temporal scales.
DWFormer is evaluated on both the IEMOCAP and the MELD datasets.
Experimental results show that the proposed model achieves better performance than the previous state-of-the-art methods.
arXiv Detail & Related papers (2023-03-03T03:26:53Z) - GM-TCNet: Gated Multi-scale Temporal Convolutional Network using Emotion
Causality for Speech Emotion Recognition [14.700043991797537]
We propose a Gated Multi-scale Temporal Convolutional Network (GM-TCNet) that deploys a novel emotional causality representation learning component to capture the dynamics of emotion across the time domain.
Our model maintains the highest performance in most cases compared to state-of-the-art techniques.
arXiv Detail & Related papers (2022-10-28T02:00:40Z) - MSA-GCN: Multiscale Adaptive Graph Convolution Network for Gait Emotion
Recognition [6.108523790270448]
We present a novel Multiscale Adaptive Graph Convolution Network (MSA-GCN) to recognize emotions.
In our model, an adaptive selective spatial-temporal convolution is designed to dynamically select the convolution kernel and obtain the soft-temporal features of different emotions.
Compared with previous state-of-the-art methods, the proposed method achieves the best performance on two public datasets.
arXiv Detail & Related papers (2022-09-19T13:07:16Z) - Dilated Context Integrated Network with Cross-Modal Consensus for
Temporal Emotion Localization in Videos [128.70585652795637]
Temporal emotion localization (TEL) presents three unique challenges compared to temporal action localization: emotions have extremely varied temporal dynamics, and the fine-grained temporal annotations are complicated and labor-intensive to obtain.
arXiv Detail & Related papers (2022-08-03T10:00:49Z) - Multimodal Emotion Recognition using Transfer Learning from Speaker
Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses a late fusion of transfer-learned and fine-tuned models from the speech and text modalities.
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset; a minimal late-fusion sketch appears after this list.
arXiv Detail & Related papers (2022-02-16T00:23:42Z) - Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z) - EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional
Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z) - Temporal aggregation of audio-visual modalities for emotion recognition [0.5352699766206808]
We propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality.
Our proposed method outperforms other methods from the literature as well as human accuracy ratings.
arXiv Detail & Related papers (2020-07-08T18:44:15Z)
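The late-fusion idea referenced in the multimodal transfer-learning entry above can be illustrated with a short sketch. It assumes utterance-level speech and text embeddings coming from frozen pre-trained encoders (for example a speaker-recognition model and BERT), and trains only a small fusion head; the dimensions and the two-layer head are illustrative assumptions, not the authors' architecture.

# Minimal late-fusion sketch: concatenate precomputed speech and text embeddings
# and classify with a small trainable head. Sizes are assumptions.
import torch
import torch.nn as nn


class LateFusionHead(nn.Module):
    def __init__(self, speech_dim=512, text_dim=768, hidden=256, num_classes=4):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(speech_dim + text_dim, hidden),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, speech_emb, text_emb):
        # Late fusion: join the two modality embeddings, then classify.
        return self.head(torch.cat([speech_emb, text_emb], dim=-1))


if __name__ == "__main__":
    head = LateFusionHead()
    speech = torch.randn(16, 512)    # e.g. pooled speaker-encoder embeddings
    text = torch.randn(16, 768)      # e.g. BERT [CLS] embeddings
    print(head(speech, text).shape)  # -> torch.Size([16, 4])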