Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
- URL: http://arxiv.org/abs/2401.07336v1
- Date: Sun, 14 Jan 2024 17:56:36 GMT
- Title: Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
- Authors: Zhu Ting, Li Liangqi, Duan Shufei, Zhang Xueying, Xiao Zhongzhe, Jia Hairong, Liang Huizhi
- Abstract summary: The validity of dimension annotation is verified by statistical analysis of dimension annotation data.
The average recognition rate of the seven emotions is about 82% when using acoustic data alone.
The database is of high quality and can be used as an important source for speech analysis research.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A multimodal emotional Mandarin speech database covering articulatory kinematics, acoustics, glottal signals, and facial micro-expressions is designed and established, and is described in detail in terms of corpus design, subject selection, recording procedure, and data processing. Signals are labeled with discrete emotion labels (neutral, happy, pleasant, indifferent, angry, sad, grief) and dimensional emotion labels (pleasure, arousal, dominance). The validity of the dimensional annotation is verified by statistical analysis of the annotation data. The annotators' SCL-90 scale data are validated and combined with the PAD annotation data to explore the relationship between outliers in the annotation and the psychological state of the annotators. To verify the speech quality and emotion discriminability of the database, three baseline models (SVM, CNN, and DNN) are used to compute recognition rates for the seven emotions. The results show that the average recognition rate over the seven emotions is about 82% when using acoustic data alone, about 72% when using glottal data alone, and 55.7% when using kinematic data alone. The database is therefore of high quality and can serve as an important resource for speech analysis research, especially for multimodal emotional speech analysis.
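
The dimensional-annotation analysis above lends itself to a simple illustration. The sketch below shows one plausible way to flag annotators whose average PAD ratings deviate strongly from the rest of the panel, the kind of outlier phenomenon the paper relates to the annotators' SCL-90 results. The file name, column names, rating scale, and threshold are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: flag annotators whose mean PAD ratings are outliers
# relative to the annotator pool. File/column names and the 2.5-sigma
# threshold are illustrative assumptions, not the paper's procedure.
import pandas as pd

# ratings.csv: one row per (annotator, utterance) with pleasure/arousal/dominance scores
ratings = pd.read_csv("ratings.csv")
per_annotator = ratings.groupby("annotator")[["pleasure", "arousal", "dominance"]].mean()

# z-score each annotator's mean rating per dimension across the annotator pool
z = (per_annotator - per_annotator.mean()) / per_annotator.std()
outliers = per_annotator[(z.abs() > 2.5).any(axis=1)]
print("Annotators with outlying PAD rating tendencies:")
print(outliers)
```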
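
To make the acoustic baseline concrete, here is a minimal sketch of a seven-class SVM classifier over utterance-level MFCC statistics, assuming librosa and scikit-learn. The feature set, kernel, and hyperparameters are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of an acoustic SVM baseline for the seven emotion categories.
# Feature choice (MFCC mean/std) and hyperparameters are assumptions for
# illustration, not the paper's reported setup.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

EMOTIONS = ["neutral", "happy", "pleasant", "indifferent", "angry", "sad", "grief"]

def acoustic_features(path: str) -> np.ndarray:
    """Utterance-level feature vector: mean and std of 13 MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# wav_paths and labels would come from the database's file lists (placeholders here).
# X = np.stack([acoustic_features(p) for p in wav_paths])
# y = np.array([EMOTIONS.index(lab) for lab in labels])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
# scores = cross_val_score(clf, X, y, cv=5)
# print(f"Mean 5-fold accuracy: {scores.mean():.3f}")
```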
Related papers
- Speech Emotion Detection Based on MFCC and CNN-LSTM Architecture [0.0]
This paper processes the initial audio input into a waveplot and spectrum for analysis and concentrates on multiple features, including MFCC, as targets for feature extraction.
The architecture achieved an overall accuracy of 61.07% on the test set, with the detection of anger and neutral reaching 75.31% and 71.70%, respectively. (An illustrative sketch of an MFCC + CNN-LSTM pipeline of this kind appears after this list.)
arXiv Detail & Related papers (2025-01-18T06:15:54Z)
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach, HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech [2.1455880234227624]
Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios.
We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app.
We evaluated speaker-independent SER models, using acoustic features as a baseline as well as transformer-based models.
arXiv Detail & Related papers (2024-03-04T16:13:39Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Design, construction and evaluation of emotional multimodal pathological speech database [8.774681418339155]
The first Chinese multimodal emotional pathological speech database containing multi-perspective information is constructed.
All emotional speech was labeled for intelligibility, emotion type, and discrete and dimensional emotions via a custom WeChat mini-program.
Automatic recognition was tested on speech and glottal data, with average accuracies of 78% for controls and 60% for patients on audio, and 51% for controls and 38% for patients on glottal data, indicating an influence of the disease on emotional expression.
arXiv Detail & Related papers (2023-12-14T14:43:31Z)
- SER_AMPEL: a multi-source dataset for speech emotion recognition of Italian older adults [58.49386651361823]
SER_AMPEL is a multi-source dataset for speech emotion recognition (SER).
It is collected with the aim of providing a reference for speech emotion recognition for Italian older adults.
The evidence of the need for such a dataset emerges from the analysis of the state of the art.
arXiv Detail & Related papers (2023-11-24T13:47:25Z)
- Feature Selection Enhancement and Feature Space Visualization for Speech-Based Emotion Recognition [2.223733768286313]
We present a speech feature enhancement strategy that improves speech emotion recognition.
The strategy is compared with state-of-the-art methods used in the literature.
Our method achieved an average recognition gain of 11.5% for six out of seven emotions on the EMO-DB dataset, and 13.8% for seven out of eight emotions on the RAVDESS dataset.
arXiv Detail & Related papers (2022-08-19T11:29:03Z)
- BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis [9.95713767110021]
The Body-Expression-Audio-Text (BEAT) dataset contains 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages.
BEAT is the largest motion-capture dataset for investigating human gestures.
arXiv Detail & Related papers (2022-03-10T11:19:52Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset on an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present Vyaktitv, a novel peer-to-peer Hindi conversation dataset.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
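
As referenced in the first entry of the list above, the following is a generic sketch of an MFCC + CNN-LSTM classifier of the kind that entry names; it is not that paper's actual model, and the frame length, layer sizes, and training settings are illustrative assumptions (librosa and tf.keras assumed available).

```python
# Generic MFCC + CNN-LSTM emotion classifier sketch; hyperparameters are
# illustrative assumptions, not taken from the cited paper.
import numpy as np
import librosa
import tensorflow as tf

NUM_CLASSES = 7      # e.g. seven emotion categories
N_MFCC = 40
MAX_FRAMES = 300     # pad/truncate every utterance to a fixed frame count

def mfcc_sequence(path: str) -> np.ndarray:
    """Return a (MAX_FRAMES, N_MFCC) MFCC matrix for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_FRAMES, N_MFCC)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(128),                     # summarize the frame sequence
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.1, epochs=30)  # X: (n, MAX_FRAMES, N_MFCC)
```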