Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
- URL: http://arxiv.org/abs/2401.07336v1
- Date: Sun, 14 Jan 2024 17:56:36 GMT
- Title: Construction and Evaluation of Mandarin Multimodal Emotional Speech Database
- Authors: Zhu Ting, Li Liangqi, Duan Shufei, Zhang Xueying, Xiao Zhongzhe, Jia Hairong, Liang Huizhi
- Abstract summary: The validity of dimension annotation is verified by statistical analysis of dimension annotation data.
The average recognition rate of the seven emotions is about 82% when using acoustic data alone.
The database is of high quality and can be used as an important source for speech analysis research.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A multimodal emotional Mandarin speech database covering articulatory kinematics, acoustics, glottal signals, and facial micro-expressions is designed and established, and is described in detail in terms of corpus design, subject selection, recording procedure, and data processing. Signals are labeled with discrete emotion labels (neutral, happy, pleasant, indifferent, angry, sad, grief) and dimensional emotion labels (pleasure, arousal, dominance). The validity of the dimensional annotation is verified by statistical analysis of the annotation data. The annotators' SCL-90 scale data are validated and combined with the PAD annotation data to explore the relationship between outliers in the annotation and the psychological state of the annotators. To verify the speech quality and emotion discriminability of the database, three baseline models (SVM, CNN, and DNN) are used to compute recognition rates for the seven emotions. The results show that the average recognition rate over the seven emotions is about 82% when using acoustic data alone, about 72% when using glottal data alone, and 55.7% when using kinematic data alone. The database is therefore of high quality and can serve as an important resource for speech analysis research, especially for multimodal emotional speech analysis.
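
The dimensional-annotation analysis above lends itself to a simple illustration. The sketch below shows one plausible way to flag annotators whose average PAD ratings deviate strongly from the rest of the panel, the kind of outlier phenomenon the paper relates to the annotators' SCL-90 results. The file name, column names, rating scale, and threshold are hypothetical and not taken from the paper.

```python
# Hypothetical sketch: flag annotators whose mean PAD ratings are outliers
# relative to the annotator pool. File/column names and the 2.5-sigma
# threshold are illustrative assumptions, not the paper's procedure.
import pandas as pd

# ratings.csv: one row per (annotator, utterance) with pleasure/arousal/dominance scores
ratings = pd.read_csv("ratings.csv")
per_annotator = ratings.groupby("annotator")[["pleasure", "arousal", "dominance"]].mean()

# z-score each annotator's mean rating per dimension across the annotator pool
z = (per_annotator - per_annotator.mean()) / per_annotator.std()
outliers = per_annotator[(z.abs() > 2.5).any(axis=1)]
print("Annotators with outlying PAD rating tendencies:")
print(outliers)
```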
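
To make the acoustic baseline concrete, here is a minimal sketch of a seven-class SVM classifier over utterance-level MFCC statistics, assuming librosa and scikit-learn. The feature set, kernel, and hyperparameters are illustrative assumptions, not the configuration reported in the paper.

```python
# Minimal sketch of an acoustic SVM baseline for the seven emotion categories.
# Feature choice (MFCC mean/std) and hyperparameters are assumptions for
# illustration, not the paper's reported setup.
import numpy as np
import librosa
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score

EMOTIONS = ["neutral", "happy", "pleasant", "indifferent", "angry", "sad", "grief"]

def acoustic_features(path: str) -> np.ndarray:
    """Utterance-level feature vector: mean and std of 13 MFCCs."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# wav_paths and labels would come from the database's file lists (placeholders here).
# X = np.stack([acoustic_features(p) for p in wav_paths])
# y = np.array([EMOTIONS.index(lab) for lab in labels])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
# scores = cross_val_score(clf, X, y, cv=5)
# print(f"Mean 5-fold accuracy: {scores.mean():.3f}")
```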
Related papers
- Speech Emotion Detection Based on MFCC and CNN-LSTM Architecture [0.0]
This paper processes the initial audio input into a waveplot and spectrum for analysis and concentrates on multiple features, including MFCC, as targets for feature extraction.
The architecture achieved an overall accuracy of 61.07% on the test set, with the detection of anger and neutral reaching 75.31% and 71.70%, respectively. (An illustrative sketch of an MFCC + CNN-LSTM pipeline of this kind appears after this list.)
arXiv Detail & Related papers (2025-01-18T06:15:54Z)
- Leveraging Cross-Attention Transformer and Multi-Feature Fusion for Cross-Linguistic Speech Emotion Recognition [60.58049741496505]
Speech Emotion Recognition (SER) plays a crucial role in enhancing human-computer interaction.
We propose a novel approach, HuMP-CAT, which combines HuBERT, MFCC, and prosodic characteristics.
We show that, by fine-tuning the source model with a small portion of speech from the target datasets, HuMP-CAT achieves an average accuracy of 78.75%.
arXiv Detail & Related papers (2025-01-06T14:31:25Z)
- Enriching Multimodal Sentiment Analysis through Textual Emotional Descriptions of Visual-Audio Content [56.62027582702816]
Multimodal Sentiment Analysis seeks to unravel human emotions by amalgamating text, audio, and visual data.
Yet, discerning subtle emotional nuances within audio and video expressions poses a formidable challenge.
We introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions.
arXiv Detail & Related papers (2024-12-12T11:30:41Z)
- EMOVOME: A Dataset for Emotion Recognition in Spontaneous Real-Life Speech [2.1455880234227624]
Spontaneous datasets for Speech Emotion Recognition (SER) are scarce and frequently derived from laboratory environments or staged scenarios.
We developed and publicly released the Emotional Voice Messages (EMOVOME) dataset, including 999 voice messages from real conversations of 100 Spanish speakers on a messaging app.
We evaluated speaker-independent SER models, using acoustic features as a baseline as well as transformer-based models.
arXiv Detail & Related papers (2024-03-04T16:13:39Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Design, construction and evaluation of emotional multimodal pathological speech database [8.774681418339155]
The first Chinese multimodal emotional pathological speech database containing multi-perspective information is constructed.
All emotional speech was labeled for intelligibility, emotion type, and discrete and dimensional emotions via a custom WeChat mini-program.
Automatic recognition was tested on speech and glottal data, with average accuracies of 78% for controls and 60% for patients on audio, and 51% for controls and 38% for patients on glottal data, indicating an influence of the disease on emotional expression.
arXiv Detail & Related papers (2023-12-14T14:43:31Z)
- SER_AMPEL: a multi-source dataset for speech emotion recognition of Italian older adults [58.49386651361823]
SER_AMPEL is a multi-source dataset for speech emotion recognition (SER).
It is collected with the aim of providing a reference for speech emotion recognition for Italian older adults.
The evidence of the need for such a dataset emerges from the analysis of the state of the art.
arXiv Detail & Related papers (2023-11-24T13:47:25Z)
- Feature Selection Enhancement and Feature Space Visualization for Speech-Based Emotion Recognition [2.223733768286313]
We present a speech feature enhancement strategy that improves speech emotion recognition.
The strategy is compared with state-of-the-art methods used in the literature.
Our method achieved an average recognition gain of 11.5% for six out of seven emotions on the EMO-DB dataset, and 13.8% for seven out of eight emotions on the RAVDESS dataset.
arXiv Detail & Related papers (2022-08-19T11:29:03Z)
- BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis [9.95713767110021]
The Body-Expression-Audio-Text (BEAT) dataset contains 76 hours of high-quality, multi-modal data captured from 30 speakers talking with eight different emotions and in four different languages.
BEAT is the largest motion-capture dataset for investigating human gestures.
arXiv Detail & Related papers (2022-03-10T11:19:52Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset on an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Vyaktitv: A Multimodal Peer-to-Peer Hindi Conversations based Dataset for Personality Assessment [50.15466026089435]
We present Vyaktitv, a novel peer-to-peer Hindi conversation dataset.
It consists of high-quality audio and video recordings of the participants, with Hinglish textual transcriptions for each conversation.
The dataset also contains a rich set of socio-demographic features, such as income and cultural orientation, among several others, for all the participants.
arXiv Detail & Related papers (2020-08-31T17:44:28Z)
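
As referenced in the first entry of the list above, the following is a generic sketch of an MFCC + CNN-LSTM classifier of the kind that entry names; it is not that paper's actual model, and the frame length, layer sizes, and training settings are illustrative assumptions (librosa and tf.keras assumed available).

```python
# Generic MFCC + CNN-LSTM emotion classifier sketch; hyperparameters are
# illustrative assumptions, not taken from the cited paper.
import numpy as np
import librosa
import tensorflow as tf

NUM_CLASSES = 7      # e.g. seven emotion categories
N_MFCC = 40
MAX_FRAMES = 300     # pad/truncate every utterance to a fixed frame count

def mfcc_sequence(path: str) -> np.ndarray:
    """Return a (MAX_FRAMES, N_MFCC) MFCC matrix for one utterance."""
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC).T  # (frames, N_MFCC)
    if mfcc.shape[0] < MAX_FRAMES:
        mfcc = np.pad(mfcc, ((0, MAX_FRAMES - mfcc.shape[0]), (0, 0)))
    return mfcc[:MAX_FRAMES]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(MAX_FRAMES, N_MFCC)),
    tf.keras.layers.Conv1D(64, kernel_size=5, activation="relu"),
    tf.keras.layers.MaxPooling1D(pool_size=2),
    tf.keras.layers.LSTM(128),                     # summarize the frame sequence
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.1, epochs=30)  # X: (n, MAX_FRAMES, N_MFCC)
```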