EmoGator: A New Open Source Vocal Burst Dataset with Baseline Machine
Learning Classification Methodologies
- URL: http://arxiv.org/abs/2301.00508v2
- Date: Thu, 6 Apr 2023 15:25:44 GMT
- Title: EmoGator: A New Open Source Vocal Burst Dataset with Baseline Machine
Learning Classification Methodologies
- Authors: Fred W. Buhl
- Abstract summary: The EmoGator dataset consists of 32,130 samples from 357 speakers, totaling 16.9654 hours of audio.
Each sample was classified into one of 30 distinct emotion categories by the speaker.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vocal Bursts -- short, non-speech vocalizations that convey emotions, such as
laughter, cries, sighs, moans, and groans -- are an often-overlooked aspect of
speech emotion recognition, but an important aspect of human vocal
communication. One barrier to the study of these interesting vocalizations is a
lack of large datasets. I am pleased to introduce the EmoGator dataset, which
consists of 32,130 samples from 357 speakers, 16.9654 hours of audio; each
sample classified into one of 30 distinct emotion categories by the speaker.
Several different approaches to construct classifiers to identify emotion
categories will be discussed, and directions for future research will be
suggested. The dataset is available for download from
https://github.com/fredbuhl/EmoGator.
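The abstract does not fix a specific pipeline, so the following is only a minimal sketch of the kind of baseline classifier it alludes to: log-mel statistics fed to a logistic-regression model. The data/<emotion>/<clip>.wav layout, the 16 kHz sample rate, and the folder-name labels are illustrative assumptions, not the actual EmoGator repository structure.
```python
# Minimal baseline sketch for 30-way vocal-burst classification (illustrative;
# the directory layout and labeling scheme below are assumptions, not the
# actual EmoGator release format).
import glob
import os

import librosa
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def clip_features(path, sr=16000, n_mels=64):
    """Fixed-length feature vector: mean and std of log-mel bands over time."""
    audio, _ = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    return np.concatenate([log_mel.mean(axis=1), log_mel.std(axis=1)])


features, labels = [], []
for wav in glob.glob("data/*/*.wav"):  # hypothetical data/<emotion>/<clip>.wav layout
    features.append(clip_features(wav))
    labels.append(os.path.basename(os.path.dirname(wav)))  # folder name as label

X = np.stack(features)
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

clf = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```
A stronger baseline would swap the hand-crafted statistics for self-supervised speech embeddings, as several of the related papers below do.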
Related papers
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous
Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- A Hierarchical Regression Chain Framework for Affective Vocal Burst Recognition [72.36055502078193]
We propose a hierarchical framework, based on chain regression models, for affective recognition from vocal bursts.
To address the challenge of data sparsity, we also use self-supervised learning (SSL) representations with layer-wise and temporal aggregation modules (a rough sketch of this aggregation idea appears after the list).
The proposed systems participated in the ACII Affective Vocal Burst (A-VB) Challenge 2022 and ranked first in the "TWO" and "CULTURE" tasks.
arXiv Detail & Related papers (2023-03-14T16:08:45Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector (a toy sketch of this attribute-vector mixing appears after the list).
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Proceedings of the ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts [28.585851793516873]
ExVo 2022 included three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
The first, ExVo-MultiTask, requires participants to train a multi-task model to recognize expressed emotions and demographic traits from vocal bursts.
The second, ExVo-Generate, requires participants to train a generative model that produces vocal bursts conveying ten different emotions.
arXiv Detail & Related papers (2022-07-14T14:30:34Z)
- Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition [13.373579620368046]
We have created the VocalSound dataset, consisting of over 21,000 crowdsourced recordings of laughter, sighs, coughs, throat clearing, sneezes, and sniffs.
Experiments show that the vocal sound recognition performance of a model can be significantly improved, by 41.9%, by adding the VocalSound dataset to an existing dataset as training material.
arXiv Detail & Related papers (2022-05-06T18:08:18Z)
- The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts [28.585851793516873]
ExVo 2022 includes three competition tracks using a large-scale dataset of 59,201 vocalizations from 1,702 speakers.
This paper describes the three tracks and provides performance measures for baseline models using state-of-the-art machine learning strategies.
arXiv Detail & Related papers (2022-05-03T21:06:44Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
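As referenced in the A-VB chain-regression entry above, a common way to use SSL speech representations is a learned weighted sum over transformer layers followed by temporal pooling. The sketch below only illustrates that layer-wise and temporal aggregation idea, not the authors' system; the wav2vec 2.0 checkpoint name and the 10-way output head are assumptions.
```python
# Sketch of layer-wise + temporal aggregation of SSL features, in the spirit of
# the A-VB chain-regression entry above (not the authors' implementation).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model


class LayerwiseAggregator(nn.Module):
    """Weighted sum over transformer layers, then mean-pool over time."""

    def __init__(self, model_name="facebook/wav2vec2-base", n_outputs=10):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(model_name)
        n_layers = self.backbone.config.num_hidden_layers + 1  # + input embeddings
        self.layer_weights = nn.Parameter(torch.zeros(n_layers))
        self.head = nn.Linear(self.backbone.config.hidden_size, n_outputs)

    def forward(self, waveform):  # waveform: (batch, samples) raw audio
        out = self.backbone(waveform, output_hidden_states=True)
        stacked = torch.stack(out.hidden_states)        # (layers, batch, time, dim)
        w = torch.softmax(self.layer_weights, dim=0).view(-1, 1, 1, 1)
        fused = (w * stacked).sum(dim=0)                # layer-wise aggregation
        pooled = fused.mean(dim=1)                      # temporal aggregation
        return self.head(pooled)                        # e.g. per-emotion intensities


model = LayerwiseAggregator()
scores = model(torch.randn(2, 16000))  # two 1-second dummy clips at 16 kHz
print(scores.shape)                    # torch.Size([2, 10])
```
A chain-regression setup would additionally feed earlier task predictions into later heads; here a single linear head stands in for simplicity.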
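For the mixed-emotions entry above, the attribute-vector idea can be pictured as a convex blend of per-emotion embeddings that conditions a TTS decoder. The toy sketch below only illustrates the blending step; the emotion list, embedding size, and randomly initialized table are hypothetical (in the paper the embeddings would be learned and the mixture would drive a seq2seq TTS model).
```python
# Toy illustration of steering synthesis with a manually defined emotion
# attribute vector (hypothetical emotion set and embedding table; in practice
# the embeddings are learned and condition a seq2seq TTS decoder).
import torch
import torch.nn as nn

emotions = ["neutral", "happy", "sad", "angry", "surprise"]
embedding_table = nn.Embedding(len(emotions), 256)  # learned jointly with TTS in practice


def mixed_emotion_embedding(attribute_vector):
    """Blend per-emotion embeddings with user-chosen relative weights."""
    w = torch.tensor(attribute_vector, dtype=torch.float32)
    w = w / w.sum()  # normalize to a convex mixture
    all_embeddings = embedding_table(torch.arange(len(emotions)))  # (5, 256)
    return (w.unsqueeze(1) * all_embeddings).sum(dim=0)            # (256,)


# A run-time attribute vector: 70% happy, 30% surprise.
mix = mixed_emotion_embedding([0.0, 0.7, 0.0, 0.0, 0.3])
print(mix.shape)  # torch.Size([256])
# The TTS decoder would then be conditioned on `mix` (not shown here).
```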