Speech and Text-Based Emotion Recognizer
- URL: http://arxiv.org/abs/2312.11503v1
- Date: Sun, 10 Dec 2023 05:17:39 GMT
- Title: Speech and Text-Based Emotion Recognizer
- Authors: Varun Sharma
- Abstract summary: We build a balanced corpus from publicly available datasets for speech emotion recognition.
Our best system, a multimodal speech- and text-based model, achieves a combined UA (Unweighted Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the baseline algorithm.
- Score: 0.9168634432094885
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Affective computing is a field of study that focuses on developing systems
and technologies that can understand, interpret, and respond to human emotions.
Speech Emotion Recognition (SER), in particular, has received considerable attention
from researchers in recent years. However, the publicly available datasets used for
training and evaluation are often scarce and imbalanced across emotion labels. In this
work, we focused on building a balanced corpus by combining these publicly available
datasets and by employing various speech data augmentation techniques. Furthermore, we
experimented with different architectures for speech emotion recognition. Our best
system, a multimodal speech- and text-based model, achieves a combined UA (Unweighted
Accuracy) + WA (Weighted Accuracy) score of 157.57, compared to 119.66 for the
baseline algorithm.
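For readers unfamiliar with the reported metric: UA + WA sums two scores commonly used in SER, where UA (unweighted accuracy) is conventionally the macro-average of per-class recall and WA (weighted accuracy) is plain overall accuracy, so a perfect system would reach 200. The sketch below is an illustration of these conventional definitions, not the authors' evaluation code; the label set in the toy example is hypothetical.

```python
# Minimal sketch of a UA + WA score, assuming integer class labels.
from sklearn.metrics import accuracy_score, recall_score

def ua_wa_score(y_true, y_pred):
    """Return (UA, WA, UA + WA), each expressed as a percentage.

    UA (unweighted accuracy): macro-averaged per-class recall, so every
    emotion class counts equally regardless of how many samples it has.
    WA (weighted accuracy): plain overall accuracy, so frequent classes
    dominate. A perfect system would score UA + WA = 200.
    """
    ua = 100.0 * recall_score(y_true, y_pred, average="macro")
    wa = 100.0 * accuracy_score(y_true, y_pred)
    return ua, wa, ua + wa

if __name__ == "__main__":
    # Toy example: 0 = neutral, 1 = happy, 2 = sad, 3 = angry (illustrative only).
    y_true = [0, 0, 0, 1, 2, 3]
    y_pred = [0, 0, 1, 1, 2, 3]
    ua, wa, combined = ua_wa_score(y_true, y_pred)
    print(f"UA={ua:.2f}  WA={wa:.2f}  UA+WA={combined:.2f}")
```

Reporting the sum keeps both the class-balanced and the support-weighted view in a single number, which matters here because the emotion labels in the source datasets are imbalanced.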
Related papers
- Improving Speech-based Emotion Recognition with Contextual Utterance Analysis and LLMs [2.8728982844941178]
Speech Emotion Recognition (SER) focuses on identifying emotional states from spoken language.
We propose a novel approach that first refines all available transcriptions to ensure data reliability.
We then segment each complete conversation into smaller dialogues and use these dialogues as context to predict the emotion of the target utterance within the dialogue.
arXiv Detail & Related papers (2024-10-27T04:23:34Z)
- Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare [0.0]
The process of identifying human emotions and affective states from speech is known as speech emotion recognition (SER).
My research uses a Convolutional Neural Network (CNN) to distinguish emotions in audio recordings and label them according to a range of emotion categories.
I have developed a machine learning model that identifies emotions from supplied audio files.
arXiv Detail & Related papers (2024-06-15T21:33:03Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform are promising candidates for such a model.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Self-Supervised Speech Representation Learning: A Review [105.1545308184483]
Self-supervised representation learning methods promise a single universal model that would benefit a wide variety of tasks and domains.
Speech representation learning is experiencing similar progress in three main categories: generative, contrastive, and predictive methods.
This review presents approaches for self-supervised speech representation learning and their connection to other research areas.
arXiv Detail & Related papers (2022-05-21T16:52:57Z)
- Multimodal Emotion Recognition using Transfer Learning from Speaker Recognition and BERT-based models [53.31917090073727]
We propose a neural network-based emotion recognition framework that uses late fusion of transfer-learned and fine-tuned models from the speech and text modalities (a minimal late-fusion sketch appears after this list).
We evaluate the effectiveness of our proposed multimodal approach on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
arXiv Detail & Related papers (2022-02-16T00:23:42Z)
- Multimodal Emotion Recognition with High-level Speech and Text Features [8.141157362639182]
We propose a novel cross-representation speech model to perform emotion recognition on wav2vec 2.0 speech features.
We also train a CNN-based model to recognize emotions from text features extracted with Transformer-based models.
Our method is evaluated on the IEMOCAP dataset in a 4-class classification problem.
arXiv Detail & Related papers (2021-09-29T07:08:40Z)
- Dialog speech sentiment classification for imbalanced datasets [7.84604505907019]
In this paper, we use single- and bi-modal analysis of short dialog utterances and gain insights into the main factors that aid sentiment detection.
We propose an architecture that uses a learning rate scheduler and different monitoring criteria, and provides state-of-the-art results on the SWITCHBOARD imbalanced sentiment dataset.
arXiv Detail & Related papers (2021-09-15T11:43:04Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotional speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset through an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Temporal aggregation of audio-visual modalities for emotion recognition [0.5352699766206808]
We propose a multimodal fusion technique for emotion recognition based on combining audio-visual modalities from a temporal window with different temporal offsets for each modality.
Our proposed method outperforms other methods from the literature as well as human accuracy ratings.
arXiv Detail & Related papers (2020-07-08T18:44:15Z)
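Several entries above (the late-fusion framework and the wav2vec 2.0 + text-feature model) combine a speech encoder with a text encoder. The following is a rough, hypothetical illustration of late fusion, not the architecture of any paper listed here: it assumes precomputed utterance-level embeddings from a pretrained speech model (e.g. wav2vec 2.0) and a pretrained text model (e.g. BERT), and all dimensions and class counts are placeholders.

```python
# Hypothetical late-fusion sketch for multimodal (speech + text) emotion
# recognition. The two projection heads stand in for pretrained speech and
# text encoders whose utterance-level embeddings are computed offline.
import torch
import torch.nn as nn

class LateFusionSER(nn.Module):
    def __init__(self, speech_dim=768, text_dim=768, hidden_dim=256, num_emotions=4):
        super().__init__()
        # Per-modality projections applied to the precomputed embeddings.
        self.speech_head = nn.Sequential(nn.Linear(speech_dim, hidden_dim), nn.ReLU())
        self.text_head = nn.Sequential(nn.Linear(text_dim, hidden_dim), nn.ReLU())
        # Fusion: concatenate the two modality representations and classify.
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, speech_emb, text_emb):
        fused = torch.cat([self.speech_head(speech_emb), self.text_head(text_emb)], dim=-1)
        return self.classifier(fused)  # emotion logits

# Usage with dummy utterance-level embeddings (batch of 2 utterances).
model = LateFusionSER()
speech_emb = torch.randn(2, 768)  # e.g. mean-pooled wav2vec 2.0 features
text_emb = torch.randn(2, 768)    # e.g. BERT [CLS] embedding of the transcript
logits = model(speech_emb, text_emb)
print(logits.shape)  # torch.Size([2, 4])
```

Fusing at this late stage lets each modality be pretrained and fine-tuned independently, which is the main practical appeal of late fusion over joint end-to-end training.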
This list is automatically generated from the titles and abstracts of the papers on this site.