Describing emotions with acoustic property prompts for speech emotion recognition
- URL: http://arxiv.org/abs/2211.07737v1
- Date: Mon, 14 Nov 2022 20:29:37 GMT
- Title: Describing emotions with acoustic property prompts for speech emotion recognition
- Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
- Abstract summary: We devise a method to automatically create a description for a given audio by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate.
We train a neural network model using these audio-text pairs and evaluate it on an additional dataset.
We investigate how the model learns to associate the audio with the descriptions, resulting in performance improvements in Speech Emotion Recognition and Speech Audio Retrieval.
- Score: 30.990720176317463
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Emotions lie on a broad continuum, and treating emotions as a discrete number of classes limits a model's ability to capture the nuances of that continuum. The challenge is how to describe these nuances and how to enable a model to learn the descriptions. In this work, we devise a method to automatically create a description (or prompt) for a given audio by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate. We pair a prompt with its corresponding audio using 5 different emotion datasets. We train a neural network model using these audio-text pairs and then evaluate it on an additional dataset. We investigate how the model learns to associate the audio with the descriptions, resulting in performance improvements in Speech Emotion Recognition and Speech Audio Retrieval. We expect our findings to motivate research on describing the broad continuum of emotion.
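To make the method concrete, below is a minimal sketch of the prompt-generation step: compute acoustic properties of an utterance and verbalize them as a description. The librosa-based feature choices, the thresholds, and the prompt wording are illustrative assumptions, not the authors' exact recipe; speech rate and articulation rate would additionally need syllable-level timing (e.g., from a forced aligner) and are omitted here.

```python
# Sketch: derive a natural language prompt from acoustic properties.
# Thresholds and wording are assumptions for illustration only.
import librosa
import numpy as np

def describe_audio(path, pitch_split_hz=170.0, loudness_split=0.05):
    y, sr = librosa.load(path, sr=16000)

    # Pitch: median fundamental frequency over voiced frames (pYIN).
    f0, _, _ = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
    median_f0 = np.nanmedian(f0)

    # Loudness proxy: mean RMS energy of the signal.
    rms = librosa.feature.rms(y=y).mean()

    # Bin the continuous values into coarse verbal levels.
    pitch_level = "high" if median_f0 > pitch_split_hz else "low"
    loudness_level = "loud" if rms > loudness_split else "soft"

    return f"The person is speaking with a {pitch_level} pitch in a {loudness_level} voice."

# Each (audio, prompt) pair produced this way could then serve as a
# training example for a contrastive audio-text model.
```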
Related papers
- EmotionCaps: Enhancing Audio Captioning Through Emotion-Augmented Data Generation [3.696171835644556]
We introduce EmotionCaps, an audio captioning dataset comprising approximately 120,000 audio clips paired with synthetic descriptions enriched with soundscape emotion recognition information.
Our findings challenge current approaches to captioning and suggest new directions for developing and assessing captioning models.
arXiv Detail & Related papers (2024-10-15T19:57:37Z)
- Speech Emotion Recognition Using CNN and Its Use Case in Digital Healthcare [0.0]
The process of identifying human emotion and affective states from speech is known as speech emotion recognition (SER).
My research seeks to use a convolutional neural network (CNN) to distinguish emotions in audio recordings and label them according to a range of different emotions.
I developed a machine learning model that identifies emotions from supplied audio files.
arXiv Detail & Related papers (2024-06-15T21:33:03Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z)
- Prompting Audios Using Acoustic Properties For Emotion Representation [36.275219004598874]
We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts.
Our results show that the acoustic prompts significantly improve the model's performance on various Precision@K metrics (a minimal Precision@K sketch appears after this list).
arXiv Detail & Related papers (2023-10-03T13:06:58Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Textless Speech Emotion Conversion using Decomposed and Discrete Representations [49.55101900501656]
We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder (a high-level sketch of this pipeline appears after this list).
arXiv Detail & Related papers (2021-11-14T18:16:42Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset including 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model predicts emotion labels from the input text alone and generates more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset via an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Modality-Transferable Emotion Embeddings for Low-Resource Multimodal Emotion Recognition [55.44502358463217]
We propose a modality-transferable model with emotion embeddings to tackle the aforementioned issues.
Our model achieves state-of-the-art performance on most of the emotion categories.
Our model also outperforms existing baselines in the zero-shot and few-shot scenarios for unseen emotions.
arXiv Detail & Related papers (2020-09-21T06:10:39Z)
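The Prompting Audios entry above reports retrieval quality with Precision@K. Below is a minimal, generic sketch of that metric for audio-text retrieval; the ids and relevance sets are invented for illustration and do not come from the paper.

```python
# Generic Precision@K: the fraction of the top-K retrieved items that
# are relevant to the query. Data below is illustrative only.
def precision_at_k(ranked_ids, relevant_ids, k):
    top_k = ranked_ids[:k]
    hits = sum(1 for item in top_k if item in relevant_ids)
    return hits / k

# Example: 3 of the top 5 retrieved audios share the query's emotion.
ranked = ["a12", "a07", "a33", "a41", "a02"]
relevant = {"a12", "a33", "a02", "a56"}
print(precision_at_k(ranked, relevant, k=5))  # 0.6
```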
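The Textless Speech Emotion Conversion entry describes a decompose-translate-vocode pipeline. The sketch below only illustrates that data flow; every component here is a hypothetical placeholder object, not the paper's actual models or API.

```python
# Hypothetical data flow for textless emotion conversion, following the
# description above. All components are placeholders.
def convert_emotion(waveform, target_emotion, encoder, translator,
                    prosody_predictor, vocoder):
    # 1. Decompose speech into discrete content units and a speaker vector.
    content_units = encoder.encode(waveform)
    speaker = encoder.speaker_embedding(waveform)

    # 2. Translate the content units toward the target emotion
    #    (a sequence-to-sequence mapping over unit ids).
    emotional_units = translator.translate(content_units, target_emotion)

    # 3. Predict prosodic features (e.g., F0) from the translated units.
    f0 = prosody_predictor.predict(emotional_units, target_emotion)

    # 4. Feed units, prosody, speaker, and emotion into a neural vocoder.
    return vocoder.synthesize(emotional_units, f0, speaker, target_emotion)
```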