Prompting Audios Using Acoustic Properties For Emotion Representation
- URL: http://arxiv.org/abs/2310.02298v3
- Date: Thu, 7 Dec 2023 03:22:52 GMT
- Title: Prompting Audios Using Acoustic Properties For Emotion Representation
- Authors: Hira Dhamyal, Benjamin Elizalde, Soham Deshmukh, Huaming Wang, Bhiksha Raj, Rita Singh
- Abstract summary: We propose the use of natural language descriptions (or prompts) to better represent emotions.
We use acoustic properties that are correlated with emotion, such as pitch, intensity, speech rate, and articulation rate, to automatically generate prompts.
Our results show that these acoustic prompts significantly improve the model's performance across various Precision@K metrics.
- Score: 36.275219004598874
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Emotions lie on a continuum, but current models treat emotion as a
finite-valued discrete variable. This representation does not capture the
diversity in the expression of emotion. To better represent emotions, we propose
the use of natural language descriptions (or prompts). In this work, we address
the challenge of automatically generating these prompts and training a model to
better learn emotion representations from audio and prompt pairs. We use
acoustic properties that are correlated with emotion, such as pitch, intensity,
speech rate, and articulation rate, to automatically generate prompts, i.e.,
'acoustic prompts'. We use a contrastive learning objective to map speech
samples to their respective acoustic prompts. We evaluate our model on Emotion
Audio Retrieval (EAR) and Speech Emotion Recognition (SER). Our results show
that the acoustic prompts significantly improve the model's performance on EAR
across various Precision@K metrics. For SER, we observe a 3.8% relative accuracy
improvement on the RAVDESS dataset.
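The sketch below illustrates the general recipe the abstract describes, under stated assumptions: an utterance's measured acoustic properties (pitch, intensity, speech rate, articulation rate) are binned and composed into a natural-language "acoustic prompt", audio and prompt embeddings are aligned with a symmetric contrastive (InfoNCE-style) objective, and retrieval is scored with Precision@K. The prompt template, the bin thresholds, and the helper names (bin_level, describe_audio, contrastive_loss, precision_at_k) are hypothetical placeholders; the paper's exact templates, encoders, and hyperparameters are not given in this abstract.

```python
# Minimal sketch, not the authors' released code. Thresholds, prompt wording,
# and function names below are illustrative assumptions.
import torch
import torch.nn.functional as F


def bin_level(value, low, high):
    """Map a scalar acoustic measurement to a coarse 'low'/'medium'/'high' label."""
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "medium"


def describe_audio(pitch_hz, intensity_db, speech_rate, articulation_rate):
    """Compose an 'acoustic prompt' from per-utterance acoustic properties.

    The bin boundaries here are made-up placeholders; in practice they would
    be derived from dataset statistics.
    """
    return (
        f"this person is speaking with {bin_level(pitch_hz, 120, 220)} pitch, "
        f"{bin_level(intensity_db, 55, 70)} intensity, "
        f"{bin_level(speech_rate, 3.5, 5.5)} speech rate and "
        f"{bin_level(articulation_rate, 4.0, 6.0)} articulation rate"
    )


def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss pairing each audio clip with its own acoustic prompt."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(audio_emb.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def precision_at_k(similarities, relevant, k):
    """Precision@K: fraction of the top-k retrieved items sharing the query's emotion."""
    topk = similarities.topk(k).indices
    return relevant[topk].float().mean().item()
```

At training time, describe_audio would be run over precomputed acoustic features to build the audio-prompt pairs, and contrastive_loss would be applied to batched embeddings from an audio encoder and a text encoder; precision_at_k then scores Emotion Audio Retrieval.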
Related papers
- AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
- Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets.
arXiv Detail & Related papers (2023-06-09T07:04:56Z)
- Describing emotions with acoustic property prompts for speech emotion recognition [30.990720176317463]
We devise a method to automatically create a description for a given audio by computing acoustic properties, such as pitch, loudness, speech rate, and articulation rate.
We train a neural network model using these audio-text pairs and evaluate it on an additional dataset.
We investigate how the model learns to associate audio with the descriptions, resulting in improved performance on Speech Emotion Recognition and Speech Audio Retrieval.
arXiv Detail & Related papers (2022-11-14T20:29:37Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e. StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
- x-vectors meet emotions: A study on dependencies between emotion and speaker recognition [38.181055783134006]
We show that knowledge learned for speaker recognition can be reused for emotion recognition through transfer learning.
For emotion recognition, we show that using a simple linear model is enough to obtain good performance on the features extracted from pre-trained models.
We present results on the effect of emotion on speaker verification.
arXiv Detail & Related papers (2020-02-12T15:13:07Z)
This list is automatically generated from the titles and abstracts of the papers on this site.