Speech Emotion Diarization: Which Emotion Appears When?
- URL: http://arxiv.org/abs/2306.12991v2
- Date: Fri, 20 Oct 2023 11:31:26 GMT
- Title: Speech Emotion Diarization: Which Emotion Appears When?
- Authors: Yingzhi Wang, Mirco Ravanelli, Alya Yacoubi
- Abstract summary: We propose Speech Emotion Diarization (SED) to reflect the fine-grained nature of speech emotions.
Just as Speaker Diarization answers the question of "Who speaks when?", Speech Emotion Diarization answers the question of "Which emotion appears when?"
- Score: 11.84193589275529
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech Emotion Recognition (SER) typically relies on utterance-level
solutions. However, emotions conveyed through speech should be regarded as
discrete speech events with definite temporal boundaries, rather than as
attributes of the entire utterance. To reflect this fine-grained nature of
speech emotions, we propose a new task: Speech Emotion Diarization (SED). Just
as Speaker Diarization answers the question of "Who speaks when?", Speech
Emotion Diarization answers the question of "Which emotion appears when?". To
facilitate performance evaluation and establish a common benchmark for
researchers, we introduce the Zaion Emotion Dataset (ZED), an openly
accessible speech emotion dataset that includes non-acted emotions recorded in
real-life conditions, along with manually annotated boundaries of the emotion
segments within each utterance. We provide competitive baselines and open-source
the code and the pre-trained models.
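
To make the task's output format concrete, the sketch below shows how frame-wise emotion predictions can be collapsed into the segment-level output that diarization requires. This is an illustration, not the authors' implementation; the 50 Hz frame rate and the four-class label set are assumptions.

```python
# Minimal sketch: collapse frame-wise emotion labels into diarization
# segments of the form (start_s, end_s, emotion). The 50 Hz frame rate
# and the label names are illustrative assumptions, not from the paper.
from itertools import groupby

FRAME_RATE_HZ = 50  # hypothetical frames per second

def frames_to_segments(frame_labels):
    """Merge runs of identical frame labels into timed segments."""
    segments, t = [], 0
    for label, run in groupby(frame_labels):
        n = len(list(run))
        start, end = t / FRAME_RATE_HZ, (t + n) / FRAME_RATE_HZ
        segments.append((round(start, 2), round(end, 2), label))
        t += n
    return segments

# Example: 1 s of neutral, 0.5 s of anger, 0.5 s of neutral.
frames = ["neutral"] * 50 + ["angry"] * 25 + ["neutral"] * 25
print(frames_to_segments(frames))
# [(0.0, 1.0, 'neutral'), (1.0, 1.5, 'angry'), (1.5, 2.0, 'neutral')]
```

The segment list answers "Which emotion appears when?" directly, which is the output a SED system is evaluated on.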
Related papers
- Attention-based Interactive Disentangling Network for Instance-level Emotional Voice Conversion [81.1492897350032]
Emotional Voice Conversion aims to manipulate speech according to a given emotion while preserving the non-emotion components.
We propose an Attention-based Interactive diseNtangling Network (AINN) that leverages instance-wise emotional knowledge for voice conversion.
arXiv Detail & Related papers (2023-12-29T08:06:45Z)
- Emotion Rendering for Conversational Speech Synthesis with Heterogeneous Graph-Based Context Modeling [50.99252242917458]
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting.
To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity.
Our model outperforms the baseline models in understanding and rendering emotions.
arXiv Detail & Related papers (2023-12-19T08:47:50Z) - Where are We in Event-centric Emotion Analysis? Bridging Emotion Role
Labeling and Appraisal-based Approaches [10.736626320566707]
The term emotion analysis in text subsumes various natural language processing tasks.
We argue that emotions and events are related in two ways.
We discuss how to incorporate psychological appraisal theories in NLP models to interpret events.
arXiv Detail & Related papers (2023-09-05T09:56:29Z) - In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised
Representations and Neural Vocoder-based Resynthesis [15.16865739526702]
We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance.
We then use a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion.
Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion.
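As a rough illustration of the pipeline described above (not the authors' implementation), the sketch below wires together stand-in encoders and a stand-in vocoder; every component is a random projection used only to show the data flow of disentangle, swap the emotion representation, and resynthesize.

```python
# Schematic sketch of disentangle-and-resynthesize emotion conversion.
# All models here are random-projection stand-ins, not real SSL encoders
# or the HiFiGAN vocoder used in the paper.
import numpy as np

rng = np.random.default_rng(0)

def fake_encoder(audio, dim):
    """Stand-in for a self-supervised encoder producing frame features."""
    frames = audio.reshape(-1, 160)                # 10 ms hops at 16 kHz
    return frames @ rng.standard_normal((160, dim))

def fake_vocoder(features):
    """Stand-in for a neural vocoder mapping features back to a waveform."""
    return (features @ rng.standard_normal((features.shape[1], 160))).ravel()

def convert_emotion(source_audio, target_emotion_vec):
    lexical = fake_encoder(source_audio, 256)      # what is said
    speaker = fake_encoder(source_audio, 64)       # who says it
    # Swap in the target emotion, broadcast across all frames.
    emotion = np.tile(target_emotion_vec, (len(lexical), 1))
    return fake_vocoder(np.hstack([lexical, speaker, emotion]))

audio = rng.standard_normal(16000)                 # 1 s of toy 16 kHz audio
converted = convert_emotion(audio, rng.standard_normal(32))
print(converted.shape)                             # (16000,) waveform-shaped
```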
arXiv Detail & Related papers (2023-06-02T21:02:51Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Experiencer-Specific Emotion and Appraisal Prediction [13.324006587838523]
Emotion classification in NLP assigns emotions to texts, such as sentences or paragraphs.
We focus on the experiencers of events, and assign an emotion (if any holds) to each of them.
Our experiencer-aware models of emotions and appraisals outperform the experiencer-agnostic baselines.
arXiv Detail & Related papers (2022-10-21T16:04:27Z)
- Speech Synthesis with Mixed Emotions [77.05097999561298]
We propose a novel formulation that measures the relative difference between the speech samples of different emotions.
We then incorporate our formulation into a sequence-to-sequence emotional text-to-speech framework.
At run-time, we control the model to produce the desired emotion mixture by manually defining an emotion attribute vector.
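A hypothetical sketch of that run-time control is shown below: an emotion attribute vector weights a table of per-emotion embeddings to form a single conditioning vector. The embedding table and the normalization scheme are assumptions for illustration, not the paper's formulation.

```python
# Toy sketch: mix emotion embeddings with a manually defined attribute
# vector. The embedding table and how a TTS decoder consumes the result
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
EMOTIONS = ["neutral", "happy", "sad", "angry"]
embeddings = {e: rng.standard_normal(16) for e in EMOTIONS}

def mix_emotions(attribute_vector):
    """Weighted sum of emotion embeddings; weights normalized to sum to 1."""
    w = np.asarray(attribute_vector, dtype=float)
    w = w / w.sum()
    return sum(wi * embeddings[e] for wi, e in zip(w, EMOTIONS))

# 70% happy blended with 30% sad, as one conditioning vector.
conditioning = mix_emotions([0.0, 0.7, 0.3, 0.0])
print(conditioning.shape)  # (16,)
```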
arXiv Detail & Related papers (2022-08-11T15:45:58Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle speaker style from linguistic content and to encode it into a style embedding in a continuous space, which forms the prototype of the emotion embedding.
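One minimal way to picture explicit intensity control, under the assumption that intensity corresponds to interpolation between a neutral prototype and an emotion prototype, is sketched below; the paper's actual modeling may differ.

```python
# Sketch: intensity as interpolation between style prototypes.
# The prototypes here are random toy vectors; the interpolation scheme
# is an assumption for illustration.
import numpy as np

rng = np.random.default_rng(0)
neutral_proto = rng.standard_normal(16)  # toy neutral style prototype
angry_proto = rng.standard_normal(16)    # toy emotion prototype

def style_with_intensity(alpha):
    """alpha in [0, 1]: 0 = fully neutral, 1 = full target emotion."""
    return (1.0 - alpha) * neutral_proto + alpha * angry_proto

mild, strong = style_with_intensity(0.3), style_with_intensity(0.9)
```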
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset with an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
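The sketch below illustrates that described usage with a toy keyword matcher standing in for the learned text-to-emotion predictor; the lexicon, label set, and embedding table are hypothetical.

```python
# Toy sketch: predict an emotion label from text, look up its embedding,
# and use it to condition a TTS decoder. The keyword lexicon and the
# lookup table are stand-ins for the learned components.
import numpy as np

rng = np.random.default_rng(0)
LABELS = ["neutral", "happy", "sad", "angry"]
emotion_table = {l: rng.standard_normal(8) for l in LABELS}
KEYWORDS = {"great": "happy", "sorry": "sad", "furious": "angry"}

def predict_emotion(text):
    """Toy stand-in for a learned text-to-emotion classifier."""
    for word, label in KEYWORDS.items():
        if word in text.lower():
            return label
    return "neutral"

text = "I feel great about this release!"
label = predict_emotion(text)
conditioning = emotion_table[label]  # embedding fed to the TTS decoder
print(label, conditioning.shape)     # happy (8,)
```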
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.