Analysis of impact of emotions on target speech extraction and speech separation
- URL: http://arxiv.org/abs/2208.07091v1
- Date: Mon, 15 Aug 2022 09:47:13 GMT
- Title: Analysis of impact of emotions on target speech extraction and speech separation
- Authors: Ján Švec, Kateřina Žmolíková, Martin Kocour, Marc Delcroix, Tsubasa Ochiai, Ladislav Mošner, Jan Černocký
- Abstract summary: We investigate the influence of emotions on blind speech separation (BSS) and target speech extraction (TSE).
We observe that BSS is relatively robust to emotions, while TSE, which requires identifying and extracting the speech of a target speaker, is much more sensitive to emotions.
- Score: 30.06415464303977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, the performance of blind speech separation (BSS) and target speech
extraction (TSE) has greatly progressed. Most works, however, focus on
relatively well-controlled conditions using, e.g., read speech. The performance
may degrade in more realistic situations. One of the factors causing such
degradation may be intrinsic speaker variability, such as emotions, occurring
commonly in realistic speech. In this paper, we investigate the influence of
emotions on TSE and BSS. We create a new test dataset of emotional mixtures for
the evaluation of TSE and BSS. This dataset combines LibriSpeech and Ryerson
Audio-Visual Database of Emotional Speech and Song (RAVDESS). Through
controlled experiments, we can analyze the impact of different emotions on the
performance of BSS and TSE. We observe that BSS is relatively robust to
emotions, while TSE, which requires identifying and extracting the speech of a
target speaker, is much more sensitive to emotions. Through comparative speaker
verification experiments, we show that identifying the target speaker may be
particularly challenging when dealing with emotional speech. Using our
findings, we outline potential future directions that could improve the
robustness of BSS and TSE systems toward emotional speech.
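To make the mixture-creation and evaluation steps concrete, the sketch below builds a fully overlapped two-speaker mixture from a LibriSpeech utterance and a RAVDESS utterance at a chosen signal-to-interference ratio (SIR), then scores an estimate with SI-SDR, a common separation metric. The file names, the 0 dB SIR, the full-overlap mixing, and the use of SI-SDR are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np
import soundfile as sf

def mix_at_sir(target: np.ndarray, interference: np.ndarray, sir_db: float):
    """Fully overlap two mono utterances, scaling the interferer to a target SIR."""
    n = min(len(target), len(interference))            # truncate to the common length
    target, interference = target[:n], interference[:n]
    p_t = np.mean(target ** 2)                         # target power
    p_i = np.mean(interference ** 2)                   # interferer power
    gain = np.sqrt(p_t / (p_i * 10 ** (sir_db / 10)))  # so 10*log10(p_t / (gain^2 * p_i)) == sir_db
    return target + gain * interference, target

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB (Le Roux et al., 2019)."""
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    projection = alpha * reference                     # scaled reference component
    noise = estimate - projection                      # everything else counts as distortion
    return 10 * np.log10(np.sum(projection ** 2) / np.sum(noise ** 2))

# Hypothetical file names: a neutral LibriSpeech target and an emotional
# RAVDESS interferer, both assumed mono and at the same sampling rate.
target, sr = sf.read("librispeech_target.wav")
interferer, _ = sf.read("ravdess_angry_interferer.wav")
mixture, reference = mix_at_sir(target, interferer, sir_db=0.0)

estimate = mixture  # stand-in for a BSS/TSE system's output
print(f"SI-SDR of the unprocessed mixture: {si_sdr(estimate, reference):.2f} dB")
```

A BSS or TSE system would replace the `estimate` placeholder; per the abstract, one would expect TSE scores to degrade more than BSS scores as the speech in the mixture becomes more emotional.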
Related papers
- EmoNews: A Spoken Dialogue System for Expressive News Conversations [2.6036734225145133]
We develop a task-oriented spoken dialogue system (SDS) that regulates emotional speech based on contextual cues. We propose a subjective evaluation scale for emotional SDSs and judge the emotion-regulation performance of the proposed and baseline systems.
arXiv Detail & Related papers (2025-06-16T18:16:04Z)
- Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset [52.95197015472105]
EmoCorrector is a novel post-correction scheme for text-based speech editing. It extracts the edited text's emotional features, retrieves speech samples with matching emotions, and synthesizes speech that aligns with the desired emotion. EmoCorrector significantly enhances the expression of the intended emotion while addressing the emotion-inconsistency limitations of current text-based speech editing methods.
arXiv Detail & Related papers (2025-05-24T16:10:56Z)
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in mimicking a broad spectrum of human emotions.
This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance.
It can synthesize a diversity of emotional styles without requiring any emotional speech data during TTS training.
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
- Multiscale Contextual Learning for Speech Emotion Recognition in Emergency Call Center Conversations [4.297070083645049]
This paper presents a multi-scale conversational context learning approach for speech emotion recognition.
We investigated this approach on both speech transcriptions and acoustic segments.
According to our tests, the context derived from previous tokens has a more significant influence on accurate prediction than the context from the following tokens.
arXiv Detail & Related papers (2023-08-28T20:31:45Z)
- Learning Emotional Representations from Imbalanced Speech Data for Speech Emotion Recognition and Emotional Text-to-Speech [1.4986031916712106]
Speech emotional representations play a key role in Speech Emotion Recognition (SER) and Emotional Text-To-Speech (TTS) tasks.
Models might overfit to the majority Neutral class and fail to produce robust and effective emotional representations.
We use augmentation approaches to train the model and enable it to extract effective and generalizable emotional representations from imbalanced datasets.
arXiv Detail & Related papers (2023-06-09T07:04:56Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Accurate Emotion Strength Assessment for Seen and Unseen Speech Based on Data-Driven Deep Learning [70.30713251031052]
We propose a data-driven deep learning model, i.e., StrengthNet, to improve the generalization of emotion strength assessment for seen and unseen speech.
Experiments show that the predicted emotion strength of the proposed StrengthNet is highly correlated with ground truth scores for both seen and unseen speech.
arXiv Detail & Related papers (2022-06-15T01:25:32Z)
- Emotion Intensity and its Control for Emotional Voice Conversion [77.05097999561298]
Emotional voice conversion (EVC) seeks to convert the emotional state of an utterance while preserving the linguistic content and speaker identity.
In this paper, we aim to explicitly characterize and control the intensity of emotion.
We propose to disentangle the speaker style from linguistic content and encode the speaker style into a style embedding in a continuous space that forms the prototype of emotion embedding.
arXiv Detail & Related papers (2022-01-10T02:11:25Z)
- Emotional Prosody Control for Speech Generation [7.66200737962746]
We propose a text-to-speech (TTS) system where a user can choose the emotion of the generated speech from a continuous and meaningful emotion space.
The proposed TTS system can generate speech from the text in any speaker's style, with fine control of emotion.
arXiv Detail & Related papers (2021-11-07T08:52:04Z)
- E-ffective: A Visual Analytic System for Exploring the Emotion and Effectiveness of Inspirational Speeches [57.279044079196105]
E-ffective is a visual analytic system allowing speaking experts and novices to analyze both the role of speech factors and their contribution in effective speeches.
Two novel visualizations include E-spiral (which shows the emotional shifts in speeches in a visually compact way) and E-script (which connects speech content with key speech delivery information).
arXiv Detail & Related papers (2021-10-28T06:14:27Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset comprising 9,724 samples with audio files and human-labeled emotion annotations.
Unlike those models which need additional reference audio as input, our model could predict emotion labels just from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset by an emotion classification task. Then we train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)