Exploring emotional prototypes in a high dimensional TTS latent space
- URL: http://arxiv.org/abs/2105.01891v1
- Date: Wed, 5 May 2021 06:49:21 GMT
- Title: Exploring emotional prototypes in a high dimensional TTS latent space
- Authors: Pol van Rijn, Silvan Mertes, Dominik Schiller, Peter M. C. Harrison,
Pauline Larrouy-Maestri, Elisabeth André, Nori Jacoby
- Abstract summary: We search the prosodic latent space in a trained GST Tacotron model to explore prototypes of emotional prosody.
We demonstrate that particular regions of the model's latent space are reliably associated with particular emotions.
- Score: 3.4404376509754506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent TTS systems are able to generate prosodically varied and realistic
speech. However, it is unclear how this prosodic variation contributes to the
perception of speakers' emotional states. Here we use the recent psychological
paradigm 'Gibbs Sampling with People' to search the prosodic latent space in a
trained GST Tacotron model to explore prototypes of emotional prosody.
Participants are recruited online and collectively manipulate the latent space
of the generative speech model in a sequentially adaptive way so that the
stimulus presented to one group of participants is determined by the response
of the previous groups. We demonstrate that (1) particular regions of the
model's latent space are reliably associated with particular emotions, (2) the
resulting emotional prototypes are well-recognized by a separate group of human
raters, and (3) these emotional prototypes can be effectively transferred to
new sentences. Collectively, these experiments demonstrate a novel approach to
the understanding of emotional speech by providing a tool to explore the
relation between the latent space of generative models and human semantics.
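To make the sequentially adaptive procedure concrete, below is a minimal sketch of a single 'Gibbs Sampling with People' chain. It is not the paper's implementation: `synthesize_stub` and `simulated_rater` are hypothetical stand-ins for the GST Tacotron synthesizer and the online participants, and the latent dimensionality, slider grid, and hidden prototype are illustrative assumptions only.

```python
import numpy as np

# Sketch of a Gibbs-Sampling-with-People (GSP) chain over a prosodic latent space:
# each trial varies ONE latent dimension while all others stay fixed, a participant
# picks the stimulus that best conveys the target emotion, and that choice conditions
# the stimulus presented in the next trial. All constants below are assumptions.

LATENT_DIM = 10          # assumed size of the prosodic latent space
SLIDER_POINTS = 25       # stimuli rendered along the free dimension per trial
LATENT_RANGE = (-1.0, 1.0)

def synthesize_stub(latent, text):
    # In the real experiment this would render audio with GST Tacotron conditioned
    # on `latent`; here the latent point itself stands in for the stimulus.
    return latent

def simulated_rater(stimuli, prototype):
    # Stand-in for a participant's slider response: choose the stimulus closest to a
    # hidden emotional prototype (real participants judge perceived emotion instead).
    dists = [np.linalg.norm(s - prototype) for s in stimuli]
    return int(np.argmin(dists))

def gsp_chain(prototype, text="The quick brown fox.", n_sweeps=5, seed=0):
    """Run one GSP chain: sequential, per-dimension updates driven by rater choices."""
    rng = np.random.default_rng(seed)
    latent = rng.uniform(*LATENT_RANGE, size=LATENT_DIM)     # random chain start
    grid = np.linspace(*LATENT_RANGE, SLIDER_POINTS)
    for _ in range(n_sweeps):                                 # full sweeps over dimensions
        for dim in range(LATENT_DIM):
            candidates = []
            for value in grid:                                # vary only the current dimension
                point = latent.copy()
                point[dim] = value
                candidates.append(synthesize_stub(point, text))
            choice = simulated_rater(candidates, prototype)   # next participant's response
            latent[dim] = grid[choice]                        # condition the chain on it
    return latent

if __name__ == "__main__":
    hidden_prototype = np.full(LATENT_DIM, 0.6)               # toy "angry" prototype
    estimate = gsp_chain(hidden_prototype)
    print(np.round(estimate - hidden_prototype, 3))           # chain should land nearby
```

The design point mirrors the abstract: because each participant's choice determines the stimuli shown to the next, the chain gradually drifts toward a region of the latent space reliably associated with the target emotion, which is what the paper identifies as an emotional prototype.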
Related papers
- MEMO-Bench: A Multiple Benchmark for Text-to-Image and Multimodal Large Language Models on Human Emotion Analysis [53.012111671763776]
This study introduces MEMO-Bench, a comprehensive benchmark consisting of 7,145 portraits, each depicting one of six different emotions.
Results demonstrate that existing T2I models are more effective at generating positive emotions than negative ones.
Although MLLMs show a certain degree of effectiveness in distinguishing and recognizing human emotions, they fall short of human-level accuracy.
arXiv Detail & Related papers (2024-11-18T02:09:48Z)
- Emotional Dimension Control in Language Model-Based Text-to-Speech: Spanning a Broad Spectrum of Human Emotions [37.075331767703986]
Current emotional text-to-speech systems face challenges in mimicking a broad spectrum of human emotions.
This paper proposes a TTS framework that facilitates control over pleasure, arousal, and dominance.
It can synthesize a diversity of emotional styles without requiring any emotional speech data during TTS training.
arXiv Detail & Related papers (2024-09-25T07:16:16Z)
- UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech.
EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information.
EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
- Dynamic Causal Disentanglement Model for Dialogue Emotion Detection [77.96255121683011]
We propose a Dynamic Causal Disentanglement Model based on hidden variable separation.
This model effectively decomposes the content of dialogues and investigates the temporal accumulation of emotions.
Specifically, we propose a dynamic temporal disentanglement model to infer the propagation of utterances and hidden variables.
arXiv Detail & Related papers (2023-09-13T12:58:09Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- Deep learning of segment-level feature representation for speech emotion recognition in conversations [9.432208348863336]
We propose a conversational speech emotion recognition method that captures attentive contextual dependencies and speaker-sensitive interactions.
First, we use a pretrained VGGish model to extract segment-based audio representation in individual utterances.
Second, an attentive bi-directional gated recurrent unit (GRU) models context-sensitive information and jointly explores intra- and inter-speaker dependencies.
arXiv Detail & Related papers (2023-02-05T16:15:46Z)
- Think Twice: A Human-like Two-stage Conversational Agent for Emotional Response Generation [16.659457455269127]
We propose a two-stage conversational agent for the generation of emotional dialogue.
First, a dialogue model trained without an emotion-annotated dialogue corpus generates a prototype response that fits the contextual semantics.
Second, a controllable emotion refiner modifies the first-stage prototype according to the empathy hypothesis.
arXiv Detail & Related papers (2023-01-12T10:03:56Z)
- Semi-supervised learning for continuous emotional intensity controllable speech synthesis with disentangled representations [16.524515747017787]
We propose a novel method to control the continuous intensity of emotions using semi-supervised learning.
The experimental results showed that the proposed method was superior in controllability and naturalness.
arXiv Detail & Related papers (2022-11-11T12:28:07Z)
- Bridging the prosody GAP: Genetic Algorithm with People to efficiently sample emotional prosody [1.2891210250935146]
'Genetic Algorithm with People' (GAP) integrates human decision and production into a genetic algorithm.
We demonstrate that GAP can efficiently sample from the emotional speech space and capture a broad range of emotions.
GAP is language-independent and supports large-scale crowd-sourcing, making it well suited to future cross-cultural research.
arXiv Detail & Related papers (2022-05-10T11:45:15Z)
- Data-driven emotional body language generation for social robotics [58.88028813371423]
In social robotics, endowing humanoid robots with the ability to generate bodily expressions of affect can improve human-robot interaction and collaboration.
We implement a deep learning data-driven framework that learns from a few hand-designed robotic bodily expressions.
The evaluation study found that the anthropomorphism and animacy of the generated expressions are not perceived differently from the hand-designed ones.
arXiv Detail & Related papers (2022-05-02T09:21:39Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotion speech dataset containing 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that need additional reference audio as input, our model can predict emotion labels from the input text alone and generate more expressive speech conditioned on the emotion embedding.
In the experiment phase, we first validate the effectiveness of our dataset with an emotion classification task. We then train our model on the proposed dataset and conduct a series of subjective evaluations.
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability [82.39099867188547]
Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years.
We propose a new interactive training paradigm for ETTS, denoted as i-ETTS.
We formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization.
arXiv Detail & Related papers (2021-04-03T13:52:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information shown and is not responsible for any consequences.