Textless Speech Emotion Conversion using Decomposed and Discrete
Representations
- URL: http://arxiv.org/abs/2111.07402v1
- Date: Sun, 14 Nov 2021 18:16:42 GMT
- Title: Textless Speech Emotion Conversion using Decomposed and Discrete
Representations
- Authors: Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh
Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel
Dupoux, Yossi Adi
- Abstract summary: We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
- Score: 49.55101900501656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion conversion is the task of modifying the perceived emotion of a
speech utterance while preserving the lexical content and speaker identity. In
this study, we cast the problem of emotion conversion as a spoken language
translation task. We decompose speech into discrete and disentangled learned
representations, consisting of content units, F0, speaker, and emotion. First,
we modify the speech content by translating the content units to a target
emotion, and then predict the prosodic features based on these units. Finally,
the speech waveform is generated by feeding the predicted representations into
a neural vocoder. Such a paradigm allows us to go beyond spectral and
parametric changes of the signal, and model non-verbal vocalizations, such as
laughter insertion, yawning removal, etc. We demonstrate objectively and
subjectively that the proposed method is superior to the baselines in terms of
perceived emotion and audio quality. We rigorously evaluate all components of
such a complex system and conclude with an extensive model analysis and
ablation study to better emphasize the architectural choices, strengths and
weaknesses of the proposed method. Samples and code will be publicly available
under the following link: https://speechbot.github.io/emotion.
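A minimal sketch of the three-stage pipeline described in the abstract, using toy stand-ins for the unit translation and prosody models. Module names, sizes, and the greedy decoding are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class UnitTranslator(nn.Module):
    """Toy stand-in for the unit-to-unit translation step (cast as a translation task in the paper)."""
    def __init__(self, n_units=100, n_emotions=4, dim=64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.emo_emb = nn.Embedding(n_emotions, dim)
        self.proj = nn.Linear(dim, n_units)

    def forward(self, units, emotion):
        h = self.unit_emb(units) + self.emo_emb(emotion)[:, None, :]
        return self.proj(h).argmax(-1)          # "translated" discrete unit ids

class ProsodyPredictor(nn.Module):
    """Toy stand-in predicting per-unit prosody (F0 and, as an assumption here, duration)."""
    def __init__(self, n_units=100, n_emotions=4, dim=64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.emo_emb = nn.Embedding(n_emotions, dim)
        self.head = nn.Linear(dim, 2)           # (log-F0, log-duration) per unit

    def forward(self, units, emotion):
        h = self.unit_emb(units) + self.emo_emb(emotion)[:, None, :]
        return self.head(h).unbind(-1)

# Dummy usage: one utterance of 50 discrete content units, target emotion id 2.
units = torch.randint(0, 100, (1, 50))          # content units from a self-supervised encoder (illustrative)
emotion = torch.tensor([2])
translated = UnitTranslator()(units, emotion)
f0, duration = ProsodyPredictor()(translated, emotion)
# The real system feeds the translated units, predicted prosody, speaker, and
# emotion representations into a neural vocoder to generate the output waveform.
```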
Related papers
- AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect
Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
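The summary does not give AffectEcho's codebook details; as a generic illustration of modeling emotion within a quantized space, a nearest-neighbour vector-quantization lookup might look like this (codebook size and dimensionality are assumptions):

```python
import torch
import torch.nn as nn

class EmotionVQ(nn.Module):
    """Generic VQ lookup: snap a continuous affect embedding to its nearest codebook entry."""
    def __init__(self, codebook_size=64, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, affect_embedding):                      # (batch, dim)
        dists = torch.cdist(affect_embedding, self.codebook.weight)
        indices = dists.argmin(dim=-1)                        # nearest code per example
        return self.codebook(indices), indices                # quantized emotion, code ids

quantized_emotion, codes = EmotionVQ()(torch.randn(2, 128))   # dummy affect embeddings
```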
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
- Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data [0.0]
We propose a method for speech-to-speech emotion translation that operates at the level of discrete speech units.
We show that this embedding can be used to predict the pitch and duration of speech units in a target language.
We evaluate our approach on English and French speech and show that it outperforms a baseline method.
arXiv Detail & Related papers (2023-06-29T08:06:54Z)
- In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis [15.16865739526702]
We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance.
We then use a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion.
Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion.
arXiv Detail & Related papers (2023-06-02T21:02:51Z)
- SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing [17.128885611538486]
Paralinguistic speech processing is important for many applications, such as sentiment analysis and neurocognitive disorder assessment.
We consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing.
SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks.
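SpeechFormer++'s exact merging rules and attention design are not reproduced here; the sketch below only illustrates the general idea of a structure-based hierarchy that encodes frames and progressively merges them toward coarser units (pooling sizes and dimensions are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Generic structure-based hierarchy: encode frames, then merge neighbouring
    tokens between stages to move toward coarser (phoneme/word-like) granularity."""
    def __init__(self, dim=64, pool_sizes=(5, 4)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(len(pool_sizes) + 1)
        )
        self.pools = nn.ModuleList(nn.AvgPool1d(k, stride=k) for k in pool_sizes)

    def forward(self, frames):                                # (batch, time, dim)
        x = self.stages[0](frames)
        for stage, pool in zip(self.stages[1:], self.pools):
            x = pool(x.transpose(1, 2)).transpose(1, 2)       # merge neighbouring frames
            x = stage(x)
        return x.mean(dim=1)                                  # utterance-level vector

utterance_repr = HierarchicalEncoder()(torch.randn(2, 200, 64))  # dummy frame features
```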
arXiv Detail & Related papers (2023-02-27T11:48:54Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotional speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset on an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
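A minimal sketch of the described idea of predicting an emotion label from text alone and conditioning synthesis on the resulting emotion embedding; the text encoder, vocabulary, and emotion inventory below are illustrative assumptions, not the paper's model:

```python
import torch
import torch.nn as nn

class TextEmotionPredictor(nn.Module):
    """Toy stand-in: classify the emotion of input text and return a soft emotion
    embedding that could condition an expressive TTS decoder."""
    def __init__(self, vocab_size=1000, n_emotions=5, dim=64):
        super().__init__()
        self.text_enc = nn.EmbeddingBag(vocab_size, dim)     # bag-of-tokens text encoder
        self.classifier = nn.Linear(dim, n_emotions)
        self.emo_table = nn.Embedding(n_emotions, dim)

    def forward(self, token_ids):                            # (batch, seq_len)
        probs = self.classifier(self.text_enc(token_ids)).softmax(-1)
        emotion_embedding = probs @ self.emo_table.weight    # (batch, dim)
        return probs, emotion_embedding

# Dummy usage: two "sentences" of 12 token ids each.
probs, emo_embedding = TextEmotionPredictor()(torch.randint(0, 1000, (2, 12)))
# emo_embedding would be injected into the TTS model to control expressiveness.
```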
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
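A hedged sketch of a generic two-stage strategy of the kind described: pretrain a conversion model on abundant TTS-style data, then fine-tune on a small emotional corpus. The stand-in model, loss, and random tensors are placeholders, not the paper's sequence-to-sequence architecture:

```python
import torch
import torch.nn as nn

# Stand-in frame-level converter: maps source mel frames to target mel frames.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l1 = nn.L1Loss()

def run_stage(batches, epochs):
    for _ in range(epochs):
        for src_mel, tgt_mel in batches:
            loss = l1(model(src_mel), tgt_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: initialization on abundant (non-emotional) TTS-style data.
tts_batches = [(torch.randn(8, 100, 80), torch.randn(8, 100, 80)) for _ in range(4)]
run_stage(tts_batches, epochs=2)

# Stage 2: fine-tuning on a small parallel emotional corpus (e.g. neutral -> angry).
emotional_batches = [(torch.randn(8, 100, 80), torch.randn(8, 100, 80))]
run_stage(emotional_batches, epochs=2)
```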
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech [91.92456020841438]
We study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN).
We propose a speaker-dependent EVC framework that includes two VAW-GAN pipelines, one for spectrum conversion, and another for prosody conversion.
Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
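A toy sketch of the two-pipeline layout (one converter for spectrum, one for prosody/F0); the VAE and Wasserstein-GAN training objectives of the actual VAW-GAN are omitted, and all shapes are assumptions:

```python
import torch
import torch.nn as nn

class EncDec(nn.Module):
    """Toy encoder-decoder standing in for one conversion pipeline: encode an
    emotion-independent code, then decode it together with a target-emotion embedding."""
    def __init__(self, feat_dim, emo_dim=8, hidden=64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, hidden)
        self.dec = nn.Linear(hidden + emo_dim, feat_dim)

    def forward(self, feats, emo_embedding):                 # feats: (batch, time, feat_dim)
        z = torch.tanh(self.enc(feats))                      # emotion-independent code
        cond = emo_embedding.expand(feats.size(0), feats.size(1), -1)
        return self.dec(torch.cat([z, cond], dim=-1))        # re-compose with target emotion

spectrum_pipeline = EncDec(feat_dim=80)    # converts spectral frames (e.g. mels)
prosody_pipeline = EncDec(feat_dim=1)      # converts the F0 contour
emo = torch.randn(1, 1, 8)                 # target-emotion embedding (illustrative)
mels, f0 = torch.randn(2, 100, 80), torch.randn(2, 100, 1)
converted_mels = spectrum_pipeline(mels, emo)
converted_f0 = prosody_pipeline(f0, emo)
```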
arXiv Detail & Related papers (2020-11-03T08:49:33Z)
- Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion [83.14445041096523]
Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.
We propose a speaker-independent emotional voice conversion framework, that can convert anyone's emotion without the need for parallel data.
Experiments show that the proposed speaker-independent framework achieves competitive results for both seen and unseen speakers.
arXiv Detail & Related papers (2020-05-13T13:36:34Z)