Textless Speech Emotion Conversion using Decomposed and Discrete
Representations
- URL: http://arxiv.org/abs/2111.07402v1
- Date: Sun, 14 Nov 2021 18:16:42 GMT
- Title: Textless Speech Emotion Conversion using Decomposed and Discrete
Representations
- Authors: Felix Kreuk, Adam Polyak, Jade Copet, Eugene Kharitonov, Tu-Anh
Nguyen, Morgane Rivière, Wei-Ning Hsu, Abdelrahman Mohamed, Emmanuel
Dupoux, Yossi Adi
- Abstract summary: We decompose speech into discrete and disentangled learned representations, consisting of content units, F0, speaker, and emotion.
First, we modify the speech content by translating the content units to a target emotion, and then predict the prosodic features based on these units.
Finally, the speech waveform is generated by feeding the predicted representations into a neural vocoder.
- Score: 49.55101900501656
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech emotion conversion is the task of modifying the perceived emotion of a
speech utterance while preserving the lexical content and speaker identity. In
this study, we cast the problem of emotion conversion as a spoken language
translation task. We decompose speech into discrete and disentangled learned
representations, consisting of content units, F0, speaker, and emotion. First,
we modify the speech content by translating the content units to a target
emotion, and then predict the prosodic features based on these units. Finally,
the speech waveform is generated by feeding the predicted representations into
a neural vocoder. Such a paradigm allows us to go beyond spectral and
parametric changes of the signal, and model non-verbal vocalizations, such as
laughter insertion, yawning removal, etc. We demonstrate objectively and
subjectively that the proposed method is superior to the baselines in terms of
perceived emotion and audio quality. We rigorously evaluate all components of
such a complex system and conclude with an extensive model analysis and
ablation study to better emphasize the architectural choices, strengths and
weaknesses of the proposed method. Samples and code will be publicly available
under the following link: https://speechbot.github.io/emotion.
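A minimal sketch of the three-stage pipeline described in the abstract, using toy stand-ins for the unit translation and prosody models. Module names, sizes, and the greedy decoding are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class UnitTranslator(nn.Module):
    """Toy stand-in for the unit-to-unit translation step (cast as a translation task in the paper)."""
    def __init__(self, n_units=100, n_emotions=4, dim=64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.emo_emb = nn.Embedding(n_emotions, dim)
        self.proj = nn.Linear(dim, n_units)

    def forward(self, units, emotion):
        h = self.unit_emb(units) + self.emo_emb(emotion)[:, None, :]
        return self.proj(h).argmax(-1)          # "translated" discrete unit ids

class ProsodyPredictor(nn.Module):
    """Toy stand-in predicting per-unit prosody (F0 and, as an assumption here, duration)."""
    def __init__(self, n_units=100, n_emotions=4, dim=64):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, dim)
        self.emo_emb = nn.Embedding(n_emotions, dim)
        self.head = nn.Linear(dim, 2)           # (log-F0, log-duration) per unit

    def forward(self, units, emotion):
        h = self.unit_emb(units) + self.emo_emb(emotion)[:, None, :]
        return self.head(h).unbind(-1)

# Dummy usage: one utterance of 50 discrete content units, target emotion id 2.
units = torch.randint(0, 100, (1, 50))          # content units from a self-supervised encoder (illustrative)
emotion = torch.tensor([2])
translated = UnitTranslator()(units, emotion)
f0, duration = ProsodyPredictor()(translated, emotion)
# The real system feeds the translated units, predicted prosody, speaker, and
# emotion representations into a neural vocoder to generate the output waveform.
```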
Related papers
- AffectEcho: Speaker Independent and Language-Agnostic Emotion and Affect
Transfer for Speech Synthesis [13.918119853846838]
Affect is an emotional characteristic encompassing valence, arousal, and intensity, and is a crucial attribute for enabling authentic conversations.
We propose AffectEcho, an emotion translation model that uses a Vector Quantized codebook to model emotions within a quantized space.
We demonstrate the effectiveness of our approach in controlling the emotions of generated speech while preserving identity, style, and emotional cadence unique to each speaker.
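The summary does not give AffectEcho's codebook details; as a generic illustration of modeling emotion within a quantized space, a nearest-neighbour vector-quantization lookup might look like this (codebook size and dimensionality are assumptions):

```python
import torch
import torch.nn as nn

class EmotionVQ(nn.Module):
    """Generic VQ lookup: snap a continuous affect embedding to its nearest codebook entry."""
    def __init__(self, codebook_size=64, dim=128):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, affect_embedding):                      # (batch, dim)
        dists = torch.cdist(affect_embedding, self.codebook.weight)
        indices = dists.argmin(dim=-1)                        # nearest code per example
        return self.codebook(indices), indices                # quantized emotion, code ids

quantized_emotion, codes = EmotionVQ()(torch.randn(2, 128))   # dummy affect embeddings
```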
arXiv Detail & Related papers (2023-08-16T06:28:29Z)
- Learning Multilingual Expressive Speech Representation for Prosody Prediction without Parallel Data [0.0]
We propose a method for speech-to-speech emotion translation that operates at the level of discrete speech units.
We show that this embedding can be used to predict the pitch and duration of speech units in a target language.
We evaluate our approach on English and French speech and show that it outperforms a baseline method.
arXiv Detail & Related papers (2023-06-29T08:06:54Z)
- In-the-wild Speech Emotion Conversion Using Disentangled Self-Supervised Representations and Neural Vocoder-based Resynthesis [15.16865739526702]
We introduce a methodology that uses self-supervised networks to disentangle the lexical, speaker, and emotional content of the utterance.
We then use a HiFiGAN vocoder to resynthesise the disentangled representations to a speech signal of the targeted emotion.
Results reveal that the proposed approach is aptly conditioned on the emotional content of input speech and is capable of synthesising natural-sounding speech for a target emotion.
arXiv Detail & Related papers (2023-06-02T21:02:51Z)
- SpeechFormer++: A Hierarchical Efficient Framework for Paralinguistic Speech Processing [17.128885611538486]
Paralinguistic speech processing is important for many applications, such as sentiment analysis and neurocognitive disorder assessment.
We consider the characteristics of speech and propose a general structure-based framework, called SpeechFormer++, for paralinguistic speech processing.
SpeechFormer++ is evaluated on the speech emotion recognition (IEMOCAP & MELD), depression classification (DAIC-WOZ) and Alzheimer's disease detection (Pitt) tasks.
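SpeechFormer++'s exact merging rules and attention design are not reproduced here; the sketch below only illustrates the general idea of a structure-based hierarchy that encodes frames and progressively merges them toward coarser units (pooling sizes and dimensions are arbitrary assumptions):

```python
import torch
import torch.nn as nn

class HierarchicalEncoder(nn.Module):
    """Generic structure-based hierarchy: encode frames, then merge neighbouring
    tokens between stages to move toward coarser (phoneme/word-like) granularity."""
    def __init__(self, dim=64, pool_sizes=(5, 4)):
        super().__init__()
        self.stages = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
            for _ in range(len(pool_sizes) + 1)
        )
        self.pools = nn.ModuleList(nn.AvgPool1d(k, stride=k) for k in pool_sizes)

    def forward(self, frames):                                # (batch, time, dim)
        x = self.stages[0](frames)
        for stage, pool in zip(self.stages[1:], self.pools):
            x = pool(x.transpose(1, 2)).transpose(1, 2)       # merge neighbouring frames
            x = stage(x)
        return x.mean(dim=1)                                  # utterance-level vector

utterance_repr = HierarchicalEncoder()(torch.randn(2, 200, 64))  # dummy frame features
```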
arXiv Detail & Related papers (2023-02-27T11:48:54Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- EMOVIE: A Mandarin Emotion Speech Dataset with a Simple Emotional Text-to-Speech Model [56.75775793011719]
We introduce and publicly release a Mandarin emotional speech dataset of 9,724 samples with audio files and human-labeled emotion annotations.
Unlike models that require additional reference audio as input, our model can predict emotion labels directly from the input text and generate more expressive speech conditioned on the emotion embedding.
In the experiments, we first validate the effectiveness of our dataset on an emotion classification task, then train our model on the proposed dataset and conduct a series of subjective evaluations.
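A minimal sketch of the described idea of predicting an emotion label from text alone and conditioning synthesis on the resulting emotion embedding; the text encoder, vocabulary, and emotion inventory below are illustrative assumptions, not the paper's model:

```python
import torch
import torch.nn as nn

class TextEmotionPredictor(nn.Module):
    """Toy stand-in: classify the emotion of input text and return a soft emotion
    embedding that could condition an expressive TTS decoder."""
    def __init__(self, vocab_size=1000, n_emotions=5, dim=64):
        super().__init__()
        self.text_enc = nn.EmbeddingBag(vocab_size, dim)     # bag-of-tokens text encoder
        self.classifier = nn.Linear(dim, n_emotions)
        self.emo_table = nn.Embedding(n_emotions, dim)

    def forward(self, token_ids):                            # (batch, seq_len)
        probs = self.classifier(self.text_enc(token_ids)).softmax(-1)
        emotion_embedding = probs @ self.emo_table.weight    # (batch, dim)
        return probs, emotion_embedding

# Dummy usage: two "sentences" of 12 token ids each.
probs, emo_embedding = TextEmotionPredictor()(torch.randint(0, 1000, (2, 12)))
# emo_embedding would be injected into the TTS model to control expressiveness.
```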
arXiv Detail & Related papers (2021-06-17T08:34:21Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
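A hedged sketch of a generic two-stage strategy of the kind described: pretrain a conversion model on abundant TTS-style data, then fine-tune on a small emotional corpus. The stand-in model, loss, and random tensors are placeholders, not the paper's sequence-to-sequence architecture:

```python
import torch
import torch.nn as nn

# Stand-in frame-level converter: maps source mel frames to target mel frames.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
l1 = nn.L1Loss()

def run_stage(batches, epochs):
    for _ in range(epochs):
        for src_mel, tgt_mel in batches:
            loss = l1(model(src_mel), tgt_mel)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: initialization on abundant (non-emotional) TTS-style data.
tts_batches = [(torch.randn(8, 100, 80), torch.randn(8, 100, 80)) for _ in range(4)]
run_stage(tts_batches, epochs=2)

# Stage 2: fine-tuning on a small parallel emotional corpus (e.g. neutral -> angry).
emotional_batches = [(torch.randn(8, 100, 80), torch.randn(8, 100, 80))]
run_stage(emotional_batches, epochs=2)
```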
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- VAW-GAN for Disentanglement and Recomposition of Emotional Elements in Speech [91.92456020841438]
We study the disentanglement and recomposition of emotional elements in speech through a variational autoencoding Wasserstein generative adversarial network (VAW-GAN).
We propose a speaker-dependent EVC framework that includes two VAW-GAN pipelines, one for spectrum conversion, and another for prosody conversion.
Experiments validate the effectiveness of our proposed method in both objective and subjective evaluations.
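A toy sketch of the two-pipeline layout (one converter for spectrum, one for prosody/F0); the VAE and Wasserstein-GAN training objectives of the actual VAW-GAN are omitted, and all shapes are assumptions:

```python
import torch
import torch.nn as nn

class EncDec(nn.Module):
    """Toy encoder-decoder standing in for one conversion pipeline: encode an
    emotion-independent code, then decode it together with a target-emotion embedding."""
    def __init__(self, feat_dim, emo_dim=8, hidden=64):
        super().__init__()
        self.enc = nn.Linear(feat_dim, hidden)
        self.dec = nn.Linear(hidden + emo_dim, feat_dim)

    def forward(self, feats, emo_embedding):                 # feats: (batch, time, feat_dim)
        z = torch.tanh(self.enc(feats))                      # emotion-independent code
        cond = emo_embedding.expand(feats.size(0), feats.size(1), -1)
        return self.dec(torch.cat([z, cond], dim=-1))        # re-compose with target emotion

spectrum_pipeline = EncDec(feat_dim=80)    # converts spectral frames (e.g. mels)
prosody_pipeline = EncDec(feat_dim=1)      # converts the F0 contour
emo = torch.randn(1, 1, 8)                 # target-emotion embedding (illustrative)
mels, f0 = torch.randn(2, 100, 80), torch.randn(2, 100, 1)
converted_mels = spectrum_pipeline(mels, emo)
converted_f0 = prosody_pipeline(f0, emo)
```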
arXiv Detail & Related papers (2020-11-03T08:49:33Z)
- Converting Anyone's Emotion: Towards Speaker-Independent Emotional Voice Conversion [83.14445041096523]
Emotional voice conversion aims to convert the emotion of speech from one state to another while preserving the linguistic content and speaker identity.
We propose a speaker-independent emotional voice conversion framework, that can convert anyone's emotion without the need for parallel data.
Experiments show that the proposed speaker-independent framework achieves competitive results for both seen and unseen speakers.
arXiv Detail & Related papers (2020-05-13T13:36:34Z)