Controllable Generation of Artificial Speaker Embeddings through
Discovery of Principal Directions
- URL: http://arxiv.org/abs/2310.17502v1
- Date: Thu, 26 Oct 2023 15:54:12 GMT
- Title: Controllable Generation of Artificial Speaker Embeddings through
Discovery of Principal Directions
- Authors: Florian Lux, Pascal Tilli, Sarina Meyer, Ngoc Thang Vu
- Abstract summary: We propose a method to generate artificial speaker embeddings that cannot be linked to a real human.
The controllable embeddings can be fed to a speech synthesis system conditioned on embeddings of real humans during training.
- Score: 29.03308434639149
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Customizing voice and speaking style in a speech synthesis system with
intuitive and fine-grained controls is challenging, given that little data with
appropriate labels is available. Furthermore, editing an existing human's voice
also comes with ethical concerns. In this paper, we propose a method to
generate artificial speaker embeddings that cannot be linked to a real human
while offering intuitive and fine-grained control over the voice and speaking
style of the embeddings, without requiring any labels for speaker or style. The
artificial and controllable embeddings can be fed to a speech synthesis system,
conditioned on embeddings of real humans during training, without sacrificing
privacy during inference.
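To make the idea in the title and abstract concrete, below is a minimal, hypothetical sketch of how principal directions over a pool of real speaker embeddings could be discovered and then used to sample and edit an artificial embedding. It assumes a PCA-based decomposition and Gaussian sampling in the principal subspace; the paper's actual method, embedding extractor, and sampling scheme are not specified in the abstract, so all function names and choices here are illustrative assumptions, not the authors' algorithm.

```python
# Hedged sketch: discover principal directions in a space of real speaker
# embeddings, sample a new point from the fitted distribution (so it matches
# no single real speaker), and edit it along individual directions for coarse,
# label-free voice/style control. Names and choices are illustrative only.

import numpy as np
from sklearn.decomposition import PCA

def fit_principal_directions(real_embeddings: np.ndarray, n_components: int = 16) -> PCA:
    """Fit PCA on embeddings of real speakers (shape: [num_speakers, dim])."""
    pca = PCA(n_components=n_components)
    pca.fit(real_embeddings)
    return pca

def sample_artificial_embedding(pca: PCA, rng: np.random.Generator) -> np.ndarray:
    """Draw PCA coefficients from a Gaussian matched to the training spread,
    so the sample stays on the speaker manifold without copying a real voice."""
    coeffs = rng.normal(0.0, np.sqrt(pca.explained_variance_))
    return pca.mean_ + pca.components_.T @ coeffs

def edit_along_direction(pca: PCA, embedding: np.ndarray, direction: int, step: float) -> np.ndarray:
    """Nudge an embedding along one principal direction; since no labels are
    assumed, the mapping from directions to perceived attributes would be
    found by listening to synthesized samples."""
    return embedding + step * pca.components_[direction]

# Usage with placeholder data: 500 real speakers, 256-dim embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 256))
pca = fit_principal_directions(real)
artificial = sample_artificial_embedding(pca, rng)
edited = edit_along_direction(pca, artificial, direction=0, step=1.5)
```

The edited embedding would then be passed to a speech synthesis model that was conditioned on real-speaker embeddings during training, as described in the abstract.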
Related papers
- Coding Speech through Vocal Tract Kinematics [5.0751585360524425]
Articulatory features trace the kinematic shapes of vocal tract articulators and source features, and are intuitively interpretable and controllable.
Speaker embedding is effectively disentangled from articulations, which enables accent-preserving zero-shot voice conversion.
arXiv Detail & Related papers (2024-06-18T18:38:17Z) - ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis [50.69464138626748]
We present ConvoFusion, a diffusion-based approach for multi-modal gesture synthesis.
Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities.
Our method is versatile in that it can be trained to generate either monologue gestures or conversational gestures.
arXiv Detail & Related papers (2024-03-26T17:59:52Z) - Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z) - Towards Spontaneous Style Modeling with Semi-supervised Pre-training for
Conversational Text-to-Speech Synthesis [53.511443791260206]
We propose a semi-supervised pre-training method to increase the amount of spontaneous-style speech and spontaneous behavioral labels.
During semi-supervised learning, both text and speech information are considered for detecting spontaneous behavior labels in speech.
arXiv Detail & Related papers (2023-08-31T09:50:33Z) - EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech
Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z) - Zero-shot personalized lip-to-speech synthesis with face image based
voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z) - Controllable speech synthesis by learning discrete phoneme-level
prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which discretizes phoneme-level F0 and duration features from a multispeaker speech dataset (a minimal clustering sketch appears after this list).
arXiv Detail & Related papers (2022-11-29T15:43:36Z) - Simple and Effective Unsupervised Speech Synthesis [97.56065543192699]
We introduce the first unsupervised speech synthesis system based on a simple, yet effective recipe.
Using only unlabeled speech audio and unlabeled text as well as a lexicon, our method enables speech synthesis without the need for a human-labeled corpus.
arXiv Detail & Related papers (2022-04-06T00:19:13Z) - Expressive Neural Voice Cloning [12.010555227327743]
We propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker.
We show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker.
arXiv Detail & Related papers (2021-01-30T05:09:57Z) - From Speaker Verification to Multispeaker Speech Synthesis, Deep
Transfer with Feedback Constraint [11.982748481062542]
This paper presents a system involving feedback constraint for multispeaker speech synthesis.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
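For the phoneme-level prosodic representation entry above ("Controllable speech synthesis by learning discrete phoneme-level prosodic representations"), here is a minimal sketch, under illustrative assumptions, of how unsupervised clustering could discretize phoneme-level F0 and duration features into discrete prosody labels with k-means. The cluster count, the joint (rather than separate) clustering of the two features, and all names are assumptions, not that paper's exact pipeline.

```python
# Hedged sketch: assign each phoneme a discrete prosody label by clustering its
# (F0, duration) statistics with k-means. Illustrative only; the cited paper's
# feature extraction and clustering setup may differ.

import numpy as np
from sklearn.cluster import KMeans

def cluster_prosody(phoneme_f0: np.ndarray, phoneme_dur: np.ndarray, n_clusters: int = 5) -> np.ndarray:
    """Return a discrete prosody label per phoneme from its (F0, duration) pair."""
    # Standardize so F0 (Hz) and duration (s) contribute comparably.
    feats = np.stack([phoneme_f0, phoneme_dur], axis=1)
    feats = (feats - feats.mean(axis=0)) / feats.std(axis=0)
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(feats)

# Usage with placeholder per-phoneme statistics.
rng = np.random.default_rng(0)
labels = cluster_prosody(rng.uniform(80, 300, size=1000), rng.uniform(0.03, 0.3, size=1000))
```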