Expressive Neural Voice Cloning
- URL: http://arxiv.org/abs/2102.00151v1
- Date: Sat, 30 Jan 2021 05:09:57 GMT
- Title: Expressive Neural Voice Cloning
- Authors: Paarth Neekhara, Shehzeen Hussain, Shlomo Dubnov, Farinaz Koushanfar,
Julian McAuley
- Abstract summary: We propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker.
We show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker.
- Score: 12.010555227327743
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice cloning is the task of learning to synthesize the voice of an unseen
speaker from a few samples. While current voice cloning methods achieve
promising results in Text-to-Speech (TTS) synthesis for a new voice, these
approaches lack the ability to control the expressiveness of synthesized audio.
In this work, we propose a controllable voice cloning method that allows
fine-grained control over various style aspects of the synthesized speech for
an unseen speaker. We achieve this by explicitly conditioning the speech
synthesis model on a speaker encoding, pitch contour and latent style tokens
during training. Through both quantitative and qualitative evaluations, we show
that our framework can be used for various expressive voice cloning tasks using
only a few transcribed or untranscribed speech samples for a new speaker. These
cloning tasks include style transfer from a reference speech, synthesizing
speech directly from text, and fine-grained style control by manipulating the
style conditioning variables during inference.
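The conditioning scheme the abstract describes can be sketched concretely. Below is a minimal PyTorch sketch, not the authors' implementation: it combines a speaker encoding, a frame-aligned pitch contour, and a GST-style bank of latent style tokens into a single conditioning input for a Tacotron-style decoder. All module names, dimensions, and layer choices here are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleTokenLayer(nn.Module):
    """GST-style attention over a bank of learned style embeddings.
    A reference encoding attends over the token bank; the weighted
    sum of tokens is the latent style vector."""
    def __init__(self, n_tokens=10, token_dim=256, ref_dim=128):
        super().__init__()
        self.tokens = nn.Parameter(torch.randn(n_tokens, token_dim))
        self.query = nn.Linear(ref_dim, token_dim)

    def forward(self, ref_encoding):                  # (B, ref_dim)
        q = self.query(ref_encoding)                  # (B, token_dim)
        scores = q @ self.tokens.t() / self.tokens.size(1) ** 0.5
        weights = F.softmax(scores, dim=-1)           # (B, n_tokens)
        return weights @ self.tokens                  # (B, token_dim)

class ExpressiveConditioning(nn.Module):
    """Concatenates the three conditioning signals named in the
    abstract: a speaker encoding, a frame-level pitch contour, and
    latent style tokens, broadcast against the text-encoder output."""
    def __init__(self, text_dim=512, spk_dim=256, token_dim=256):
        super().__init__()
        self.gst = StyleTokenLayer(token_dim=token_dim)
        self.pitch_proj = nn.Linear(1, 32)  # embed scalar F0 per frame

    def forward(self, text_enc, spk_emb, pitch, ref_encoding):
        # text_enc: (B, T, text_dim); spk_emb: (B, spk_dim)
        # pitch: (B, T, 1) frame-aligned F0; ref_encoding: (B, 128)
        style = self.gst(ref_encoding)                       # (B, token_dim)
        T = text_enc.size(1)
        spk = spk_emb.unsqueeze(1).expand(-1, T, -1)
        sty = style.unsqueeze(1).expand(-1, T, -1)
        f0 = self.pitch_proj(pitch)                          # (B, T, 32)
        return torch.cat([text_enc, spk, sty, f0], dim=-1)   # decoder input

# Example shapes: batch of 2, 40 text frames.
cond = ExpressiveConditioning()
out = cond(torch.randn(2, 40, 512), torch.randn(2, 256),
           torch.randn(2, 40, 1), torch.randn(2, 128))
print(out.shape)  # torch.Size([2, 40, 1056])
```

Because the style vector is a convex combination of a small token bank, manipulating the attention weights at inference time gives the fine-grained style control the abstract mentions.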
Related papers
- Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt [50.25271407721519]
We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language.
We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation.
Experiments show that our model achieves favorable controlling ability and audio quality.
arXiv Detail & Related papers (2024-03-18T13:39:05Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process that discretizes phoneme-level F0 and duration features from a multispeaker speech dataset (a toy sketch of this clustering step appears after this list).
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech [25.707717591185386]
We show that it is possible to clone the voice of a speaker as well as the prosody of a spoken reference independently without any degradation in quality.
All of our code and trained models are available, alongside static and interactive demos.
arXiv Detail & Related papers (2022-06-24T11:54:59Z)
- Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention [0.0]
We propose a variant of attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech.
Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention.
We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances.
arXiv Detail & Related papers (2022-01-25T15:06:07Z)
- Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning [37.73490851004852]
The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims to transfer the speaking style of an arbitrary source speaker to a target speaker's voice using a very limited amount of neutral data.
This is a very challenging task since the learning algorithm needs to deal with few-shot voice cloning and speaker-prosody disentanglement at the same time.
In this paper, we approach this challenging fast few-shot style transfer task for voice cloning using meta learning.
arXiv Detail & Related papers (2021-11-14T01:30:37Z)
- Towards Multi-Scale Style Control for Expressive Speech Synthesis [60.08928435252417]
The proposed method employs a multi-scale reference encoder to extract both the global-scale utterance-level and the local-scale quasi-phoneme-level style features of the target speech.
During training, the multi-scale style model is jointly trained with the speech synthesis model in an end-to-end fashion.
arXiv Detail & Related papers (2021-04-08T05:50:09Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices from untranscribed speech of target speakers by adapting via backpropagation.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)
- From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint [11.982748481062542]
This paper presents a multispeaker speech synthesis system with a feedback constraint.
We enhance knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network as a feedback constraint.
The model is trained and evaluated on publicly available datasets.
arXiv Detail & Related papers (2020-05-10T06:11:37Z)
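As noted in the entry on discrete phoneme-level prosodic representations above, here is a toy sketch of what such an unsupervised prosodic clustering step could look like: k-means over per-phoneme F0 and duration statistics yields the intuitive discrete labels that paper describes. This is a guess at the general technique, not that paper's exact pipeline; the function name, feature extraction, and synthetic data are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_prosody(phoneme_f0, phoneme_dur, n_labels=5, seed=0):
    """phoneme_f0, phoneme_dur: 1-D arrays with one mean-F0 value (Hz)
    and one duration (seconds) per phoneme, pooled over the corpus.
    Returns a discrete prosody label per phoneme and the fitted model."""
    feats = np.stack([phoneme_f0, phoneme_dur], axis=1)
    # Standardize so F0 (tens of Hz) and duration (fractions of a second)
    # contribute comparably to the Euclidean distance used by k-means.
    feats = (feats - feats.mean(0)) / feats.std(0)
    km = KMeans(n_clusters=n_labels, n_init=10, random_state=seed).fit(feats)
    return km.labels_, km

# Example: 1000 phonemes with randomly generated prosody statistics.
rng = np.random.default_rng(0)
labels, model = cluster_prosody(rng.normal(180, 30, 1000),
                                rng.gamma(2.0, 0.04, 1000))
print(np.bincount(labels))  # how often each discrete label is used
```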
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.