Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning
- URL: http://arxiv.org/abs/2111.07218v1
- Date: Sun, 14 Nov 2021 01:30:37 GMT
- Title: Meta-Voice: Fast few-shot style transfer for expressive voice cloning using meta learning
- Authors: Songxiang Liu, Dan Su, Dong Yu
- Abstract summary: The task of few-shot style transfer for voice cloning in text-to-speech (TTS) synthesis aims at transferring the speaking styles of an arbitrary source speaker to a target speaker's voice using a very limited amount of neutral data.
This is a very challenging task since the learning algorithm needs to deal with few-shot voice cloning and speaker-prosody disentanglement at the same time.
In this paper, we approach this hard task of fast few-shot style transfer for voice cloning using meta-learning.
- Score: 37.73490851004852
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The task of few-shot style transfer for voice cloning in text-to-speech (TTS)
synthesis aims at transferring the speaking styles of an arbitrary source
speaker to a target speaker's voice using a very limited amount of neutral
data. This is a very challenging task, since the learning algorithm needs to
deal with few-shot voice cloning and speaker-prosody disentanglement at the
same time. Accelerating the adaptation process for a new target speaker is
important in real-world applications, but is even more challenging. In this
paper, we approach this hard task of fast few-shot style transfer for voice
cloning using meta-learning. We investigate the model-agnostic meta-learning
(MAML) algorithm and meta-transfer a pre-trained multi-speaker, multi-prosody
base TTS model so that it is highly sensitive to adaptation with few samples.
A domain-adversarial training mechanism and an orthogonal constraint are
adopted to disentangle speaker and prosody representations for effective
cross-speaker style transfer. Experimental results show that the proposed
approach can conduct fast voice cloning using only 5 samples (around 12
seconds of speech) from a target speaker, with only 100 adaptation steps.
Audio samples are available online.
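The two ingredients above, fast few-shot adaptation and speaker-prosody
disentanglement, can be pictured with a short sketch. The following is a
minimal, hypothetical PyTorch rendering, not the authors' implementation: the
model interface (a model returning a reconstruction loss plus speaker and
prosody embeddings), the loss weight, and the learning rate are all
assumptions; only the 5-sample / 100-step setting comes from the abstract.

```python
# Hypothetical sketch of the adaptation phase described in the abstract.
# Assumed interface: model(batch) returns a reconstruction loss plus the
# speaker and prosody embeddings computed for that batch.
import torch
import torch.nn.functional as F


def orthogonality_penalty(spk_emb: torch.Tensor, pro_emb: torch.Tensor) -> torch.Tensor:
    """One common form of an orthogonal constraint: drive the batch
    cross-correlation between speaker and prosody embeddings to zero."""
    spk = F.normalize(spk_emb, dim=-1)  # (batch, spk_dim)
    pro = F.normalize(pro_emb, dim=-1)  # (batch, pro_dim)
    return (spk.transpose(0, 1) @ pro).pow(2).sum()


def adapt_to_target_speaker(model, batch, n_steps=100, lr=1e-4, ortho_weight=0.02):
    """Few-shot adaptation: fine-tune the meta-trained base TTS model on the
    target speaker's ~5 samples for ~100 steps (the setting reported above)."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(n_steps):
        recon_loss, spk_emb, pro_emb = model(batch)
        loss = recon_loss + ortho_weight * orthogonality_penalty(spk_emb, pro_emb)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model
```

During meta-training, MAML would wrap this inner loop in an outer update
across many speakers so that the initialization itself becomes easy to adapt;
the domain-adversarial branch (for example, a gradient-reversal speaker
classifier on the prosody embedding) would contribute a further loss term and
is omitted here for brevity.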
Related papers
- Multi-modal Adversarial Training for Zero-Shot Voice Cloning [9.823246184635103]
We propose a Transformer encoder-decoder architecture to conditionally discriminate between real and generated speech features.
We introduce our novel adversarial training technique by applying it to a FastSpeech2 acoustic model and training on Libriheavy, a large multi-speaker dataset.
Our model achieves improvements over the baseline in terms of speech quality and speaker similarity.
arXiv Detail & Related papers (2024-08-28T16:30:41Z)
- Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language [0.4810348726854312]
A neural vocal cloning system can mimic someone's voice using just a few audio samples.
Speaker encoding and speaker adaptation are topics of research in the field of voice cloning.
The main goal is to create a vocal cloning system that produces audio output with a Nepali accent.
arXiv Detail & Related papers (2024-08-19T16:15:09Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows great improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data [11.18504333789534]
We propose to use low-quality code-switched found data from the non-target speakers to achieve cross-lingual voice cloning for the target speakers.
Experiments show that our proposed method can generate high-quality code-switched speech in the target voices in terms of both naturalness and speaker consistency.
arXiv Detail & Related papers (2021-10-14T08:16:06Z)
- Many-to-Many Voice Conversion based Feature Disentanglement using Variational Autoencoder [2.4975981795360847]
We propose a new method based on feature disentanglement to tackle many-to-many voice conversion.
The method has the capability to disentangle speaker identity and linguistic content from utterances.
It can convert from many source speakers to many target speakers with a single autoencoder network (a minimal sketch of this idea follows the list).
arXiv Detail & Related papers (2021-07-11T13:31:16Z)
- Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- Investigating on Incorporating Pretrained and Learnable Speaker Representations for Multi-Speaker Multi-Style Text-to-Speech [54.75722224061665]
In this work, we investigate different speaker representations and propose to integrate pretrained and learnable speaker representations.
The FastSpeech 2 model combined with both pretrained and learnable speaker representations shows great generalization ability on few-shot speakers.
arXiv Detail & Related papers (2021-03-06T10:14:33Z)
- Expressive Neural Voice Cloning [12.010555227327743]
We propose a controllable voice cloning method that allows fine-grained control over various style aspects of the synthesized speech for an unseen speaker.
We show that our framework can be used for various expressive voice cloning tasks using only a few transcribed or untranscribed speech samples for a new speaker.
arXiv Detail & Related papers (2021-01-30T05:09:57Z)
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottle-neck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
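Several of the entries above turn on the same disentanglement idea that
Meta-Voice uses. As one concrete illustration, here is a minimal, hypothetical
sketch in the spirit of the variational-autoencoder approach from the
"Many-to-Many Voice Conversion" entry: one encoder yields an utterance-level
speaker code, another a frame-level content code, and conversion recombines
source content with a target speaker code. All module choices, names, and
shapes are assumptions, not taken from that paper.

```python
# Illustrative sketch only: speaker/content disentanglement with a VAE-style
# content encoder; conversion swaps in the target speaker's code.
import torch
import torch.nn as nn


class DisentanglingVAE(nn.Module):
    def __init__(self, n_mels: int = 80, spk_dim: int = 64, con_dim: int = 128):
        super().__init__()
        self.spk_enc = nn.GRU(n_mels, spk_dim, batch_first=True)
        self.con_enc = nn.GRU(n_mels, 2 * con_dim, batch_first=True)  # -> mu, logvar
        self.dec = nn.GRU(spk_dim + con_dim, n_mels, batch_first=True)

    def forward(self, mel: torch.Tensor):
        # mel: (batch, frames, n_mels)
        _, spk = self.spk_enc(mel)                 # utterance-level speaker code
        con, _ = self.con_enc(mel)                 # frame-level content code
        mu, logvar = con.chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        spk_seq = spk.transpose(0, 1).expand(-1, mel.size(1), -1)
        out, _ = self.dec(torch.cat([spk_seq, z], dim=-1))
        return out, mu, logvar

    @torch.no_grad()
    def convert(self, src_mel: torch.Tensor, tgt_mel: torch.Tensor) -> torch.Tensor:
        """Many-to-many conversion: source content + target speaker code."""
        _, tgt_spk = self.spk_enc(tgt_mel)
        con, _ = self.con_enc(src_mel)
        mu, _ = con.chunk(2, dim=-1)
        spk_seq = tgt_spk.transpose(0, 1).expand(-1, src_mel.size(1), -1)
        out, _ = self.dec(torch.cat([spk_seq, mu], dim=-1))
        return out


if __name__ == "__main__":
    model = DisentanglingVAE()
    src = torch.randn(1, 120, 80)   # mel frames from the source speaker
    tgt = torch.randn(1, 200, 80)   # reference audio from the target speaker
    converted = model.convert(src, tgt)  # (1, 120, 80) mel in the target voice
```

Training would add a reconstruction loss plus a KL term on (mu, logvar); the
single-network design is what lets one model cover many source and many
target speakers.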
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.