Voice Filter: Few-shot text-to-speech speaker adaptation using voice
conversion as a post-processing module
- URL: http://arxiv.org/abs/2202.08164v1
- Date: Wed, 16 Feb 2022 16:12:21 GMT
- Authors: Adam Gabryś, Goeric Huybrechts, Manuel Sam Ribeiro, Chung-Ming
Chien, Julian Roth, Giulia Comini, Roberto Barra-Chicote, Bartek Perz, Jaime
Lorenzo-Trueba
- Abstract summary: State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech.
When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations.
We propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker.
- Score: 16.369219400819134
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: State-of-the-art text-to-speech (TTS) systems require several hours of
recorded speech data to generate high-quality synthetic speech. When using
reduced amounts of training data, standard TTS models suffer from speech
quality and intelligibility degradations, making training low-resource TTS
systems problematic. In this paper, we propose a novel extremely low-resource
TTS method called Voice Filter that uses as little as one minute of speech from
a target speaker. It uses voice conversion (VC) as a post-processing module
appended to a pre-existing high-quality TTS system and marks a conceptual shift
in the existing TTS paradigm, framing the few-shot TTS problem as a VC task.
Furthermore, we propose to use a duration-controllable TTS system to create a
parallel speech corpus to facilitate the VC task. Results show that the Voice
Filter outperforms state-of-the-art few-shot speech synthesis techniques in
terms of objective and subjective metrics on one minute of speech on a diverse
set of voices, while being competitive against a TTS model built on 30 times
more data.
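The abstract's two ideas (a duration-controllable TTS system used to build a parallel corpus, and voice conversion applied as a post-processing step) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the names `Utterance`, `synthesize_with_durations`, `build_parallel_pair`, `fit_offset`, and `voice_filter` are hypothetical, frames are single numbers rather than mel-spectrograms, and the "conversion" is a fitted constant offset rather than a neural model. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Frame = List[float]  # toy stand-in for one acoustic frame


@dataclass
class Utterance:
    phonemes: List[str]
    durations: List[int]  # number of frames per phoneme
    frames: List[Frame]   # target-speaker frames, one per time step


def synthesize_with_durations(phonemes: List[str], durations: List[int]) -> List[Frame]:
    """Hypothetical duration-controllable TTS: emits exactly the requested
    number of frames for each phoneme (toy value = phoneme index)."""
    frames: List[Frame] = []
    for index, n_frames in enumerate(durations[: len(phonemes)]):
        frames.extend([float(index)] for _ in range(n_frames))
    return frames


def build_parallel_pair(target: Utterance) -> Tuple[List[Frame], List[Frame]]:
    """Synthesize the same text with the target's durations, so source and
    target frames line up one-to-one without any extra alignment step."""
    source = synthesize_with_durations(target.phonemes, target.durations)
    assert len(source) == len(target.frames)
    return source, target.frames


def fit_offset(source: List[Frame], target: List[Frame]) -> float:
    """Toy 'training': mean frame-wise difference between target and source."""
    diffs = [t[0] - s[0] for s, t in zip(source, target)]
    return sum(diffs) / len(diffs)


def voice_filter(frames: List[Frame], offset: float) -> List[Frame]:
    """Toy post-processing module applied to the TTS output."""
    return [[f[0] + offset] for f in frames]


# A two-phoneme target recording: 2 frames of "h", 3 frames of "i".
target = Utterance(
    phonemes=["h", "i"],
    durations=[2, 3],
    frames=[[1.0], [1.0], [2.0], [2.0], [2.0]],
)
source, target_frames = build_parallel_pair(target)
offset = fit_offset(source, target_frames)
converted = voice_filter(source, offset)
print(len(source), offset, converted == target.frames)
```

The point of the sketch is the frame-level pairing: because the synthetic source shares the target's durations, each source frame has a matching target frame, which is what makes the conversion step straightforward to train in this toy setting.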
Related papers
- Accent Conversion in Text-To-Speech Using Multi-Level VAE and Adversarial Training [14.323313455208183]
Inclusive speech technology aims to erase any biases towards specific groups, such as people with certain accents.
We propose a TTS model that utilizes a Multi-Level Variational Autoencoder with adversarial learning to address accented speech synthesis and conversion.
arXiv Detail & Related papers (2024-06-03T05:56:02Z)
- UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion [63.346825713704625]
Text-to-speech (TTS) and voice conversion (VC) are two different tasks that aim to generate high-quality speech from different input modalities.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
arXiv Detail & Related papers (2023-01-10T06:06:57Z)
- Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to a target speaker's voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale, text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- Meta-StyleSpeech: Multi-Speaker Adaptive Text-to-Speech Generation [63.561944239071615]
StyleSpeech is a new TTS model which synthesizes high-quality speech and adapts to new speakers.
With SALN, our model effectively synthesizes speech in the style of the target speaker even from a single speech audio sample.
We extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training.
arXiv Detail & Related papers (2021-06-06T15:34:11Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- AdaSpeech: Adaptive Text to Speech for Custom Voice [104.69219752194863]
We propose AdaSpeech, an adaptive TTS system for high-quality and efficient customization of new voices.
Experiment results show that AdaSpeech achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker.
arXiv Detail & Related papers (2021-03-01T13:28:59Z)
- NAUTILUS: a Versatile Voice Cloning System [44.700803634034486]
NAUTILUS can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker.
It can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm.
It achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech.
arXiv Detail & Related papers (2020-05-22T05:00:20Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.