Cross-speaker style transfer for text-to-speech using data augmentation
- URL: http://arxiv.org/abs/2202.05083v1
- Date: Thu, 10 Feb 2022 15:10:56 GMT
- Title: Cross-speaker style transfer for text-to-speech using data augmentation
- Authors: Manuel Sam Ribeiro, Julian Roth, Giulia Comini, Goeric Huybrechts,
Adam Gabrys, Jaime Lorenzo-Trueba
- Abstract summary: We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion.
We assume access to a corpus of neutral, non-expressive data from a target speaker and supporting conversational, expressive data from different speakers.
We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages.
- Score: 11.686745250628247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We address the problem of cross-speaker style transfer for text-to-speech
(TTS) using data augmentation via voice conversion. We assume access to a corpus
of neutral, non-expressive data from a target speaker and supporting
conversational, expressive data from different speakers. Our goal is to build a
TTS system that is expressive, while retaining the target speaker's identity.
The proposed approach relies on voice conversion to first generate high-quality
data from the set of supporting expressive speakers. The voice-converted data
is then pooled with natural data from the target speaker and used to train a
single-speaker multi-style TTS system. We provide evidence that this approach
is efficient, flexible, and scalable. The method is evaluated using one or more
supporting speakers, as well as a variable amount of supporting data. We
further provide evidence that this approach allows some controllability of
speaking style when using multiple supporting speakers. We conclude by scaling
our proposed technology to a set of 14 speakers across 7 languages. Results
indicate that our technology consistently improves synthetic samples in terms
of style similarity, while retaining the target speaker's identity.
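
The abstract describes a three-stage pipeline: voice-convert the supporting speakers' expressive data into the target voice, pool it with the target speaker's natural neutral data, then train a single-speaker multi-style TTS model. As a minimal illustration of the data-pooling logic only, here is a hypothetical sketch; `Utterance`, `convert_voice`, and `build_training_pool` are illustrative stand-ins, not the authors' implementation, and the voice-conversion step merely relabels metadata.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    audio_path: str  # path to a waveform or mel-spectrogram
    text: str        # transcript
    style: str       # e.g. "neutral" or "conversational"
    speaker: str     # speaker identity label

def convert_voice(source: Utterance, target_speaker: str) -> Utterance:
    """Stand-in for the voice-conversion step: a real system would map the
    supporting speaker's expressive audio into the target speaker's voice
    while preserving the expressive style. Here we only relabel metadata."""
    converted_path = source.audio_path.replace(".wav", f".vc_{target_speaker}.wav")
    return Utterance(converted_path, source.text, source.style, target_speaker)

def build_training_pool(target_neutral: List[Utterance],
                        supporting_expressive: List[Utterance],
                        target_speaker: str) -> List[Utterance]:
    # Step 1: voice-convert the supporting expressive data into the target voice.
    converted = [convert_voice(u, target_speaker) for u in supporting_expressive]
    # Step 2: pool the converted expressive data with the target speaker's
    # natural neutral recordings. The pooled corpus then trains a
    # single-speaker, multi-style TTS model, with `style` as the
    # conditioning label used to select the speaking style at inference.
    return target_neutral + converted

neutral = [Utterance("t_001.wav", "Hello.", "neutral", "target")]
expressive = [Utterance("s_001.wav", "Oh wow!", "conversational", "support_a")]
pool = build_training_pool(neutral, expressive, "target")
print(len(pool), {u.style for u in pool})  # 2 {'neutral', 'conversational'}
```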
Related papers
- Accent conversion using discrete units with parallel data synthesized from controllable accented TTS [56.18382038512251]
The goal of accent conversion (AC) is to convert speech accents while preserving content and speaker identity.
Previous methods either required reference utterances during inference, did not preserve speaker identity well, or used one-to-one systems that could only be trained for each non-native accent.
This paper presents a promising AC model that converts many accents into a native accent, overcoming these issues.
arXiv Detail & Related papers (2024-09-30T19:52:10Z)
- Speech-to-Speech Translation with Discrete-Unit-Based Style Transfer [53.72998363956454]
Direct speech-to-speech translation (S2ST) with discrete self-supervised representations has achieved remarkable accuracy.
The scarcity of high-quality speaker-parallel data poses a challenge for learning style transfer during translation.
We design an S2ST pipeline with style-transfer capability based on discrete self-supervised speech representations and timbre units.
arXiv Detail & Related papers (2023-09-14T09:52:08Z)
- ERNIE-SAT: Speech and Text Joint Pretraining for Cross-Lingual Multi-Speaker Text-to-Speech [58.93395189153713]
We extend the pretraining method for cross-lingual multi-speaker speech synthesis tasks.
We propose a speech-text joint pretraining framework, where we randomly mask the spectrogram and the phonemes.
Our model shows substantial improvements over speaker-embedding-based multi-speaker TTS methods.
arXiv Detail & Related papers (2022-11-07T13:35:16Z)
- Textless Speech-to-Speech Translation on Real Data [49.134208897722246]
We present a textless speech-to-speech translation (S2ST) system that can translate speech from one language into another.
We tackle the challenge of modeling multi-speaker target speech and train the systems with real-world S2ST data.
arXiv Detail & Related papers (2021-12-15T18:56:35Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize speech with high speaker similarity from a few enrollment samples, using fewer adaptation steps than the speaker-adaptation baseline; a minimal sketch of the MAML loop appears after this list.
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- Improve Cross-lingual Voice Cloning Using Low-quality Code-switched Data [11.18504333789534]
We propose to use low-quality code-switched found data from the non-target speakers to achieve cross-lingual voice cloning for the target speakers.
Experiments show that our proposed method can generate high-quality code-switched speech in the target voices in terms of both naturalness and speaker consistency.
arXiv Detail & Related papers (2021-10-14T08:16:06Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- Low-resource expressive text-to-speech using data augmentation [12.396086122947679]
We present a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data.
First, we augment data via voice conversion by leveraging recordings in the desired speaking style from other speakers.
Next, we use that synthetic data on top of the available recordings to train a TTS model.
arXiv Detail & Related papers (2020-11-11T11:22:37Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)
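
For the Meta-TTS entry above, the MAML loop it refers to can be illustrated with a first-order sketch on a toy linear-regression stand-in for the TTS network, where each task plays the role of a speaker. All names here are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, X, y):
    # Squared-error loss for a linear model standing in for a TTS network.
    pred = X @ w
    err = pred - y
    return float(np.mean(err ** 2)), 2 * X.T @ err / len(y)

def maml_step(w, tasks, inner_lr=0.05, outer_lr=0.01, inner_steps=1):
    """One MAML meta-update: adapt to each task (speaker) with a few
    gradient steps on its support set, then update the shared
    initialization using the query-set gradient at the adapted
    parameters (first-order approximation)."""
    meta_grad = np.zeros_like(w)
    for X_sup, y_sup, X_qry, y_qry in tasks:
        w_task = w.copy()
        for _ in range(inner_steps):            # speaker adaptation
            _, g = loss_and_grad(w_task, X_sup, y_sup)
            w_task -= inner_lr * g
        _, g_qry = loss_and_grad(w_task, X_qry, y_qry)
        meta_grad += g_qry                       # first-order MAML
    return w - outer_lr * meta_grad / len(tasks)

def make_task():
    # Toy "speaker": a linear map with its own parameters, split into
    # a support set (enrollment samples) and a query set.
    true_w = rng.normal(size=3)
    X = rng.normal(size=(16, 3))
    y = X @ true_w
    return X[:8], y[:8], X[8:], y[8:]

w = np.zeros(3)
for _ in range(100):
    w = maml_step(w, [make_task() for _ in range(4)])
```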