Learning Speaker Embedding from Text-to-Speech
- URL: http://arxiv.org/abs/2010.11221v1
- Date: Wed, 21 Oct 2020 18:03:16 GMT
- Title: Learning Speaker Embedding from Text-to-Speech
- Authors: Jaejin Cho, Piotr Zelasko, Jesus Villalba, Shinji Watanabe, Najim Dehak
- Abstract summary: We jointly trained end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion.
We investigated training TTS from either manual or ASR-generated transcripts.
Unsupervised TTS embeddings improved EER by 2.06% absolute with regard to i-vectors for the LibriTTS dataset.
- Score: 59.80309164404974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Zero-shot multi-speaker Text-to-Speech (TTS) generates target speaker voices
given an input text and the corresponding speaker embedding. In this work, we
investigate the effectiveness of the TTS reconstruction objective to improve
representation learning for speaker verification. We jointly trained end-to-end
Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion. We
hypothesize that the embeddings will contain minimal phonetic information since
the TTS decoder will obtain that information from the textual input. TTS
reconstruction can also be combined with speaker classification to enhance
these embeddings further. Once trained, the speaker encoder computes
representations for the speaker verification task, while the rest of the TTS
blocks are discarded. We investigated training TTS from either manual or
ASR-generated transcripts. The latter allows us to train embeddings on datasets
without manual transcripts. We compared ASR transcripts and Kaldi phone
alignments as TTS inputs, showing that the latter performed better due to their
finer resolution. Unsupervised TTS embeddings improved EER by 2.06% absolute
relative to i-vectors on the LibriTTS dataset. Combining TTS reconstruction with the
speaker classification loss improved EER by 0.28% and 0.73% absolute over a model
trained with only the speaker classification loss, on LibriTTS and Voxceleb1 respectively.
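To make the joint objective concrete, below is a minimal PyTorch-style sketch of TTS reconstruction combined with an optional speaker-classification term. The module structure, dimensions, the loss weight `alpha`, and the GRU stand-in for the Tacotron 2 decoder are illustrative assumptions rather than the authors' implementation; it also assumes frame-aligned text features (in the spirit of the frame-level phone alignments mentioned above) so predicted and target mel-spectrograms share a time axis.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpeakerEncoder(nn.Module):
    """Maps a mel-spectrogram (batch, frames, n_mels) to a fixed-size speaker embedding."""

    def __init__(self, n_mels: int = 80, hidden: int = 256, emb_dim: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, emb_dim)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        _, (h, _) = self.lstm(mel)                  # final hidden state summarizes the utterance
        return F.normalize(self.proj(h[-1]), dim=-1)


class JointTTSSpeakerModel(nn.Module):
    """Speaker encoder trained through a stand-in TTS decoder plus a speaker classifier."""

    def __init__(self, n_speakers: int, emb_dim: int = 256, n_mels: int = 80, text_dim: int = 128):
        super().__init__()
        self.speaker_encoder = SpeakerEncoder(n_mels=n_mels, emb_dim=emb_dim)
        # Hypothetical stand-in for the Tacotron 2 decoder: consumes frame-aligned
        # text features concatenated with the speaker embedding and predicts mel frames.
        self.tts_decoder = nn.GRU(text_dim + emb_dim, n_mels, batch_first=True)
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, mel: torch.Tensor, text_feats: torch.Tensor):
        spk_emb = self.speaker_encoder(mel)                          # (batch, emb_dim)
        cond = spk_emb.unsqueeze(1).expand(-1, text_feats.size(1), -1)
        mel_pred, _ = self.tts_decoder(torch.cat([text_feats, cond], dim=-1))
        return mel_pred, self.classifier(spk_emb), spk_emb


def joint_loss(mel_pred, mel_target, spk_logits, spk_labels, alpha: float = 0.1):
    """TTS reconstruction loss plus an optional speaker-classification term."""
    recon = F.l1_loss(mel_pred, mel_target)        # self-supervised reconstruction objective
    cls = F.cross_entropy(spk_logits, spk_labels)  # drop this term for the purely unsupervised variant
    return recon + alpha * cls
```

After training, only the speaker encoder would be kept and its embeddings scored (e.g., with cosine similarity or a PLDA back-end) for speaker verification, mirroring the setup above in which the remaining TTS blocks are discarded.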
Related papers
- Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis [30.97784092953007]
This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition.
TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions.
This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition.
arXiv Detail & Related papers (2024-07-04T16:42:24Z)
- DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech [30.110058338155675]
Cross-lingual text-to-speech (CTTS) is still far from satisfactory as it is difficult to accurately retain the speaker timbres.
We propose a novel dual speaker embedding TTS (DSE-TTS) framework for CTTS with authentic speaking style.
By combining both embeddings, DSE-TTS significantly outperforms the state-of-the-art SANE-TTS in cross-lingual synthesis.
arXiv Detail & Related papers (2023-06-25T06:46:36Z)
- UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion [63.346825713704625]
Text-to-speech (TTS) and voice conversion (VC) are two different tasks that aim to generate high-quality speech from different input modalities.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
arXiv Detail & Related papers (2023-01-10T06:06:57Z)
- Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data [25.709370310448328]
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data.
We train the speaker-conditional diffusion model on large-scale untranscribed datasets for a classifier-free guidance method.
We demonstrate that Guided-TTS 2 shows performance comparable to high-quality single-speaker TTS baselines in terms of speech quality and speaker similarity with only ten seconds of untranscribed data.
arXiv Detail & Related papers (2022-05-30T18:30:20Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios [143.47967241972995]
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model the speaker characteristics systematically to improve the generalization on new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines in multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus [10.158584616360669]
Training a text-to-speech (TTS) model requires a large-scale text-labeled speech corpus.
We propose a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training.
arXiv Detail & Related papers (2022-03-29T11:26:56Z)
- Guided-TTS: Text-to-Speech with Untranscribed Speech [22.548875263927396]
We present Guided-TTS, a high-quality TTS model that learns to generate speech from untranscribed speech data.
For text-to-speech synthesis, we guide the generative process of the unconditional DDPM via phoneme classification to produce mel-spectrograms.
arXiv Detail & Related papers (2021-11-23T10:05:05Z)
- AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data [115.38309338462588]
We develop AdaSpeech 2, an adaptive TTS system that only leverages untranscribed speech data for adaptation.
Specifically, we introduce a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction.
In adaptation, we use untranscribed speech data for speech reconstruction and only fine-tune the TTS decoder.
arXiv Detail & Related papers (2021-04-20T01:53:30Z)
- MultiSpeech: Multi-Speaker Text to Speech with Transformer [145.56725956639232]
Transformer-based text-to-speech (TTS) models (e.g., Transformer TTS [Li et al., 2019], FastSpeech [Ren et al., 2019]) have shown advantages in training and inference efficiency over RNN-based models.
We develop a robust and high-quality multi-speaker Transformer TTS system called MultiSpeech, with several specially designed components/techniques to improve text-to-speech alignment.
arXiv Detail & Related papers (2020-06-08T15:05:28Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
arXiv Detail & Related papers (2020-05-16T15:47:11Z)