Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
- URL: http://arxiv.org/abs/2008.02950v1
- Date: Fri, 7 Aug 2020 02:03:27 GMT
- Title: Multi-speaker Text-to-speech Synthesis Using Deep Gaussian Processes
- Authors: Kentaro Mitsui, Tomoki Koriyama, Hiroshi Saruwatari
- Abstract summary: Multi-speaker speech synthesis is a technique for modeling multiple speakers' voices with a single model.
We propose a framework for multi-speaker speech synthesis using deep Gaussian processes (DGPs) and deep Gaussian process latent variable models (DGPLVMs).
- Score: 36.63589873242547
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Multi-speaker speech synthesis is a technique for modeling multiple speakers'
voices with a single model. Although many approaches using deep neural networks
(DNNs) have been proposed, DNNs are prone to overfitting when the amount of
training data is limited. We propose a framework for multi-speaker speech
synthesis using deep Gaussian processes (DGPs); a DGP is a deep architecture of
Bayesian kernel regressions and thus robust to overfitting. In this framework,
speaker information is fed to duration/acoustic models using speaker codes. We
also examine the use of deep Gaussian process latent variable models (DGPLVMs).
In this approach, the representation of each speaker is learned simultaneously
with other model parameters, and therefore the similarity or dissimilarity of
speakers is considered efficiently. We experimentally evaluated two situations
to investigate the effectiveness of the proposed methods. In one situation, the
amount of data from each speaker is balanced (speaker-balanced), and in the
other, the data from certain speakers are limited (speaker-imbalanced).
Subjective and objective evaluation results showed that both the DGP and the DGPLVM
synthesize multi-speaker speech more effectively than a DNN in the
speaker-balanced situation. We also found that the DGPLVM significantly
outperforms the DGP in the speaker-imbalanced situation.
Related papers
- DASA: Difficulty-Aware Semantic Augmentation for Speaker Verification [55.306583814017046]
We present a novel difficulty-aware semantic augmentation (DASA) approach for speaker verification.
DASA generates diversified training samples in speaker embedding space with negligible extra computing cost.
The best result achieves a 14.6% relative reduction in EER metric on CN-Celeb evaluation set.
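A hedged sketch of the general idea, with an illustrative weighting rule that is not the paper's exact recipe: perturb each speaker embedding along directions of intra-class variation, scaling the perturbation by a per-sample difficulty score.

```python
# Illustrative only: augment a speaker embedding in embedding space, with the
# perturbation strength scaled by a difficulty score in [0, 1].
import numpy as np

def difficulty_aware_augment(emb, class_cov, difficulty, strength=0.5, rng=None):
    rng = rng or np.random.default_rng()
    noise = rng.multivariate_normal(np.zeros(emb.shape[0]), class_cov)
    return emb + strength * difficulty * noise

emb = np.zeros(8)              # toy 8-dim speaker embedding
cov = 0.1 * np.eye(8)          # stand-in intra-class covariance
hard = difficulty_aware_augment(emb, cov, difficulty=0.9)
```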
arXiv Detail & Related papers (2023-10-18T17:07:05Z)
- Disentangling Voice and Content with Self-Supervision for Speaker Recognition [57.446013973449645]
This paper proposes a disentanglement framework that simultaneously models speaker traits and content variability in speech.
It is validated with experiments on the VoxCeleb and SITW datasets, yielding average reductions of 9.56% in EER and 8.24% in minDCF.
arXiv Detail & Related papers (2023-10-02T12:02:07Z)
- ECAPA-TDNN for Multi-speaker Text-to-speech Synthesis [13.676243543864347]
We propose an end-to-end method that is able to generate high-quality speech and better similarity for both seen and unseen speakers.
The method consists of three separately trained components: a speaker encoder based on the state-of-the-art ECAPA-TDNN architecture, a FastSpeech2-based synthesizer, and a HiFi-GAN vocoder.
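The three-stage design composes naturally at inference time. The sketch below uses toy stubs in place of the trained models; the shapes and function names are illustrative, not a published API.

```python
import numpy as np

# Toy stubs standing in for the three separately trained components.
def speaker_encoder(wav):              # ECAPA-TDNN: waveform -> embedding
    return np.zeros(192)

def synthesizer(phonemes, spk_emb):    # FastSpeech2-style: text -> mel
    return np.zeros((len(phonemes) * 8, 80))

def vocoder(mel):                      # HiFi-GAN: mel -> waveform
    return np.zeros(mel.shape[0] * 256)

def clone_voice(phonemes, reference_wav):
    """Zero-shot synthesis: condition on an unseen speaker's reference audio."""
    spk_emb = speaker_encoder(reference_wav)   # fixed-dim speaker embedding
    mel = synthesizer(phonemes, spk_emb)       # speaker-conditioned acoustics
    return vocoder(mel)

wav = clone_voice(list("hello"), np.zeros(16000))
```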
arXiv Detail & Related papers (2022-03-20T07:04:26Z)
- Speaker Embedding-aware Neural Diarization: a Novel Framework for Overlapped Speech Diarization in the Meeting Scenario [51.5031673695118]
We reformulate overlapped speech diarization as a single-label prediction problem.
We propose the speaker embedding-aware neural diarization (SEND) system.
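One common way to turn multi-label overlap into a single-label problem is power-set encoding; whether SEND uses exactly this scheme is not stated in the summary, so treat the sketch as illustrative.

```python
# Power-set encoding: each combination of simultaneously active speakers
# becomes one class, so every frame gets exactly one label.
from itertools import combinations

def powerset_classes(n_speakers, max_overlap=2):
    classes = [()]                                # class 0: silence
    for k in range(1, max_overlap + 1):
        classes.extend(combinations(range(n_speakers), k))
    return {subset: idx for idx, subset in enumerate(classes)}

label_of = powerset_classes(n_speakers=4)
print(label_of[(0, 2)])  # a frame where speakers 0 and 2 overlap -> one class id
```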
arXiv Detail & Related papers (2022-03-18T06:40:39Z)
- Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech [62.95422526044178]
We use Model Agnostic Meta-Learning (MAML) as the training algorithm of a multi-speaker TTS model.
We show that Meta-TTS can synthesize high speaker-similarity speech from few enrollment samples with fewer adaptation steps than the speaker adaptation baseline.
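A minimal first-order MAML loop on a toy regression task shows the training shape, with each task playing the role of one speaker's small enrollment set; the paper's actual model is a full TTS network.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                      # meta-learned initialization
inner_lr, outer_lr = 0.1, 0.01

def grad_mse(w, X, y):                   # gradient of mean((X @ w - y)**2)
    return 2.0 * X.T @ (X @ w - y) / len(y)

for _ in range(2000):
    w_true = rng.normal(size=2)          # a "task" = one simulated speaker
    X = rng.normal(size=(16, 2))
    y = X @ w_true
    # Inner loop: adapt to the speaker from a few support samples.
    w = theta - inner_lr * grad_mse(theta, X[:8], y[:8])
    # Outer loop (first-order approximation): improve the initialization.
    theta -= outer_lr * grad_mse(w, X[8:], y[8:])
```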
arXiv Detail & Related papers (2021-11-07T09:53:31Z)
- GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints [36.07346889498981]
We propose GC-TTS which achieves high-quality speaker adaptation with significantly improved speaker similarity.
A TTS model is pre-trained for base speakers with a sufficient amount of data, and then fine-tuned for novel speakers on a few minutes of data with two geometric constraints.
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
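The summary does not spell out the two constraints, so the loss below is a placeholder illustrating the general shape of fine-tuning with geometric terms on the speaker embedding, not the paper's actual objective.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def finetune_loss(recon, e_new, base_embs, alpha=0.1, beta=0.1):
    """recon: reconstruction loss on the few minutes of adaptation data;
    the two geometric terms below are illustrative placeholders."""
    centroid = base_embs.mean(axis=0)
    attract = 1.0 - cosine(e_new, centroid)  # stay near the base-speaker region
    repel = float(np.mean([max(0.0, cosine(e_new, e) - 0.8) for e in base_embs]))
    return recon + alpha * attract + beta * repel  # yet stay distinct per speaker
```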
arXiv Detail & Related papers (2021-08-16T04:25:31Z)
- GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis [6.632254395574993]
GANSpeech is a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model.
In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models.
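A sketch of the objective shape; the least-squares GAN formulation and the L1 reconstruction term are assumptions chosen for brevity, not necessarily the paper's exact losses.

```python
import numpy as np

def d_loss(d_real, d_fake):
    """Least-squares discriminator loss: push real scores to 1, fake to 0."""
    return np.mean((d_real - 1.0) ** 2) + np.mean(d_fake ** 2)

def g_loss(d_fake, mel_fake, mel_real, lam=10.0):
    """Generator loss: fool the discriminator, plus mel reconstruction."""
    adv = np.mean((d_fake - 1.0) ** 2)
    recon = np.mean(np.abs(mel_fake - mel_real))
    return adv + lam * recon
```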
arXiv Detail & Related papers (2021-06-29T08:15:30Z)
- Streaming Multi-speaker ASR with RNN-T [8.701566919381223]
This work focuses on multi-speaker speech recognition based on a recurrent neural network transducer (RNN-T).
We show that guiding separation with speaker order labels enhances the high-level speaker tracking capability of RNN-T.
Our best model achieves a WER of 10.2% on simulated 2-speaker Libri data, which is competitive with the previously reported state-of-the-art nonstreaming model (10.3%).
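The "speaker order labels" idea amounts to fixing a deterministic output order, e.g., by first speech onset; here is a tiny sketch of constructing such labels (the exact labeling in the paper may differ).

```python
# Order speakers by when they first speak, so each separated output channel
# has a deterministic speaker assignment during training.
utts = [("spk_b", 3.2, "hello there"), ("spk_a", 0.5, "good morning")]

onset = {}
for spk, start, _ in utts:
    onset[spk] = min(start, onset.get(spk, float("inf")))

order = sorted(onset, key=onset.get)   # ['spk_a', 'spk_b']: channel 0 speaks first
```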
arXiv Detail & Related papers (2020-11-23T19:10:40Z)
- DNN Speaker Tracking with Embeddings [0.0]
We propose a novel embedding-based speaker tracking method.
Our design is based on a convolutional neural network that mimics a typical PLDA-based speaker verification back-end.
To make the baseline system similar to speaker tracking, non-target speakers were added to the recordings.
arXiv Detail & Related papers (2020-07-13T18:40:14Z)
- Semi-supervised Learning for Multi-speaker Text-to-speech Synthesis Using Discrete Speech Representation [125.59372403631006]
We propose a semi-supervised learning approach for multi-speaker text-to-speech (TTS).
A multi-speaker TTS model can learn from the untranscribed audio via the proposed encoder-decoder framework with discrete speech representation.
We found the model can benefit from the proposed semi-supervised learning approach even when part of the unpaired speech data is noisy.
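The discrete representation can be pictured as vector quantization of encoder frames; the codebook size and dimensions below are toy values, and the paper's exact quantizer may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
codebook = rng.normal(size=(256, 64))   # 256 discrete codes, 64-dim each
frames = rng.normal(size=(100, 64))     # encoder outputs for untranscribed audio

# Nearest-codebook lookup: one discrete symbol per frame.
d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
codes = d2.argmin(axis=1)
quantized = codebook[codes]             # fed to the decoder for reconstruction
```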
arXiv Detail & Related papers (2020-05-16T15:47:11Z)