Improving multi-speaker TTS prosody variance with a residual encoder and
normalizing flows
- URL: http://arxiv.org/abs/2106.05762v1
- Date: Thu, 10 Jun 2021 14:08:42 GMT
- Title: Improving multi-speaker TTS prosody variance with a residual encoder and
normalizing flows
- Authors: Iván Vallés-Pérez, Julian Roth, Grzegorz Beringer, Roberto Barra-Chicote, Jasha Droppo
- Abstract summary: Disentanglement of speaker identity and prosody is crucial in text-to-speech systems to improve naturalness and produce more varied syntheses.
This paper proposes a new neural text-to-speech model that approaches the disentanglement problem by conditioning a Tacotron2-like architecture on flow-normalized speaker embeddings.
- Score: 9.515272632173884
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-to-speech systems have recently achieved quality almost
indistinguishable from human speech. However, the prosody of these systems is
generally flatter than that of natural speech, producing samples with low
expressiveness. Disentangling speaker identity from prosody is crucial in
text-to-speech systems to improve naturalness and produce more varied
syntheses. This paper proposes a new neural text-to-speech model that
approaches the disentanglement problem by conditioning a Tacotron2-like
architecture on flow-normalized speaker embeddings, and by substituting the
reference encoder with a new learned latent distribution responsible for
modeling the intra-sentence variability due to prosody. By removing the
reference-encoder dependency, the speaker-leakage problem that typically
affects such systems disappears, yielding more
distinctive syntheses at inference time. The new model achieves significantly
higher prosody variance than the baseline in a set of quantitative prosody
features, as well as higher speaker distinctiveness, without decreasing the
speaker intelligibility. Finally, we observe that the normalized speaker
embeddings enable much richer speaker interpolations, substantially improving
the distinctiveness of the new interpolated speakers.
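The paper itself ships no code, but the core mechanism is easy to sketch. Below is a minimal, illustrative PyTorch sketch of flow-normalized speaker embeddings: a RealNVP-style affine-coupling flow (an assumption; the abstract does not name the flow family) maps embeddings into a normalized space, two speakers are interpolated there, and the inverse flow maps the result back for the decoder. The 64-dimensional embeddings, layer sizes, and the omitted training objective (maximum likelihood under a standard normal prior) are all illustrative choices, not the authors' exact recipe.

```python
# Minimal sketch (not the authors' code): normalize speaker embeddings with an
# invertible affine-coupling flow, interpolate in the normalized space, invert.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One RealNVP-style coupling layer: half the dims condition the other half."""
    def __init__(self, dim):
        super().__init__()
        self.half = dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half, 64), nn.ReLU(),
            nn.Linear(64, 2 * (dim - self.half)),
        )

    def forward(self, x):
        xa, xb = x[:, :self.half], x[:, self.half:]
        log_s, t = self.net(xa).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)           # keep scales well-behaved
        zb = xb * torch.exp(log_s) + t      # invertible affine transform
        return torch.cat([xa, zb], dim=-1), log_s.sum(-1)

    def inverse(self, z):
        za, zb = z[:, :self.half], z[:, self.half:]
        log_s, t = self.net(za).chunk(2, dim=-1)
        log_s = torch.tanh(log_s)
        xb = (zb - t) * torch.exp(-log_s)
        return torch.cat([za, xb], dim=-1)

class SpeakerFlow(nn.Module):
    """Stack of couplings, trained (not shown) by maximizing likelihood so that
    flow(speaker_embedding) is approximately N(0, I)."""
    def __init__(self, dim, n_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(AffineCoupling(dim) for _ in range(n_layers))

    def forward(self, x):
        log_det = 0.0
        for layer in self.layers:
            x, ld = layer(x)
            log_det = log_det + ld
            x = x.flip(-1)                  # cheap fixed "permutation" between couplings
        return x, log_det

    def inverse(self, z):
        for layer in reversed(self.layers):
            z = z.flip(-1)
            z = layer.inverse(z)
        return z

flow = SpeakerFlow(dim=64)
e1, e2 = torch.randn(1, 64), torch.randn(1, 64)  # two speaker embeddings
z1, _ = flow(e1)
z2, _ = flow(e2)
z_mix = 0.5 * (z1 + z2)                  # interpolate in the normalized space
e_mix = flow.inverse(z_mix)              # back to embedding space for the decoder
```

The point of interpolating in z-space rather than raw embedding space is that the inverse flow maps the midpoint back onto the learned embedding manifold, which is plausibly what enables the richer speaker interpolations the abstract reports.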
Related papers
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Generalizable Zero-Shot Speaker Adaptive Speech Synthesis with Disentangled Representations
We propose a generalizable zero-shot speaker adaptive text-to-speech and voice conversion model.
GZS-TV introduces disentangled representation learning for speaker embedding extraction and timbre transformation.
Our experiments demonstrate that GZS-TV reduces performance degradation on unseen speakers and outperforms all baseline models on multiple datasets.
arXiv Detail & Related papers (2023-08-24T18:13:10Z)
- Controllable speech synthesis by learning discrete phoneme-level prosodic representations
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels.
We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
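As a rough illustration of the clustering step this entry describes (not the authors' code), per-phoneme (F0, duration) pairs can be discretized with k-means; the cluster count, toy features, and normalization below are assumptions.

```python
# Hedged sketch: turn continuous phoneme-level prosody into discrete labels.
# Feature extraction and phoneme alignment are assumed done upstream.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
phoneme_feats = np.column_stack([
    rng.normal(180, 40, size=1000),   # toy mean F0 per phoneme (Hz)
    rng.normal(90, 25, size=1000),    # toy duration per phoneme (ms)
])

# normalize per dimension so F0 and duration contribute comparably
feats = (phoneme_feats - phoneme_feats.mean(0)) / phoneme_feats.std(0)

kmeans = KMeans(n_clusters=8, n_init=10, random_state=0).fit(feats)
labels = kmeans.predict(feats)        # one discrete prosodic label per phoneme
```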
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios
We develop AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis.
We model speaker characteristics systematically to improve generalization to new speakers.
Without any fine-tuning, AdaSpeech 4 achieves better voice quality and similarity than baselines on multiple datasets.
arXiv Detail & Related papers (2022-04-01T13:47:44Z)
- Speaker Adaption with Intuitive Prosodic Features for Statistical Parametric Speech Synthesis
We propose a method of speaker adaption with intuitive prosodic features for statistical parametric speech synthesis.
The intuitive prosodic features are extracted at utterance-level or speaker-level, and are further integrated into the existing speaker-encoding-based and speaker-embedding-based adaptation frameworks respectively.
arXiv Detail & Related papers (2022-03-02T09:00:31Z)
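A hedged sketch of what utterance-level "intuitive" prosodic features could look like (these are common choices, not this paper's exact feature set), concatenated with a speaker embedding for conditioning:

```python
# Illustrative only: simple utterance-level prosody statistics appended to a
# speaker embedding so a synthesis model can condition on both.
import numpy as np

def utterance_prosody_features(f0_hz, frame_energy, n_phones, dur_sec):
    """F0 statistics, mean energy, and speaking rate for one utterance."""
    voiced = f0_hz[f0_hz > 0]                      # ignore unvoiced frames
    return np.array([
        np.log(voiced).mean(),                     # average log-F0
        np.log(voiced).std(),                      # log-F0 spread
        frame_energy.mean(),                       # loudness proxy
        n_phones / dur_sec,                        # speaking rate (phones/s)
    ])

f0 = np.abs(np.random.randn(500)) * 60 + 120       # toy F0 track
energy = np.random.rand(500)                       # toy frame energies
prosody = utterance_prosody_features(f0, energy, n_phones=42, dur_sec=3.2)
speaker_emb = np.random.randn(64)                  # embedding from any encoder
conditioning = np.concatenate([speaker_emb, prosody])  # decoder conditioning input
```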
- Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention
We propose a variant of an attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech.
Generalization to long utterances is realized using an energy-based attention mechanism known as Dynamic Convolution Attention.
We compare several implementations of voice cloning systems in terms of speech naturalness, speaker similarity, alignment consistency and ability to synthesize long utterances.
arXiv Detail & Related papers (2022-01-25T15:06:07Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition
Motivated by the spectro-temporal differences between disordered and normal speech, which systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates, and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of the speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
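The SVD subspace idea is simple to illustrate. A minimal sketch, assuming mel-spectrogram input and a small k (both assumptions, not the paper's exact recipe):

```python
# Hedged sketch: compress a spectrogram into its top-k spectral basis vectors,
# yielding a fixed-size spectro-temporal embedding for a downstream system.
import numpy as np

def spectral_subspace_embedding(spectrogram, k=4):
    """spectrogram: (n_mels, n_frames). Returns a flattened top-k basis."""
    # U holds spectral (frequency-axis) bases, Vt the temporal ones
    U, s, Vt = np.linalg.svd(spectrogram, full_matrices=False)
    return (U[:, :k] * s[:k]).flatten()   # scale bases by singular values

spec = np.abs(np.random.randn(80, 200))   # toy 80-mel spectrogram
emb = spectral_subspace_embedding(spec)   # 320-dim feature, length-independent
```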
- Continual Speaker Adaptation for Text-to-Speech Synthesis
In this paper, we look at TTS modeling from a continual learning perspective.
The goal is to add new speakers without forgetting previous speakers.
We exploit two well-known techniques for continual learning, namely experience replay and weight regularization.
arXiv Detail & Related papers (2021-03-26T15:14:20Z)
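Both techniques named above are standard and easy to sketch. A minimal, illustrative training step (the model, loss_fn, buffer, and lam are hypothetical placeholders, not the paper's code) that mixes a replayed old-speaker batch with an L2 pull toward previously learned weights:

```python
# Hedged sketch of experience replay + weight regularization for one update.
import random
import torch

def continual_step(model, loss_fn, new_batch, replay_buffer, old_params, lam=1e-3):
    """new_batch: (x, y) for the new speaker; replay_buffer: list of old (x, y)
    batches; old_params: detached copies of the weights after prior speakers."""
    batches = [new_batch] + (random.sample(replay_buffer, 1) if replay_buffer else [])
    loss = sum(loss_fn(model(x), y) for x, y in batches)
    # weight regularization: stay close to what was learned so far
    for p, p_old in zip(model.parameters(), old_params):
        loss = loss + lam * (p - p_old).pow(2).sum()
    loss.backward()
    return loss
```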
- Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit
A prosody learning mechanism is proposed to model the prosody of speech on top of a TTS system.
A novel self-attention structure, named local attention, is proposed to lift the restriction on input text length.
Experiments on English and Mandarin show that speech with more satisfactory prosody is obtained with our model.
arXiv Detail & Related papers (2020-08-13T02:54:50Z)
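As a rough illustration of the "local attention" idea above (the paper's exact structure may differ), self-attention can be restricted to a fixed window with a band mask, so the mechanism no longer depends on total input length:

```python
# Hedged sketch: scaled dot-product attention masked to a +/- window band.
import torch

def local_attention(q, k, v, window=16):
    """q, k, v: (batch, length, dim). Each position attends +/- window only."""
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    n = q.shape[1]
    idx = torch.arange(n)
    mask = (idx[None, :] - idx[:, None]).abs() > window   # True = blocked
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

x = torch.randn(2, 100, 32)
out = local_attention(x, x, x)   # (2, 100, 32), window-limited mixing only
```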