Predicting phoneme-level prosody latents using AR and flow-based Prior
Networks for expressive speech synthesis
- URL: http://arxiv.org/abs/2211.01327v1
- Date: Wed, 2 Nov 2022 17:45:01 GMT
- Title: Predicting phoneme-level prosody latents using AR and flow-based Prior
Networks for expressive speech synthesis
- Authors: Konstantinos Klapsas, Karolos Nikitaras, Nikolaos Ellinas, June Sig
Sung, Inchul Hwang, Spyros Raptis, Aimilios Chalamandaris, Pirros Tsiakoulis
- Abstract summary: We show that normalizing-flow-based prior networks can result in more expressive speech at the cost of a slight drop in quality.
We also propose a Dynamical VAE model, which can generate higher-quality speech, although with decreased expressiveness and variability compared to the flow-based models.
- Score: 3.6159128762538018
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A large part of the expressive speech synthesis literature focuses on
learning prosodic representations of the speech signal which are then modeled
by a prior distribution during inference. In this paper, we compare different
prior architectures at the task of predicting phoneme level prosodic
representations extracted with an unsupervised FVAE model. We use both
subjective and objective metrics to show that normalizing-flow-based prior
networks can result in more expressive speech at the cost of a slight drop in
quality. Furthermore, we show that the synthesized speech has higher
variability, for a given text, due to the nature of normalizing flows. We also
propose a Dynamical VAE model, which can generate higher-quality speech,
although with decreased expressiveness and variability compared to the
flow-based models.
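The paper's code is not reproduced here, but the setup lends itself to a short sketch. Below is a minimal, illustrative PyTorch implementation of a conditional normalizing-flow prior over per-phoneme prosody latents: affine coupling layers transform FVAE-extracted latents toward a standard normal, conditioned on phoneme encodings. All names and sizes (AffineCoupling, FlowPrior, latent_dim=8, cond_dim=64) are assumptions for illustration, not the authors' implementation.

```python
import math

import torch
import torch.nn as nn


class AffineCoupling(nn.Module):
    """One affine coupling layer: rescales and shifts half of the latent,
    conditioned on the other half plus the phoneme encoding."""

    def __init__(self, latent_dim, cond_dim, hidden_dim=128):
        super().__init__()
        self.half = latent_dim // 2
        self.net = nn.Sequential(
            nn.Linear(self.half + cond_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 2 * (latent_dim - self.half)),
        )

    def forward(self, z, cond):
        z_a, z_b = z[..., :self.half], z[..., self.half:]
        scale, shift = self.net(torch.cat([z_a, cond], dim=-1)).chunk(2, dim=-1)
        z_b = z_b * torch.exp(scale) + shift
        return torch.cat([z_a, z_b], dim=-1), scale.sum(dim=-1)

    def inverse(self, z, cond):
        z_a, z_b = z[..., :self.half], z[..., self.half:]
        scale, shift = self.net(torch.cat([z_a, cond], dim=-1)).chunk(2, dim=-1)
        return torch.cat([z_a, (z_b - shift) * torch.exp(-scale)], dim=-1)


class FlowPrior(nn.Module):
    """Stack of coupling layers mapping per-phoneme prosody latents to a
    standard-normal base distribution, conditioned on phoneme encodings."""

    def __init__(self, latent_dim=8, cond_dim=64, n_layers=4):
        super().__init__()
        self.latent_dim = latent_dim
        self.layers = nn.ModuleList(
            [AffineCoupling(latent_dim, cond_dim) for _ in range(n_layers)]
        )

    def log_prob(self, z, cond):
        # Training direction: transform latents toward the base distribution,
        # accumulating the log-determinant of each coupling layer.
        log_det = z.new_zeros(z.shape[:-1])
        for layer in self.layers:
            z, ld = layer(z, cond)
            z = z.flip(-1)  # cheap permutation so both halves get transformed
            log_det = log_det + ld
        base = -0.5 * (z ** 2).sum(-1) - 0.5 * self.latent_dim * math.log(2 * math.pi)
        return base + log_det  # maximized over FVAE-extracted latents

    @torch.no_grad()
    def sample(self, cond):
        # Inference direction: invert the flow on a fresh base sample,
        # so the same text can yield different prosody on each call.
        z = torch.randn(*cond.shape[:-1], self.latent_dim, device=cond.device)
        for layer in reversed(self.layers):
            z = layer.inverse(z.flip(-1), cond)
        return z
```

In a full system the cond tensor would come from the text encoder, and the sampled latents would replace the FVAE posterior latents at the decoder input. sample() starting from a different Gaussian draw each call is exactly the mechanism behind the higher per-text variability the abstract reports.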
Related papers
- Investigating Disentanglement in a Phoneme-level Speech Codec for Prosody Modeling [39.80957479349776]
We investigate the prosody modeling capabilities of the discrete space of an RVQ-VAE model, modifying it to operate at the phoneme level.
We show that the phoneme-level discrete latent representations achieve a high degree of disentanglement, capturing fine-grained prosodic information that is robust and transferable.
arXiv Detail & Related papers (2024-09-13T09:27:05Z)
- uSee: Unified Speech Enhancement and Editing with Conditional Diffusion Models [57.71199494492223]
We propose a Unified Speech Enhancement and Editing (uSee) model with conditional diffusion models that handles multiple tasks simultaneously in a generative manner.
Our experiments show that our proposed uSee model can achieve superior performance in both speech denoising and dereverberation compared to other related generative speech enhancement models.
arXiv Detail & Related papers (2023-10-02T04:36:39Z)
- A Discourse-level Multi-scale Prosodic Model for Fine-grained Emotion Analysis [19.271542595753267]
This paper explores predicting suitable prosodic features for fine-grained emotion analysis from the discourse-level text.
We propose a Discourse-level Multi-scale text Prosodic Model (D-MPM) that exploits multi-scale text to predict these two prosodic features.
arXiv Detail & Related papers (2023-09-21T07:45:44Z)
- Bridging Speech and Textual Pre-trained Models with Unsupervised ASR [70.61449720963235]
This work proposes a simple yet efficient unsupervised paradigm that connects speech and textual pre-trained models.
We show that unsupervised automatic speech recognition (ASR) can improve the representations from speech self-supervised models.
Notably, on spoken question answering, we reach state-of-the-art results on the challenging NMSQA benchmark.
arXiv Detail & Related papers (2022-11-06T04:50:37Z)
- A Systematic Comparison of Phonetic Aware Techniques for Speech Enhancement [20.329872147913584]
We compare different methods of incorporating phonetic information in a speech enhancement model.
We observe the influence of different phonetic content models as well as various feature-injection techniques on enhancement performance.
arXiv Detail & Related papers (2022-06-22T12:00:50Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- Discretization and Re-synthesis: an alternative method to solve the Cocktail Party Problem [65.25725367771075]
This study demonstrates, for the first time, that the synthesis-based approach can also perform well on this problem.
Specifically, we propose a novel speech separation/enhancement model based on the recognition of discrete symbols.
After the discrete symbol sequence is predicted, each target speech signal can be re-synthesized by feeding the symbols to the synthesis model.
arXiv Detail & Related papers (2021-12-17T08:35:40Z)
- Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings.
We propose a hierarchical architecture in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
- Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples; a toy sketch of such an autoregressive prior follows this list.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
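The last entry above pairs a quantized fine-grained VAE with an autoregressive prosody prior, and the main paper likewise compares AR prior networks against flows. As a rough, hypothetical sketch of that idea (the class name, sizes, and the LSTM choice are all assumptions, not either paper's implementation), an AR prior over discrete prosody codes can be written as:

```python
import torch
import torch.nn as nn


class ARProsodyPrior(nn.Module):
    """Autoregressive prior over discrete prosody codes: code t is
    predicted from codes < t plus the per-phoneme text encoding."""

    def __init__(self, n_codes=256, cond_dim=64, hidden_dim=128):
        super().__init__()
        self.bos = n_codes  # extra index used as a start-of-sequence token
        self.embed = nn.Embedding(n_codes + 1, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim + cond_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, n_codes)

    def forward(self, codes, cond):
        # Teacher forcing: shift the target codes right and prepend BOS.
        bos = codes.new_full((codes.shape[0], 1), self.bos)
        inp = self.embed(torch.cat([bos, codes[:, :-1]], dim=1))
        out, _ = self.rnn(torch.cat([inp, cond], dim=-1))
        return self.head(out)  # logits of shape (batch, T, n_codes)

    @torch.no_grad()
    def sample(self, cond, temperature=1.0):
        # Draw one code per phoneme, feeding each sample back in.
        batch, T, _ = cond.shape
        prev = torch.full((batch, 1), self.bos, dtype=torch.long, device=cond.device)
        state, out_codes = None, []
        for t in range(T):
            step = torch.cat([self.embed(prev), cond[:, t : t + 1]], dim=-1)
            h, state = self.rnn(step, state)
            probs = torch.softmax(self.head(h[:, -1]) / temperature, dim=-1)
            prev = torch.multinomial(probs, 1)
            out_codes.append(prev)
        return torch.cat(out_codes, dim=1)  # (batch, T) sampled code indices
```

forward() would be trained with cross-entropy against the quantized codes; sample() then draws one code per phoneme, with the temperature trading off prosodic diversity against typicality.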
This list is automatically generated from the titles and abstracts of the papers on this site.