Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis
- URL: http://arxiv.org/abs/2507.00227v1
- Date: Mon, 30 Jun 2025 19:52:32 GMT
- Title: Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis
- Authors: Paul Mayer, Florian Lux, Alejandro Pérez-González-de-Martos, Angelina Elizarova, Lindsey Vanderlyn, Dirk Väth, Ngoc Thang Vu
- Abstract summary: While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the inherent variability in human speech.
- Score: 59.68341193176134
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: While generative methods have progressed rapidly in recent years, generating expressive prosody for an utterance remains a challenging task in text-to-speech synthesis. This is particularly true for systems that model prosody explicitly through parameters such as pitch, energy, and duration, which is commonly done for the sake of interpretability and controllability. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. We compare these methods to a traditional deterministic baseline, as well as to real human realizations. Our extensive subjective and objective evaluations demonstrate that stochastic methods produce natural prosody on par with human speakers by capturing the variability inherent in human speech. Further, they open up additional controllability options by allowing the sampling temperature to be tuned.
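The paper models prosody explicitly through pitch, energy, and duration. As a rough, hypothetical sketch of how a flow-matching or rectified-flow sampler with a tunable temperature might generate such parameters, consider the numpy snippet below; the `velocity` function stands in for a trained network v_theta(x, t, cond) and is purely illustrative, as are the conditioning shapes.

```python
import numpy as np

def velocity(x, t, cond):
    """Stand-in for a trained velocity network v_theta(x, t, cond).

    A real conditional-flow-matching model learns this field; the linear
    drift toward `cond` below merely gives the ODE solver something to do.
    """
    return cond - x

def sample_prosody(cond, temperature=1.0, steps=50, seed=None):
    """Euler integration of dx/dt = velocity(x, t, cond) from t=0 to t=1.

    Flow matching and rectified flows transport Gaussian noise to data;
    scaling the initial noise by `temperature` trades variability for
    typicality, which is the controllability knob the abstract mentions.
    """
    rng = np.random.default_rng(seed)
    x = temperature * rng.standard_normal(cond.shape)  # x_0 ~ N(0, T^2 I)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt, cond)
    return x  # e.g. per-phone [pitch, energy, duration] values

# Hypothetical conditioning: 10 phones x 3 prosody dimensions.
cond = np.zeros((10, 3))
calm = sample_prosody(cond, temperature=0.3, seed=0)
lively = sample_prosody(cond, temperature=1.5, seed=0)
```

Lowering the temperature shrinks the initial noise and yields more average, stable prosody; raising it increases the variability that the abstract attributes to human speech.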
Related papers
- RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning [11.547937373256921]
A prompt-based text-to-speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Control is limited on the one hand to the acoustic features exposed to the model during training, and too flexible on the other: the same inputs yield uncontrollable variation that is reflected in the corpus statistics. We investigate a novel fine-tuning regime that addresses both issues at once by exploiting the uncontrollable variance of the model.
arXiv Detail & Related papers (2025-07-05T10:59:00Z) - Learning minimal representations of stochastic processes with variational autoencoders [52.99137594502433]
We introduce an unsupervised machine learning approach to determine the minimal set of parameters required to describe a stochastic process.
Our approach enables the autonomous discovery of unknown parameters describing such processes.
arXiv Detail & Related papers (2023-07-21T14:25:06Z) - Modeling Temporal Data as Continuous Functions with Stochastic Process Diffusion [2.2849153854336763]
Temporal data can be viewed as discretized measurements of an underlying function.
To build a generative model for such data, we have to model the stochastic process that governs it.
We propose a solution by defining the denoising diffusion model in the function space.
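To illustrate the core idea (the data are discretized measurements of a function at arbitrary time points), here is a minimal, assumption-laden sketch of a variance-preserving forward noising step; the noise schedule and the irregular grid are made up, and the denoiser that would be conditioned on the observation times is only described in a comment.

```python
import numpy as np

def forward_noise(x0, t, rng):
    """Variance-preserving forward step of a denoising diffusion model:
    x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, eps ~ N(0, I)."""
    alpha_bar = np.exp(-5.0 * t)  # made-up closed-form noise schedule
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar) * x0 + np.sqrt(1.0 - alpha_bar) * eps, eps

rng = np.random.default_rng(0)
ts = np.sort(rng.uniform(0.0, 1.0, size=32))  # irregular observation times
x0 = np.sin(2.0 * np.pi * ts)                 # measurements of the function
x_half, eps = forward_noise(x0, t=0.5, rng=rng)
# A function-space denoiser would be trained to predict `eps` from
# (x_half, t, ts); because it is conditioned on the observation times,
# one model covers any discretization of the underlying function.
```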
arXiv Detail & Related papers (2022-11-04T17:02:01Z) - Fast and efficient speech enhancement with variational autoencoders [0.0]
Unsupervised speech enhancement based on variational autoencoders has shown promising performance compared with the commonly used supervised methods.
We propose a new approach based on Langevin dynamics that generates multiple sequences of samples and includes total-variation-based regularization to incorporate temporal correlations between latent vectors.
Our experiments demonstrate that the developed framework makes an effective compromise between computational efficiency and enhancement quality, and outperforms existing methods.
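A minimal sketch of the sampling idea, assuming a known gradient of the log-density and a simple total-variation subgradient; the toy Gaussian target stands in for the VAE's latent posterior, which the paper's method would use instead.

```python
import numpy as np

def tv_subgrad(z):
    """Subgradient of the total variation sum_t |z[t] - z[t-1]| over time."""
    g = np.zeros_like(z)
    d = np.sign(z[1:] - z[:-1])
    g[1:] += d
    g[:-1] -= d
    return g

def langevin(z0, grad_log_p, steps=200, eps=1e-2, tv_weight=0.1, seed=None):
    """Unadjusted Langevin dynamics with a total-variation penalty that
    couples temporally adjacent latent vectors."""
    rng = np.random.default_rng(seed)
    z = z0.copy()
    for _ in range(steps):
        g = grad_log_p(z) - tv_weight * tv_subgrad(z)
        z = z + 0.5 * eps * g + np.sqrt(eps) * rng.standard_normal(z.shape)
    return z

# Toy target: standard Gaussian over a (time, dim) latent sequence.
z = langevin(np.zeros((50, 8)), lambda z: -z, seed=0)
```

The TV term penalizes large jumps between consecutive latent vectors, which is how the regularization encodes temporal correlation.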
arXiv Detail & Related papers (2022-11-02T09:52:13Z) - Zero-Shot Voice Conditioning for Denoising Diffusion TTS Models [95.97506031821217]
We present a novel way of conditioning a pretrained denoising diffusion speech model to produce speech in the voice of a novel person unseen during training.
The method requires a short (3 seconds) sample from the target person, and generation is steered at inference time, without any training steps.
arXiv Detail & Related papers (2022-06-05T19:45:29Z) - Locally Typical Sampling [84.62530743899025]
We show that today's probabilistic language generators fall short when it comes to producing coherent and fluent text. We propose a simple and efficient procedure, locally typical sampling, that enforces an information-theoretic typicality criterion (each word's information content should stay close to its expected value under the model) when generating from probabilistic models.
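A compact sketch of the procedure as described in the paper: rank tokens by how far their information content deviates from the distribution's entropy, keep the most typical set up to probability mass `tau`, renormalize, and sample. The logits below are made up for illustration.

```python
import numpy as np

def locally_typical_sample(logits, tau=0.95, seed=None):
    """Locally typical sampling: restrict sampling to the tokens whose
    information content -log p is closest to the entropy of p."""
    rng = np.random.default_rng(seed)
    m = logits.max()
    logp = logits - m - np.log(np.exp(logits - m).sum())  # stable log-softmax
    p = np.exp(logp)
    entropy = -(p * logp).sum()
    scores = np.abs(-logp - entropy)       # deviation from typicality
    order = np.argsort(scores)             # most typical tokens first
    cutoff = np.searchsorted(np.cumsum(p[order]), tau) + 1
    keep = order[:cutoff]                  # smallest set with mass >= tau
    return rng.choice(keep, p=p[keep] / p[keep].sum())

token = locally_typical_sample(np.array([2.0, 1.0, 0.5, -1.0]), tau=0.9, seed=0)
```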
arXiv Detail & Related papers (2022-02-01T18:58:45Z) - Leveraging Global Parameters for Flow-based Neural Posterior Estimation [90.21090932619695]
Inferring the parameters of a model based on experimental observations is central to the scientific method.
A particularly challenging setting is when the model is strongly indeterminate, i.e., when distinct sets of parameters yield identical observations.
We present a method for cracking such indeterminacy by exploiting additional information conveyed by an auxiliary set of observations sharing global parameters.
arXiv Detail & Related papers (2021-02-12T12:23:13Z) - Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space that can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
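To make the idea of an autoregressive prior over a discrete latent space concrete, here is a toy ancestral-sampling sketch; the first-order Markov transition table is invented and stands in for a learned autoregressive network, and the VAE decoder that would map codes to fine-grained prosody is omitted.

```python
import numpy as np

def sample_codes(n_steps, n_codes, trans, seed=None):
    """Ancestral sampling from a toy autoregressive (first-order Markov)
    prior over discrete latent codes: each code is drawn conditioned on
    the previous one, giving temporally coherent latent sequences."""
    rng = np.random.default_rng(seed)
    codes = [rng.integers(n_codes)]
    for _ in range(n_steps - 1):
        codes.append(rng.choice(n_codes, p=trans[codes[-1]]))
    return np.array(codes)

# Hypothetical 4-entry codebook with a smooth transition structure.
trans = np.array([[0.70, 0.20, 0.05, 0.05],
                  [0.20, 0.60, 0.15, 0.05],
                  [0.05, 0.15, 0.60, 0.20],
                  [0.05, 0.05, 0.20, 0.70]])
codes = sample_codes(20, 4, trans, seed=0)
# A decoder (not shown) would map `codes` to prosody features per segment.
```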
arXiv Detail & Related papers (2020-02-06T12:35:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.