Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
- URL: http://arxiv.org/abs/2106.08352v1
- Date: Tue, 15 Jun 2021 18:03:48 GMT
- Title: Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis
- Authors: Devang S Ram Mohan, Vivian Hu, Tian Huey Teh, Alexandra
Torresquintero, Christopher G. R. Wallis, Marlene Staib, Lorenzo Foglianti,
Jiameng Gao, Simon King
- Abstract summary: Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
- Score: 68.76620947298595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text does not fully specify the spoken form, so text-to-speech models must be
able to learn from speech data that vary in ways not explained by the
corresponding text. One way to reduce the amount of unexplained variation in
training data is to provide acoustic information as an additional learning
signal. When generating speech, modifying this acoustic information enables
multiple distinct renditions of a text to be produced.
Since much of the unexplained variation is in the prosody, we propose a model
that generates speech explicitly conditioned on the three primary acoustic
correlates of prosody: $F_{0}$, energy and duration. The model is flexible
about how the values of these features are specified: they can be externally
provided, or predicted from text, or predicted then subsequently modified.
Compared to a model that employs a variational auto-encoder to learn
unsupervised latent features, our model provides more interpretable,
temporally-precise, and disentangled control. When automatically predicting the
acoustic features from text, it generates speech that is more natural than that
from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop
modification of the predicted acoustic features can significantly further
increase naturalness.
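Since the three conditioning signals are ordinary acoustic measurements, a minimal sketch of how frame-level F0 and energy (plus a crude duration proxy) might be computed with librosa is given below. This is an illustration only, assuming standard pYIN pitch tracking and RMS energy; the paper's exact feature definitions and its phone-level durations (which would normally come from a forced aligner) are not specified here.

```python
import librosa
import numpy as np

def prosodic_features(wav_path, sr=22050, hop_length=256):
    """Frame-level F0 and energy; 'duration' here is just the frame count (sketch)."""
    y, sr = librosa.load(wav_path, sr=sr)

    # F0 with probabilistic YIN; unvoiced frames come back as NaN
    f0, voiced_flag, _ = librosa.pyin(
        y,
        fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C6"),
        sr=sr,
        hop_length=hop_length,
    )
    f0 = np.where(voiced_flag, f0, 0.0)  # zero out unvoiced frames

    # Energy as per-frame RMS
    energy = librosa.feature.rms(y=y, hop_length=hop_length)[0]

    n_frames = min(len(f0), len(energy))
    return f0[:n_frames], energy[:n_frames], n_frames
```

At synthesis time such values can be taken from a reference recording, predicted from text, or edited before decoding, which is what gives the temporally precise control described in the abstract.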
Related papers
- NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models [127.47252277138708]
We propose NaturalSpeech 3, a TTS system with factorized diffusion models to generate natural speech in a zero-shot way.
Specifically, we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details.
Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility.
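As a rough illustration of what "factorized vector quantization" means structurally, here is a toy PyTorch module that quantizes separate factor subspaces with separate codebooks. It is a generic sketch, not NaturalSpeech 3's actual codec; the factor names, dimensions, and codebook sizes are assumptions.

```python
import torch
import torch.nn as nn

class ToyFactorizedVQ(nn.Module):
    """Quantize each factor subspace with its own codebook (illustrative only)."""
    def __init__(self, dim=512, factors=("content", "prosody", "timbre", "detail"),
                 codebook_size=256):
        super().__init__()
        sub_dim = dim // len(factors)
        self.factors = factors
        self.proj = nn.ModuleDict({f: nn.Linear(dim, sub_dim) for f in factors})
        self.codebooks = nn.ModuleDict(
            {f: nn.Embedding(codebook_size, sub_dim) for f in factors})

    def forward(self, x):                                   # x: (batch, frames, dim)
        quantized, codes = [], {}
        for f in self.factors:
            z = self.proj[f](x)                              # project into this factor's subspace
            dist = (z.unsqueeze(-2) - self.codebooks[f].weight).pow(2).sum(-1)
            idx = dist.argmin(dim=-1)                        # nearest codeword per frame
            q = self.codebooks[f](idx)
            quantized.append(z + (q - z).detach())           # straight-through estimator
            codes[f] = idx
        return torch.cat(quantized, dim=-1), codes
```

In practice, disentanglement also depends on how each subspace is supervised and conditioned during training, not just on the codebook layout.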
arXiv Detail & Related papers (2024-03-05T16:35:25Z)
- Natural language guidance of high-fidelity text-to-speech with synthetic annotations [13.642358232817342]
We propose a scalable method for labeling various aspects of speaker identity, style, and recording conditions.
We then apply this method to a 45k hour dataset, which we use to train a speech language model.
Our results demonstrate high-fidelity speech generation in a diverse range of accents, prosodic styles, channel conditions, and acoustic conditions.
arXiv Detail & Related papers (2024-02-02T21:29:34Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- Learning and controlling the source-filter representation of speech with a variational autoencoder [23.05989605017053]
In speech processing, the source-filter model considers that speech signals are produced from a few independent and physically meaningful continuous latent factors.
We propose a method to accurately and independently control the source-filter speech factors within the latent subspaces.
This yields a deep generative model of speech spectrograms without requiring additional information such as text or human-labeled data.
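The control operation this builds on is simply an edit of the latent code along a learned direction; a generic sketch follows (the direction and step size here are illustrative, not the paper's learned subspaces):

```python
import numpy as np

def move_along_factor(z, direction, alpha):
    """Shift a VAE latent code along one factor direction (generic sketch).

    z         -- latent vector from the encoder
    direction -- vector spanning e.g. an F0 or formant subspace (assumed learned elsewhere)
    alpha     -- how far to move along that direction
    """
    direction = direction / np.linalg.norm(direction)
    return z + alpha * direction
```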
arXiv Detail & Related papers (2022-04-14T16:13:06Z)
- WAVPROMPT: Towards Few-Shot Spoken Language Understanding with Frozen Language Models [57.557319372969495]
Large-scale auto-regressive language models pretrained on massive text have demonstrated their impressive ability to perform new natural language tasks.
Recent studies further show that such a few-shot learning ability can be extended to the text-image setting by training an encoder to encode the images into embeddings.
We propose a novel speech understanding framework, WavPrompt, where we finetune a wav2vec model to generate a sequence of audio embeddings understood by the language model.
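Read this way, the recipe resembles prefix-tuning with audio: speech embeddings are projected into the frozen language model's embedding space and prepended to the text tokens. A rough sketch with Hugging Face components follows; the checkpoint names and linear projection are assumptions, not the paper's exact setup.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, Wav2Vec2Model

class AudioPrefixLM(nn.Module):
    """Prepend projected speech embeddings to a frozen language model (sketch)."""
    def __init__(self):
        super().__init__()
        self.speech_encoder = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")
        self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
        for p in self.lm.parameters():                       # keep the LM frozen
            p.requires_grad = False
        self.proj = nn.Linear(self.speech_encoder.config.hidden_size,
                              self.lm.config.n_embd)

    def forward(self, waveform, prompt_ids):
        # waveform: (batch, samples), prompt_ids: (batch, text_tokens)
        audio = self.speech_encoder(waveform).last_hidden_state
        audio = self.proj(audio)                             # map into the LM embedding space
        text = self.lm.transformer.wte(prompt_ids)           # token embeddings of the prompt
        inputs = torch.cat([audio, text], dim=1)             # audio acts as a soft prefix
        return self.lm(inputs_embeds=inputs)
```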
arXiv Detail & Related papers (2022-03-29T19:08:55Z)
- ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
- FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis [6.509758931804479]
We propose a feed-forward Transformer-based TTS model designed according to the source-filter theory.
FastPitchFormant has a unique structure that handles text and acoustic features in parallel.
arXiv Detail & Related papers (2021-06-29T07:06:42Z)
- Controllable neural text-to-speech synthesis using intuitive prosodic features [3.709803838880226]
We train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions.
Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles.
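A rough sketch of how such sentence-wise values could be summarized from frame-level features is below (spectral tilt and phone durations are omitted because they need extra spectral analysis or alignment; the exact definitions in the paper may differ):

```python
import numpy as np

def sentence_prosody_summary(f0, energy):
    """Collapse frame-level F0/energy into sentence-level conditioning values (sketch)."""
    voiced = f0[f0 > 0]  # keep only voiced frames
    return {
        "mean_pitch": float(voiced.mean()) if voiced.size else 0.0,
        "pitch_range": float(np.percentile(voiced, 95) - np.percentile(voiced, 5))
        if voiced.size else 0.0,
        "mean_energy": float(energy.mean()),
    }
```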
arXiv Detail & Related papers (2020-09-14T22:37:44Z)
- End-to-End Adversarial Text-to-Speech [33.01223309795122]
We learn to synthesise speech from normalised text or phonemes in an end-to-end manner.
Our proposed generator is feed-forward and thus efficient for both training and inference.
It learns to produce high fidelity audio through a combination of adversarial feedback and prediction losses.
arXiv Detail & Related papers (2020-06-05T17:41:05Z)