Prosody-controllable spontaneous TTS with neural HMMs
- URL: http://arxiv.org/abs/2211.13533v2
- Date: Thu, 1 Jun 2023 10:51:23 GMT
- Title: Prosody-controllable spontaneous TTS with neural HMMs
- Authors: Harm Lameris, Shivam Mehta, Gustav Eje Henter, Joakim Gustafson, Éva Székely
- Abstract summary: We propose a TTS architecture that can rapidly learn to speak from small and irregular datasets.
We add utterance-level prosody control to an existing neural HMM-based TTS system.
We evaluate the system's capability of synthesizing two types of creaky voice.
- Score: 11.472325158964646
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Spontaneous speech has many affective and pragmatic functions that are
interesting and challenging to model in TTS. However, the presence of reduced
articulation, fillers, repetitions, and other disfluencies in spontaneous
speech makes the text and acoustics less aligned than in read speech, which is
problematic for attention-based TTS. We propose a TTS architecture that can
rapidly learn to speak from small and irregular datasets, while also
reproducing the diversity of expressive phenomena present in spontaneous
speech. Specifically, we add utterance-level prosody control to an existing
neural HMM-based TTS system which is capable of stable, monotonic alignments
for spontaneous speech. We objectively evaluate control accuracy and perform
perceptual tests that demonstrate that prosody control does not degrade
synthesis quality. To exemplify the power of combining prosody control and
ecologically valid data for reproducing intricate spontaneous speech phenomena,
we evaluate the system's capability of synthesizing two types of creaky voice.
Audio samples are available at
https://www.speech.kth.se/tts-demos/prosodic-hmm/
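Utterance-level prosody control of this kind is typically implemented by conditioning the acoustic model on a small vector of per-utterance prosody statistics. As a minimal illustrative sketch (the specific features, their normalization, and the function names are assumptions for illustration, not the paper's exact recipe), such a control vector might be extracted from frame-level acoustics like this:

```python
import numpy as np

def utterance_prosody_vector(f0_hz, energy_db, voiced_mask):
    """Collapse frame-level acoustics into one control vector per utterance.

    f0_hz:       per-frame fundamental-frequency estimates (Hz)
    energy_db:   per-frame log energy (dB)
    voiced_mask: boolean array, True where the frame is voiced
    Returns [mean log-f0, log-f0 range] over voiced frames plus
    mean energy over all frames.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    en = np.asarray(energy_db, dtype=float)
    voiced = np.asarray(voiced_mask, dtype=bool)
    log_f0 = np.log(f0[voiced])            # pitch is perceived on a log scale
    return np.array([
        log_f0.mean(),                     # overall pitch level
        log_f0.max() - log_f0.min(),       # pitch range
        en.mean(),                         # overall loudness
    ])

def standardize(vec, train_mean, train_std):
    """Z-score the control vector against training-set statistics so that
    shifting one standardized dimension steers that prosodic property."""
    return (vec - train_mean) / train_std
```

At synthesis time, the standardized vector would be concatenated to the model's input at every step; moving one dimension away from zero (e.g. raising mean log-f0) then biases the output toward the corresponding prosodic change.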
Related papers
- Spontaneous Style Text-to-Speech Synthesis with Controllable Spontaneous Behaviors Based on Language Models [55.898594710420326]
We propose a novel spontaneous speech synthesis system based on language models.
Fine-grained prosody modeling is introduced to enhance the model's ability to capture subtle prosody variations in spontaneous speech.
arXiv Detail & Related papers (2024-07-18T13:42:38Z)
- Diff-TTSG: Denoising probabilistic integrated speech and gesture synthesis [19.35266496960533]
We present the first diffusion-based probabilistic model, called Diff-TTSG, that jointly learns to synthesise speech and gestures together.
We describe a set of careful uni- and multi-modal subjective tests for evaluating integrated speech and gesture synthesis systems.
arXiv Detail & Related papers (2023-06-15T18:02:49Z)
- ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model.
It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label.
Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in prosody/timbre similarity, synthesis robustness, and voice quality in a zero-shot setting.
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech [94.64927912924087]
We train TTS systems using real-world speech from YouTube and podcasts.
The proposed TTS architecture is designed for multiple-code generation and monotonic alignment.
We show that it outperforms existing TTS systems in several objective and subjective measures.
arXiv Detail & Related papers (2023-02-08T17:34:32Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Hierarchical prosody modeling and control in non-autoregressive parallel neural TTS [7.531331499935223]
We train a non-autoregressive parallel neural TTS model hierarchically conditioned on coarse and fine-grained acoustic speech features.
Experiments show that a non-autoregressive TTS model hierarchically conditioned on utterance-wise pitch, pitch range, duration, energy, and spectral tilt can effectively control each prosodic dimension.
arXiv Detail & Related papers (2021-10-06T17:58:42Z)
- AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style [111.89762723159677]
We develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech.
AdaSpeech 3 synthesizes speech with natural filled pauses (FPs) and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
arXiv Detail & Related papers (2021-07-06T10:40:45Z)
- Controllable neural text-to-speech synthesis using intuitive prosodic features [3.709803838880226]
We train a sequence-to-sequence neural network conditioned on acoustic speech features to learn a latent prosody space with intuitive and meaningful dimensions.
Experiments show that a model conditioned on sentence-wise pitch, pitch range, phone duration, energy, and spectral tilt can effectively control each prosodic dimension and generate a wide variety of speaking styles.
arXiv Detail & Related papers (2020-09-14T22:37:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.