ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
- URL: http://arxiv.org/abs/2412.11795v2
- Date: Thu, 19 Dec 2024 15:21:44 GMT
- Title: ProsodyFM: Unsupervised Phrasing and Intonation Control for Intelligible Speech Synthesis
- Authors: Xiangheng He, Junjie Chen, Zixing Zhang, Björn W. Schuller
- Abstract summary: Prosody contains rich information beyond the literal meaning of words.
Current models fall short in phrasing and intonation.
We propose ProsodyFM, a prosody-aware text-to-speech model.
- Score: 49.299344959162056
- Abstract: Prosody contains rich information beyond the literal meaning of words, which is crucial for the intelligibility of speech. Current models still fall short in phrasing and intonation; they not only miss or misplace breaks when synthesizing long sentences with complex structures but also produce unnatural intonation. We propose ProsodyFM, a prosody-aware text-to-speech synthesis (TTS) model with a flow-matching (FM) backbone that aims to enhance the phrasing and intonation aspects of prosody. ProsodyFM introduces two key components: a Phrase Break Encoder to capture initial phrase break locations, followed by a Duration Predictor for the flexible adjustment of break durations; and a Terminal Intonation Encoder which learns a bank of intonation shape tokens combined with a novel Pitch Processor for more robust modeling of human-perceived intonation change. ProsodyFM is trained with no explicit prosodic labels and yet can uncover a broad spectrum of break durations and intonation patterns. Experimental results demonstrate that ProsodyFM can effectively improve the phrasing and intonation aspects of prosody, thereby enhancing the overall intelligibility compared to four state-of-the-art (SOTA) models. Out-of-distribution experiments show that this prosody improvement can further bring ProsodyFM superior generalizability for unseen complex sentences and speakers. Our case study intuitively illustrates the powerful and fine-grained controllability of ProsodyFM over phrasing and intonation.
Related papers
- High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
Non-autoregressive framework enhances controllability, and duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- A unified one-shot prosody and speaker conversion system with self-supervised discrete speech units [94.64927912924087]
Existing systems ignore the correlation between prosody and language content, leading to degradation of naturalness in converted speech.
We devise a cascaded modular system leveraging self-supervised discrete speech units as language representation.
Experiments show that our system outperforms previous approaches in naturalness, intelligibility, speaker transferability, and prosody transferability.
arXiv Detail & Related papers (2022-11-12T00:54:09Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Fine-grained Noise Control for Multispeaker Speech Synthesis [3.449700218265025]
A text-to-speech (TTS) model typically factorizes speech attributes such as content, speaker and prosody into disentangled representations.
Recent works aim to additionally model the acoustic conditions explicitly, in order to disentangle the primary speech factors.
arXiv Detail & Related papers (2022-04-11T13:13:55Z)
- Improving Prosody for Unseen Texts in Speech Synthesis by Utilizing Linguistic Information and Noisy Data [20.132799566988826]
We propose to combine a fine-tuned BERT-based front-end with a pre-trained FastSpeech2-based acoustic model to improve prosody modeling.
Experimental results show that both the fine-tuned BERT model and the pre-trained FastSpeech 2 model can improve prosody, especially for structurally complex sentences.
arXiv Detail & Related papers (2021-11-15T05:58:29Z)
- Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
arXiv Detail & Related papers (2021-06-15T18:03:48Z)
- Improving Prosody Modelling with Cross-Utterance BERT Embeddings for End-to-end Speech Synthesis [39.869097209615724]
Cross-utterance (CU) context vectors are produced by an additional CU encoder based on the sentence embeddings extracted by a pre-trained BERT model.
It is also found that the prosody can be controlled indirectly by changing the neighbouring sentences.
arXiv Detail & Related papers (2020-11-06T10:03:11Z)
- Prosody Learning Mechanism for Speech Synthesis System Without Text Length Limit [39.258370942013165]
A prosody learning mechanism is proposed to model the prosody of speech based on a TTS system.
A novel self-attention structure, named local attention, is proposed to lift the restriction on input text length.
Experiments on English and Mandarin show that our model produces speech with more satisfactory prosody.
arXiv Detail & Related papers (2020-08-13T02:54:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.