ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
- URL: http://arxiv.org/abs/2508.09389v1
- Date: Tue, 12 Aug 2025 23:12:18 GMT
- Title: ProMode: A Speech Prosody Model Conditioned on Acoustic and Textual Inputs
- Authors: Eray Eren, Qingju Liu, Hyeongwoo Kim, Pablo Garrido, Abeer Alwan
- Abstract summary: Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text to prosodic features such as F0 and energy. We also integrate these predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prosody conveys rich emotional and semantic information of the speech signal as well as individual idiosyncrasies. We propose a stand-alone model that maps text to prosodic features such as F0 and energy and can be used in downstream tasks such as TTS. The ProMode encoder takes as input acoustic features and time-aligned textual content, both of which are partially masked, and produces a fixed-length latent prosodic embedding. The decoder predicts acoustics in the masked region using both the encoded prosody input and the unmasked textual content. Trained on the GigaSpeech dataset, we compare our method with state-of-the-art style encoders. For F0 and energy prediction, we show consistent improvements for our model at different levels of granularity. We also integrate the predicted prosodic features into a TTS system and conduct perceptual tests, which show higher prosody preference compared to the baselines, demonstrating the model's potential in tasks where prosody modeling is important.
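The abstract's training setup (mask part of both input streams, encode to a fixed-length prosody embedding, decode acoustics only in the masked region) can be sketched in a few lines of numpy. All dimensions are toy values chosen for illustration, and plain linear projections stand in for the paper's actual encoder and decoder networks; this is a shape-level sketch of the masking scheme, not the ProMode implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration; not from the paper):
# frames, acoustic dims (e.g. F0, energy), text dim, latent dim.
T, D_ac, D_tx, D_lat = 20, 2, 8, 16

acoustics = rng.normal(size=(T, D_ac))   # per-frame acoustic features
text_emb = rng.normal(size=(T, D_tx))    # time-aligned textual embeddings

# Partially mask a contiguous region of both input streams.
mask = np.zeros(T, dtype=bool)
mask[8:14] = True
ac_in = np.where(mask[:, None], 0.0, acoustics)
tx_in = np.where(mask[:, None], 0.0, text_emb)

# "Encoder": project the masked streams, then mean-pool over time to
# obtain a fixed-length latent prosodic embedding.
W_enc = rng.normal(size=(D_ac + D_tx, D_lat)) * 0.1
latent = np.concatenate([ac_in, tx_in], axis=1) @ W_enc
prosody_emb = latent.mean(axis=0)        # fixed length regardless of T

# "Decoder": predict acoustics from the prosody embedding plus the
# *unmasked* textual content, as in the paper's training setup.
W_dec = rng.normal(size=(D_lat + D_tx, D_ac)) * 0.1
dec_in = np.concatenate([np.tile(prosody_emb, (T, 1)), text_emb], axis=1)
pred = dec_in @ W_dec

# The reconstruction loss is computed only over the masked frames.
loss = np.mean((pred[mask] - acoustics[mask]) ** 2)
print(prosody_emb.shape, pred.shape)
```

Because the pooled embedding has a fixed length, it can be consumed by a downstream TTS system independently of the utterance duration, which is what makes the encoder usable as a stand-alone prosody representation.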
Related papers
- Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis [57.5830191022097]
A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. We adopt a two-stage training scheme: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines.
arXiv Detail & Related papers (2025-11-07T17:07:56Z) - DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech [8.537791317883576]
Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. We introduce DiFlow-TTS, the first model to explore purely Discrete Flow Matching for speech synthesis.
arXiv Detail & Related papers (2025-09-11T17:16:52Z) - CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens [49.569695524535454]
We propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder.
Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis.
arXiv Detail & Related papers (2024-07-07T15:16:19Z) - DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors [8.419383213705789]
We introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. We find that DiT with minimal modifications outperforms U-Net, and that variable-length modeling with a speech length predictor and conditions such as semantic alignment in speech latent representations are key to further enhancement.
arXiv Detail & Related papers (2024-06-17T11:25:57Z) - Style Description based Text-to-Speech with Conditional Prosodic Layer Normalization based Diffusion GAN [17.876323494898536]
We present a Diffusion GAN-based approach (Prosodic Diff-TTS) that generates high-fidelity speech from a style description and content text within only 4 denoising steps.
We demonstrate the efficacy of our proposed architecture on the multi-speaker LibriTTS and PromptSpeech datasets, using multiple quantitative metrics that measure generation accuracy, as well as MOS.
arXiv Detail & Related papers (2023-10-27T14:28:41Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations. A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - Minimally-Supervised Speech Synthesis with Conditional Diffusion Model and Language Model: A Comparative Study of Semantic Coding [57.42429912884543]
We propose Diff-LM-Speech, Tetra-Diff-Speech and Tri-Diff-Speech to solve high dimensionality and waveform distortion problems.
We also introduce a prompt encoder structure based on a variational autoencoder and a prosody bottleneck to improve prompt representation ability.
Experimental results show that our proposed methods outperform baseline methods.
arXiv Detail & Related papers (2023-07-28T11:20:23Z) - Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS, speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z) - SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z) - Unsupervised TTS Acoustic Modeling for TTS with Conditional Disentangled Sequential VAE [36.50265124324876]
We propose a novel unsupervised text-to-speech acoustic model training scheme, named UTTS, which does not require text-audio pairs.
The framework offers a flexible choice of a speaker's duration model, timbre feature (identity) and content for TTS inference.
Experiments demonstrate that UTTS can synthesize speech of high naturalness and intelligibility measured by human and objective evaluations.
arXiv Detail & Related papers (2022-06-06T11:51:22Z) - ProsoSpeech: Enhancing Prosody With Quantized Vector Pre-training in Text-to-Speech [96.0009517132463]
We introduce a word-level prosody encoder, which quantizes the low-frequency band of the speech and compresses prosody attributes into a latent prosody vector (LPV).
We then introduce an LPV predictor, which predicts the LPV given the word sequence, and fine-tune it on a high-quality TTS dataset.
Experimental results show that ProsoSpeech can generate speech with richer prosody compared with baseline methods.
arXiv Detail & Related papers (2022-02-16T01:42:32Z)
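The quantization step in the ProsoSpeech summary above (compressing word-level prosody into a latent prosody vector via a codebook) amounts to nearest-neighbor lookup against learned code vectors. The following is a minimal numpy sketch of that lookup with assumed toy sizes and random stand-in values; the actual ProsoSpeech codebook is learned jointly with the encoder, which this sketch does not model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy sizes (assumed for illustration): K codebook entries of dimension D,
# and one prosody vector per word.
K, D, n_words = 8, 4, 5
codebook = rng.normal(size=(K, D))

# Word-level prosody vectors extracted from the low-frequency band
# of the speech (random stand-ins here).
word_prosody = rng.normal(size=(n_words, D))

# Quantization: each word's vector snaps to its nearest codebook entry,
# yielding a discrete code and the corresponding latent prosody vector (LPV).
dists = np.linalg.norm(word_prosody[:, None, :] - codebook[None, :, :], axis=-1)
codes = dists.argmin(axis=1)
lpv = codebook[codes]

print(codes.shape, lpv.shape)
```

Because each word reduces to a discrete code index, a separate predictor can be trained to output codes from the word sequence alone, which is what the summary's "LPV predictor" refers to.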
This list is automatically generated from the titles and abstracts of the papers on this site.