Adaptive Duration Model for Text Speech Alignment
- URL: http://arxiv.org/abs/2507.22612v1
- Date: Wed, 30 Jul 2025 12:31:11 GMT
- Title: Adaptive Duration Model for Text Speech Alignment
- Authors: Junjie Cao
- Abstract summary: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. We propose a novel duration prediction framework that produces a promising phoneme-level duration distribution for given text.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. Autoregressive TTS models typically use an attention mechanism to learn these alignments on-line. However, these alignments tend to be brittle and often fail to generalize to long utterances and out-of-domain text, leading to missing or repeating words. Most non-autoregressive end-to-end TTS models rely on durations extracted from external sources, using additional duration models for alignment. In this paper, we propose a novel duration prediction framework that produces a promising phoneme-level duration distribution for given text. In our experiments, the proposed duration model shows more precise prediction and better condition-adaptation ability than previous baseline models. Numerically, it achieves roughly an 11.3% improvement in alignment accuracy and makes zero-shot TTS models more robust to mismatch between the prompt audio and the input audio.
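To illustrate the core idea of predicting a phoneme-level duration *distribution* rather than a single point estimate, here is a minimal sketch. All names, the lookup-table design, and the log-normal parameterization are illustrative assumptions, not the paper's actual model:

```python
# Hedged sketch: a duration model that returns a distribution per phoneme
# (mean/std of log-duration) instead of one fixed frame count.
# The class names, table entries, and defaults are hypothetical.
import math
from dataclasses import dataclass

@dataclass
class DurationDist:
    mean: float  # mean of log-duration (in frames)
    std: float   # standard deviation of log-duration

class DurationModel:
    """Maps phonemes to duration distributions; a real model would condition
    on context (neighboring phonemes, prosody, speaker prompt, etc.)."""
    def __init__(self, table):
        self.table = table  # phoneme -> DurationDist
        self.fallback = DurationDist(math.log(5.0), 0.3)  # assumed default

    def predict(self, phonemes):
        return [self.table.get(p, self.fallback) for p in phonemes]

    def expected_frames(self, phonemes):
        # For X ~ N(mu, sigma^2), E[exp(X)] = exp(mu + sigma^2 / 2),
        # i.e. the mean of the implied log-normal duration.
        return [math.exp(d.mean + d.std ** 2 / 2) for d in self.predict(phonemes)]

table = {"AH": DurationDist(math.log(6.0), 0.2),
         "T": DurationDist(math.log(3.0), 0.4)}
model = DurationModel(table)
frames = model.expected_frames(["T", "AH", "T"])
```

Keeping a per-phoneme variance is what lets a downstream sampler or aligner adapt durations to the conditioning signal instead of committing to one value.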
Related papers
- Pseudo-Autoregressive Neural Codec Language Models for Efficient Zero-Shot Text-to-Speech Synthesis [64.12708207721276]
We introduce a novel pseudo-autoregressive (PAR) language modeling approach that unifies AR and NAR modeling. Building on PAR, we propose PALLE, a two-stage TTS system that leverages PAR for initial generation followed by NAR refinement. Experiments demonstrate that PALLE, trained on LibriTTS, outperforms state-of-the-art systems trained on large-scale data.
arXiv Detail & Related papers (2025-04-14T16:03:21Z)
- MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces MegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm. Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z)
- Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration [13.713209707407712]
We propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment.
arXiv Detail & Related papers (2024-12-11T05:39:12Z)
- SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z)
- DiTTo-TTS: Diffusion Transformers for Scalable Text-to-Speech without Domain-Specific Factors [8.419383213705789]
We introduce DiTTo-TTS, a Diffusion Transformer (DiT)-based TTS model, to investigate whether LDM-based TTS can achieve state-of-the-art performance without domain-specific factors. We find that DiT with minimal modifications outperforms U-Net, and that variable-length modeling with a speech length predictor, along with conditions such as semantic alignment in speech latent representations, are key to further enhancement.
arXiv Detail & Related papers (2024-06-17T11:25:57Z)
- Text Injection for Neural Contextual Biasing [57.589903308622745]
This work proposes contextual text injection (CTI) to enhance contextual ASR.
CTI with 100 billion text sentences can achieve up to 43.3% relative WER reduction from a strong neural biasing model.
arXiv Detail & Related papers (2024-06-05T04:20:17Z)
- Mega-TTS: Zero-Shot Text-to-Speech at Scale with Intrinsic Inductive Bias [71.94109664001952]
Mega-TTS is a novel zero-shot TTS system that is trained with large-scale wild data.
We show that Mega-TTS surpasses state-of-the-art TTS systems on zero-shot TTS speech editing, and cross-lingual TTS tasks.
arXiv Detail & Related papers (2023-06-06T08:54:49Z)
- Duration-aware pause insertion using pre-trained language model for multi-speaker text-to-speech [40.65850332919397]
We propose more powerful pause insertion frameworks based on a pre-trained language model.
Our approach uses bidirectional encoder representations from transformers (BERT) pre-trained on a large-scale text corpus.
We also leverage duration-aware pause insertion for more natural multi-speaker TTS.
arXiv Detail & Related papers (2023-02-27T10:40:41Z)
- ParaTTS: Learning Linguistic and Prosodic Cross-sentence Information in Paragraph-based TTS [19.988974534582205]
We propose to model linguistic and prosodic information by considering cross-sentence, embedded structure in training.
We trained the model on a storytelling audiobook corpus (4.08 hours) recorded by a female Mandarin Chinese speaker.
The proposed TTS model can produce natural, good-quality speech at the paragraph level.
arXiv Detail & Related papers (2022-09-14T08:34:16Z)
- A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z)
- One TTS Alignment To Rule Them All [26.355019468082247]
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models.
In this paper we leverage the alignment mechanism proposed in RAD-TTS as a generic alignment learning framework.
The framework combines the forward-sum algorithm, the Viterbi algorithm, and a simple and efficient static alignment prior.
arXiv Detail & Related papers (2021-08-23T23:45:48Z)
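The forward-sum and Viterbi components mentioned in the entry above can be sketched for monotonic text-to-frame alignment. This is a simplified, hedged illustration, not the exact RAD-TTS formulation; the transition rule (each frame either stays on the current phoneme or advances by one) and all function names are assumptions:

```python
# Hedged sketch: forward-sum over all monotonic alignments (soft likelihood)
# and a Viterbi pass extracting per-phoneme durations from the best path.
import math

NEG_INF = float("-inf")

def logaddexp(a, b):
    # Numerically stable log(exp(a) + exp(b)).
    if a == NEG_INF: return b
    if b == NEG_INF: return a
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def forward_sum(log_probs):
    """log_probs[i][t] = log P(frame t | phoneme i). Sums over all monotonic
    paths from (phoneme 0, frame 0) to (last phoneme, last frame)."""
    n, T = len(log_probs), len(log_probs[0])
    alpha = [[NEG_INF] * T for _ in range(n)]
    alpha[0][0] = log_probs[0][0]
    for t in range(1, T):
        for i in range(n):
            stay = alpha[i][t - 1]
            advance = alpha[i - 1][t - 1] if i > 0 else NEG_INF
            alpha[i][t] = logaddexp(stay, advance) + log_probs[i][t]
    return alpha[n - 1][T - 1]  # total log-likelihood of all valid alignments

def viterbi_durations(log_probs):
    """Frame count per phoneme along the single most likely monotonic path."""
    n, T = len(log_probs), len(log_probs[0])
    delta = [[NEG_INF] * T for _ in range(n)]
    back = [[0] * T for _ in range(n)]
    delta[0][0] = log_probs[0][0]
    for t in range(1, T):
        for i in range(n):
            stay = delta[i][t - 1]
            advance = delta[i - 1][t - 1] if i > 0 else NEG_INF
            if advance > stay:
                delta[i][t], back[i][t] = advance + log_probs[i][t], i - 1
            else:
                delta[i][t], back[i][t] = stay + log_probs[i][t], i
    durations, i = [0] * n, n - 1
    for t in range(T - 1, -1, -1):  # backtrace from the final cell
        durations[i] += 1
        i = back[i][t]
    return durations
```

Training against the forward-sum likelihood keeps alignment learning soft and differentiable, while the Viterbi path yields the hard durations a non-autoregressive TTS decoder consumes; a static prior (omitted here) would bias `log_probs` toward near-diagonal alignments.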
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences.