Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration
- URL: http://arxiv.org/abs/2412.08112v1
- Date: Wed, 11 Dec 2024 05:39:12 GMT
- Title: Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration
- Authors: Haowei Lou, Helen Paik, Wen Hu, Lina Yao,
- Abstract summary: We propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment.
- Score: 13.713209707407712
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in text-to-speech (TTS) systems, such as FastSpeech and StyleSpeech, have significantly improved speech generation quality. However, these models often rely on durations generated by external tools like the Montreal Forced Aligner, which can be time-consuming and lack flexibility. The importance of accurate durations is often underestimated, despite their crucial role in achieving natural prosody and intelligibility. To address these limitations, we propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. This approach reduces dependence on external tools and enhances alignment accuracy. We further explore the impact of different acoustic features, including Mel-Spectrograms, MFCCs, and latent features, on TTS model performance. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment. These findings highlight the effectiveness of our approach in optimizing TTS systems for more natural and intelligible speech generation.
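The core idea behind aligner-guided duration labelling can be illustrated with a minimal sketch: once a trained aligner has assigned each acoustic frame to a phoneme, per-phoneme duration labels are simply frame counts. The function and toy alignment below are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: deriving phoneme duration labels from a trained
# aligner's frame-level alignment. The alignment values are illustrative.

def durations_from_alignment(frame_to_phoneme, num_phonemes):
    """Count how many acoustic frames the aligner assigned to each phoneme."""
    durations = [0] * num_phonemes
    for ph in frame_to_phoneme:
        durations[ph] += 1
    return durations

# Toy alignment: 10 frames mapped onto 3 phonemes by a (hypothetical) aligner.
alignment = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
print(durations_from_alignment(alignment, 3))  # → [3, 4, 3]
```

In the paradigm the abstract describes, these duration labels would then supervise the TTS model's duration predictor in place of Montreal Forced Aligner outputs.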
Related papers
- Test-Time Training for Speech Enhancement [2.9598903898834497]
This paper introduces a novel application of Test-Time Training (TTT) for Speech Enhancement. It addresses the challenges posed by unpredictable noise conditions and domain shifts. We show consistent improvements across speech quality metrics, outperforming the baseline model.
arXiv Detail & Related papers (2025-08-03T17:02:55Z) - Adaptive Duration Model for Text Speech Alignment [1.157734347781473]
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. We propose a novel duration prediction framework that can produce a promising phoneme-level duration distribution for given text.
arXiv Detail & Related papers (2025-07-30T12:31:11Z) - Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models [19.852233854729235]
Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality.
arXiv Detail & Related papers (2025-06-01T04:33:37Z) - DMOSpeech: Direct Metric Optimization via Distilled Diffusion Model in Zero-Shot Speech Synthesis [12.310318928818546]
We introduce DMOSpeech, a distilled diffusion-based TTS model that achieves both faster inference and superior performance compared to its teacher model.
Our comprehensive experiments, validated through extensive human evaluation, show significant improvements in naturalness, intelligibility, and speaker similarity while reducing inference time by orders of magnitude.
This work establishes a new framework for aligning speech synthesis with human auditory preferences through direct metric optimization.
arXiv Detail & Related papers (2024-10-14T21:17:58Z) - Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs).
We present a simple yet effective automatic process for creating speech-text pair data.
Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z) - DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech [43.45691362372739]
We propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS).
DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties.
Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
arXiv Detail & Related papers (2024-09-18T09:36:55Z) - SimpleSpeech 2: Towards Simple and Efficient Text-to-Speech with Flow-based Scalar Latent Transformer Diffusion Models [64.40250409933752]
We build upon our previous publication by implementing a simple and efficient non-autoregressive (NAR) TTS framework, termed SimpleSpeech 2.
SimpleSpeech 2 effectively combines the strengths of both autoregressive (AR) and non-autoregressive (NAR) methods.
We show a significant improvement in generation performance and generation speed compared to our previous work and other state-of-the-art (SOTA) large-scale TTS models.
arXiv Detail & Related papers (2024-08-25T17:07:39Z) - Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose an effective pruning method for transformers, known as sparse attention, to improve the TTS model's generalization abilities.
We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
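The differentiable pruning idea can be sketched in miniature: if attention scores are gated by a sigmoid of their distance to a threshold, the threshold itself receives gradients and can be learned. The function names, the temperature value, and the toy scores below are assumptions for illustration, not the paper's exact formulation.

```python
import math

# Hedged sketch of differentiable pruning with a learnable threshold:
# scores below the threshold are softly driven toward zero by a sigmoid
# gate, which keeps the operation differentiable in the threshold.

def soft_prune(scores, threshold, temperature=0.05):
    """Gate each score by sigmoid((score - threshold) / temperature)."""
    def sigmoid(x):
        return 1.0 / (1.0 + math.exp(-x))
    return [s * sigmoid((s - threshold) / temperature) for s in scores]

scores = [0.9, 0.05, 0.4]
pruned = soft_prune(scores, threshold=0.3)
# scores well below the threshold are pushed close to zero;
# scores well above it pass through almost unchanged
```

A hard step function in place of the sigmoid would prune the same entries but block gradient flow to the threshold, which is why a smooth gate (or a similar relaxation) is typically used.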
arXiv Detail & Related papers (2023-08-28T21:25:05Z) - Weakly-supervised forced alignment of disfluent speech using phoneme-level modeling [10.283092375534311]
We propose a simple and effective modification of alignment graph construction using weighted Finite State Transducers.
The proposed weakly-supervised approach alleviates the need for verbatim transcription of speech disfluencies for forced alignment.
Our evaluation on a corrupted version of the TIMIT test set and the UCLASS dataset shows significant improvements.
arXiv Detail & Related papers (2023-05-30T09:57:36Z) - Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS).
Resource-efficient adaptation from a frozen pre-trained TTS model is achieved using only 0.8% to 1.2% of the original trainable parameters.
Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z) - PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
arXiv Detail & Related papers (2023-02-16T05:17:06Z) - TAPLoss: A Temporal Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We provide a differentiable estimator for four categories of low-level acoustic descriptors involving: frequency-related parameters, energy or amplitude-related parameters, spectral balance parameters, and temporal features.
We show that adding TAP as an auxiliary objective in speech enhancement produces speech with improved perceptual quality and intelligibility.
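Using an acoustic-parameter term as an auxiliary objective follows a standard pattern: the primary enhancement loss is combined with a weighted secondary term. The sketch below shows only that combination; the weight and the toy loss values are assumptions, and the actual TAP estimator is a learned, differentiable model.

```python
# Illustrative only: combining a primary enhancement loss with a temporal
# acoustic parameter (TAP) term as an auxiliary objective, in the spirit
# of TAPLoss. The weight 0.1 is an assumed hyperparameter.

def combined_loss(enhancement_loss, tap_loss, weight=0.1):
    """Total objective = primary loss + weighted auxiliary acoustic term."""
    return enhancement_loss + weight * tap_loss

print(combined_loss(0.5, 0.2))  # 0.5 + 0.1 * 0.2
```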
arXiv Detail & Related papers (2023-02-16T04:57:11Z) - Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model.
It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech.
It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z) - A study on the efficacy of model pre-training in developing neural text-to-speech system [55.947807261757056]
This study aims to understand better why and how model pre-training can positively contribute to TTS system performance.
It is found that the TTS system could achieve comparable performance when the pre-training data is reduced to 1/8 of its original size.
arXiv Detail & Related papers (2021-10-08T02:09:28Z) - Voice2Series: Reprogramming Acoustic Models for Time Series Classification [65.94154001167608]
Voice2Series is a novel end-to-end approach that reprograms acoustic models for time series classification.
We show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%.
arXiv Detail & Related papers (2021-06-17T07:59:15Z) - Time-domain Speech Enhancement with Generative Adversarial Learning [53.74228907273269]
This paper proposes a new framework called Time-domain Speech Enhancement Generative Adversarial Network (TSEGAN).
TSEGAN is an extension of the generative adversarial network (GAN) in time-domain with metric evaluation to mitigate the scaling problem.
In addition, we provide a new method based on objective function mapping for the theoretical analysis of the performance of Metric GAN.
arXiv Detail & Related papers (2021-03-30T08:09:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.