Related papers: Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models

URL: http://arxiv.org/abs/2510.13293v1
Date: Wed, 15 Oct 2025 08:37:16 GMT
Title: Mismatch Aware Guidance for Robust Emotion Control in Auto-Regressive TTS Models
Authors: Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma,
Abstract summary: Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment. This paper proposes an adaptive CFG scheme that adjusts to different levels of the detected mismatch. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model.
Score: 17.057147378617422
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: While Text-to-Speech (TTS) systems can achieve fine-grained control over emotional expression via natural language prompts, a significant challenge emerges when the desired emotion (style prompt) conflicts with the semantic content of the text. This mismatch often results in unnatural-sounding speech, undermining the goal of achieving fine-grained emotional control. Classifier-Free Guidance (CFG) is a key technique for enhancing prompt alignment; however, its application to auto-regressive (AR) TTS models remains underexplored, which can lead to degraded audio quality. This paper directly addresses the challenge of style-content mismatch in AR TTS models by proposing an adaptive CFG scheme that adjusts to different levels of the detected mismatch, as measured using large language models or natural language inference models. This solution is based on a comprehensive analysis of CFG's impact on emotional expressiveness in state-of-the-art AR TTS models. Our results demonstrate that the proposed adaptive CFG scheme improves the emotional expressiveness of the AR TTS model while maintaining audio quality and intelligibility.
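
To make the idea in the abstract concrete, the following is a minimal sketch of classifier-free guidance applied to next-token logits, with the guidance scale chosen from a detected mismatch level. It is an illustration under stated assumptions, not the authors' implementation: the names `GUIDANCE_BY_MISMATCH`, `cfg_logits`, and `adaptive_cfg_logits` are hypothetical, the three mismatch levels and their scale values are arbitrary placeholders, and the abstract does not say whether a larger mismatch should raise or lower the scale. Only the interpolation formula `uncond + scale * (cond - uncond)` is the standard CFG rule.

```python
from typing import Dict, List, Sequence

# Hypothetical lookup from a detected mismatch level to a guidance scale.
# The levels and values are placeholders for illustration only; they are
# not taken from the paper.
GUIDANCE_BY_MISMATCH: Dict[str, float] = {"low": 1.0, "medium": 1.5, "high": 2.0}


def cfg_logits(cond: Sequence[float], uncond: Sequence[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def adaptive_cfg_logits(
    cond: Sequence[float],
    uncond: Sequence[float],
    mismatch_level: str,
    table: Dict[str, float] = GUIDANCE_BY_MISMATCH,
) -> List[float]:
    """Pick the guidance scale from the mismatch level, then apply CFG."""
    scale = table[mismatch_level]  # raises KeyError for an unknown level
    return cfg_logits(cond, uncond, scale)
```

In this sketch, `cond` and `uncond` stand for the logits an AR TTS model would produce with and without the style prompt at one decoding step, and `mismatch_level` stands for the output of a separate detector (the abstract mentions large language models or natural language inference models for that role). A scale of 1.0 reproduces the conditional logits unchanged.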

Related papers

CoCoEmo: Composable and Controllable Human-Like Emotional TTS via Activation Steering [25.10244503397448]
Emotional expression in human speech is nuanced and compositional, often involving multiple, conflicting affective cues. Most expressive text-to-speech systems enforce a single utterance-level emotion, collapsing affective diversity and suppressing mixed or text-emotion-misaligned expression. This paper introduces a quantitative, controllable steering framework and multi-rater evaluation protocols that enable composable mixed-emotion synthesis and reliable text-emotion mismatch synthesis.
arXiv Detail & Related papers (2026-02-03T11:45:00Z)
IndexTTS2: A Breakthrough in Emotionally Expressive and Duration-Controlled Auto-Regressive Zero-Shot Text-to-Speech [13.3878636941081]
IndexTTS2 is a novel, general, and autoregressive model-friendly method for speech duration control. It achieves disentanglement between emotional expression and speaker identity, enabling independent control over timbre and emotion. It outperforms state-of-the-art zero-shot TTS models in terms of word error rate, speaker similarity, and emotional fidelity.
arXiv Detail & Related papers (2025-06-23T08:33:40Z)
SUTA-LM: Bridging Test-Time Adaptation and Language Model Rescoring for Robust ASR [58.31068047426522]
Test-Time Adaptation (TTA) aims to mitigate domain mismatch by adjusting models during inference. Recent work explores combining TTA with external language models, using techniques like beam search rescoring or generative error correction. We propose SUTA-LM, a simple yet effective extension of SUTA, with language model rescoring. Experiments on 18 diverse ASR datasets show that SUTA-LM achieves robust results across a wide range of domains.
arXiv Detail & Related papers (2025-06-10T02:50:20Z)
Towards Emotionally Consistent Text-Based Speech Editing: Introducing EmoCorrector and The ECD-TSE Dataset [52.95197015472105]
EmoCorrector is a novel post-correction scheme for text-based speech editing. It extracts the edited text's emotional features, retrieves speech samples with matching emotions, and synthesizes speech that aligns with the desired emotion. EmoCorrector significantly enhances the expression of intended emotion while addressing emotion inconsistency limitations in current TSE methods.
arXiv Detail & Related papers (2025-05-24T16:10:56Z)
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces MegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm. Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z)
PROEMO: Prompt-Driven Text-to-Speech Synthesis Based on Emotion and Intensity Control [20.873353104077857]
We introduce an approach centered on prompt-based emotion control. The proposed architecture incorporates emotion and intensity control across multi-speakers. We leverage large language models (LLMs) to manipulate speech prosody while preserving linguistic content.
arXiv Detail & Related papers (2025-01-10T12:10:30Z)
StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [13.713209707407712]
StyleSpeech is a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech. Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features. LoRA allows efficient adaptation of style features in pre-trained models.
arXiv Detail & Related papers (2024-08-27T00:37:07Z)
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis [33.909582975045545]
We propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model.
arXiv Detail & Related papers (2024-06-04T06:43:34Z)
UMETTS: A Unified Framework for Emotional Text-to-Speech Synthesis with Multimodal Prompts [64.02363948840333]
UMETTS is a novel framework that leverages emotional cues from multiple modalities to generate highly expressive and emotionally resonant speech. EP-Align employs contrastive learning to align emotional features across text, audio, and visual modalities, ensuring a coherent fusion of multimodal information. EMI-TTS integrates the aligned emotional embeddings with state-of-the-art TTS models to synthesize speech that accurately reflects the intended emotions.
arXiv Detail & Related papers (2024-04-29T03:19:39Z)
A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness [0.0]
The paper identifies the challenges encountered when working with a VAE-based TTS model and evaluates different image-to-image methods for altering latent speech features. Our results offer valuable insights into the complexities of adding expressiveness control to TTS systems and open avenues for future research in this direction.
arXiv Detail & Related papers (2023-11-17T13:07:00Z)
ZET-Speech: Zero-shot adaptive Emotion-controllable Text-to-Speech Synthesis with Diffusion and Style-based Models [83.07390037152963]
ZET-Speech is a zero-shot adaptive emotion-controllable TTS model. It allows users to synthesize any speaker's emotional speech using only a short, neutral speech segment and the target emotion label. Experimental results demonstrate that ZET-Speech successfully synthesizes natural and emotional speech with the desired emotion for both seen and unseen speakers.
arXiv Detail & Related papers (2023-05-23T08:52:00Z)
Any-speaker Adaptive Text-To-Speech Synthesis with Diffusion Models [65.28001444321465]
Grad-StyleSpeech is an any-speaker adaptive TTS framework based on a diffusion model. It can generate highly natural speech with extremely high similarity to target speakers' voice, given a few seconds of reference speech. It significantly outperforms speaker-adaptive TTS baselines on English benchmarks.
arXiv Detail & Related papers (2022-11-17T07:17:24Z)
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings. We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.