Related papers: Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models

URL: http://arxiv.org/abs/2506.00832v1
Date: Sun, 01 Jun 2025 04:33:37 GMT
Title: Counterfactual Activation Editing for Post-hoc Prosody and Mispronunciation Correction in TTS Models
Authors: Kyowoon Lee, Artyom Stitsyuk, Gunu Jho, Inchul Hwang, Jaesik Choi
Abstract summary: Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality.
Score: 19.852233854729235
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Recent advances in Text-to-Speech (TTS) have significantly improved speech naturalness, increasing the demand for precise prosody control and mispronunciation correction. Existing approaches for prosody manipulation often depend on specialized modules or additional training, limiting their capacity for post-hoc adjustments. Similarly, traditional mispronunciation correction relies on grapheme-to-phoneme dictionaries, making it less practical in low-resource settings. We introduce Counterfactual Activation Editing, a model-agnostic method that manipulates internal representations in a pre-trained TTS model to achieve post-hoc control of prosody and pronunciation. Experimental results show that our method effectively adjusts prosodic features and corrects mispronunciations while preserving synthesis quality. This opens the door to inference-time refinement of TTS outputs without retraining, bridging the gap between pre-trained TTS models and editable speech synthesis.

Related papers

SonoEdit: Null-Space Constrained Knowledge Editing for Pronunciation Correction in LLM-Based TTS [1.392548092257887]
We introduce SonoEdit, a model editing technique that surgically corrects pronunciation errors in pre-trained TTS models without retraining. Instead of costly finetuning or explicit phoneme injection, we propose a parsimonious alternative based on Null-Space Pronunciation Editing.
arXiv Detail & Related papers (2026-01-23T08:40:49Z)
Adaptive Duration Model for Text Speech Alignment [2.594813802197567]
Speech-to-text alignment is a critical component of neural text-to-speech (TTS) models. We propose a novel duration prediction framework that can produce a promising phoneme-level duration distribution for a given text.
arXiv Detail & Related papers (2025-07-30T12:31:11Z)
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces MegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm. Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space. Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z)
Aligner-Guided Training Paradigm: Advancing Text-to-Speech Models with Aligner Guided Duration [13.713209707407712]
We propose a novel Aligner-Guided Training Paradigm that prioritizes accurate duration labelling by training an aligner before the TTS model. Our experimental results show that aligner-guided duration labelling can achieve up to a 16% improvement in word error rate and significantly enhance phoneme and tone alignment.
arXiv Detail & Related papers (2024-12-11T05:39:12Z)
Phonetic Enhanced Language Modeling for Text-to-Speech Synthesis [33.909582975045545]
We propose a phonetic enhanced language modeling method to improve the performance of TTS models. We leverage self-supervised representations that are phonetically rich as the training target for the autoregressive language model.
arXiv Detail & Related papers (2024-06-04T06:43:34Z)
A Study on Altering the Latent Space of Pretrained Text to Speech Models for Improved Expressiveness [0.0]
The paper identifies the challenges encountered when working with a VAE-based TTS model and evaluates different image-to-image methods for altering latent speech features. Our results offer valuable insights into the complexities of adding expressiveness control to TTS systems and open avenues for future research in this direction.
arXiv Detail & Related papers (2023-11-17T13:07:00Z)
Parameter-Efficient Learning for Text-to-Speech Accent Adaptation [58.356667204518985]
This paper presents a parameter-efficient learning (PEL) approach to low-resource accent adaptation for text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed by using only 1.2% to 0.8% of original trainable parameters. Experiment results show that the proposed methods can achieve competitive naturalness with parameter-efficient decoder fine-tuning.
arXiv Detail & Related papers (2023-05-18T22:02:59Z)
ParrotTTS: Text-to-Speech synthesis by exploiting self-supervised representations [27.157701195636477]
ParrotTTS is a modularized text-to-speech synthesis model. It can train a multi-speaker variant effectively using transcripts from a single speaker. It adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone.
arXiv Detail & Related papers (2023-03-01T17:23:12Z)
Revisiting End-to-End Speech-to-Text Translation From Scratch [48.203394370942505]
End-to-end (E2E) speech-to-text translation (ST) often depends on pretraining its encoder and/or decoder using source transcripts via speech recognition or text translation tasks. In this paper, we explore the extent to which the quality of E2E ST trained on speech-translation pairs alone can be improved.
arXiv Detail & Related papers (2022-06-09T15:39:19Z)
An Exploration of Prompt Tuning on Generative Spoken Language Model for Speech Processing Tasks [112.1942546460814]
We report the first exploration of the prompt tuning paradigm for speech processing tasks based on Generative Spoken Language Model (GSLM). Experiment results show that the prompt tuning technique achieves competitive performance in speech classification tasks with fewer trainable parameters than fine-tuning specialized downstream models.
arXiv Detail & Related papers (2022-03-31T03:26:55Z)
EdiTTS: Score-based Editing for Controllable Text-to-Speech [9.34612743192798]
EdiTTS is an off-the-shelf speech editing methodology based on score-based generative modeling for text-to-speech synthesis. We apply coarse yet deliberate perturbations in the Gaussian prior space to induce desired behavior from the diffusion model. Listening tests demonstrate that EdiTTS is capable of reliably generating natural-sounding audio that satisfies user-imposed requirements.
arXiv Detail & Related papers (2021-10-06T08:51:10Z)
Hierarchical Prosody Modeling for Non-Autoregressive Speech Synthesis [76.39883780990489]
We analyze the behavior of non-autoregressive TTS models under different prosody-modeling settings. We propose a hierarchical architecture, in which the prediction of phoneme-level prosody features is conditioned on the word-level prosody features.
arXiv Detail & Related papers (2020-11-12T16:16:41Z)
Improving Text Generation with Student-Forcing Optimal Transport [122.11881937642401]
We propose using optimal transport (OT) to match the sequences generated in training and testing modes. An extension is also proposed to improve the OT learning, based on the structural and contextual information of the text sequences. The effectiveness of the proposed method is validated on machine translation, text summarization, and text generation tasks.
arXiv Detail & Related papers (2020-10-12T19:42:25Z)
Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody. We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR). We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)

This list is automatically generated from the titles and abstracts of the papers on this site.