RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning
- URL: http://arxiv.org/abs/2507.08012v1
- Date: Sat, 05 Jul 2025 10:59:00 GMT
- Title: RepeaTTS: Towards Feature Discovery through Repeated Fine-Tuning
- Authors: Atli Sigurgeirsson, Simon King
- Abstract summary: A Prompt-based Text-To-Speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Such approaches are constrained on one hand: control is limited to acoustic features exposed to the model during training; and too flexible on the other: the same inputs yield uncontrollable variation that is reflected in the corpus statistics. We investigate a novel fine-tuning regime that addresses both issues at once by exploiting the uncontrollable variance of the model.
- Score: 11.547937373256921
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: A Prompt-based Text-To-Speech model allows a user to control different aspects of speech, such as speaking rate and perceived gender, through natural language instruction. Although user-friendly, such approaches are on one hand constrained: control is limited to acoustic features exposed to the model during training, and too flexible on the other: the same inputs yield uncontrollable variation that is reflected in the corpus statistics. We investigate a novel fine-tuning regime to address both of these issues at the same time by exploiting the uncontrollable variance of the model. Through principal component analysis of thousands of synthesised samples, we determine latent features that account for the highest proportion of the output variance and incorporate them as new labels for secondary fine-tuning. We evaluate the proposed methods on two models trained on an expressive Icelandic speech corpus, one with emotional disclosure and one without. In the case of the model without emotional disclosure, the method yields both continuous and discrete features that improve overall controllability of the model.
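As a rough illustration of the feature-discovery step described above, the sketch below runs PCA over a matrix of acoustic statistics measured from samples synthesised with identical inputs and returns the top components as candidate labels for secondary fine-tuning. The feature set and helper names are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def discover_latent_labels(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """features: (n_samples, n_features) acoustic statistics (e.g. mean F0,
    energy, speaking rate) measured from samples synthesised with identical
    inputs. Returns per-sample coordinates on the top principal components."""
    scaled = StandardScaler().fit_transform(features)
    pca = PCA(n_components=n_components)
    scores = pca.fit_transform(scaled)                   # new continuous labels
    print("explained variance ratio:", pca.explained_variance_ratio_)
    return scores

# Hypothetical example: 5000 synthesised samples, 8 measured acoustic features
rng = np.random.default_rng(0)
labels = discover_latent_labels(rng.normal(size=(5000, 8)))
```

The leading components could be kept as continuous labels or binned into discrete categories, matching the continuous and discrete features reported in the abstract.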
Related papers
- Investigating Stochastic Methods for Prosody Modeling in Speech Synthesis [59.68341193176134]
Generative methods have progressed rapidly in recent years, but generating expressive prosody for an utterance remains a challenging task. In this work, we investigate the effectiveness of stochastic methods for this task, including Normalizing Flows, Conditional Flow Matching, and Rectified Flows. Our extensive subjective and objective evaluations demonstrate that these methods produce natural prosody on par with human speakers by capturing the inherent variability in human speech.
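For context, a conditional flow matching training step with straight-line interpolation paths (equivalently, a rectified-flow objective) might look like the minimal sketch below; the `VelocityNet` model and the prosody/conditioning shapes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class VelocityNet(nn.Module):
    """Toy velocity field v_theta(x_t, t, cond); an illustrative stand-in."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim + 1, 256),
                                 nn.SiLU(), nn.Linear(256, dim))

    def forward(self, xt, t, cond):
        return self.net(torch.cat([xt, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    """x1: target prosody features (batch, dim); cond: text/speaker encoding."""
    x0 = torch.randn_like(x1)            # noise endpoint
    t = torch.rand(x1.size(0), 1)        # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1           # point on the straight path
    target = x1 - x0                     # constant velocity of that path
    return ((model(xt, t, cond) - target) ** 2).mean()

model = VelocityNet(dim=4, cond_dim=8)
cfm_loss(model, torch.randn(16, 4), torch.randn(16, 8)).backward()
```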
arXiv Detail & Related papers (2025-06-30T19:52:32Z) - Contextual Biasing to Improve Domain-specific Custom Vocabulary Audio Transcription without Explicit Fine-Tuning of Whisper Model [0.0]
OpenAI's Whisper Automated Speech Recognition model excels in generalizing across diverse datasets and domains.
We propose a method to enhance transcription accuracy without explicit fine-tuning or altering model parameters.
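One well-known way to bias Whisper toward custom vocabulary without touching its weights is to pass domain terms through `initial_prompt`; the hedged sketch below illustrates that general idea and is not necessarily the paper's exact method. The audio file and term list are hypothetical.

```python
import whisper  # pip install openai-whisper

model = whisper.load_model("base")
# Hypothetical domain vocabulary; the prompt nudges decoding toward these terms
domain_terms = "RepeaTTS, prosody, Icelandic, vocoder"
result = model.transcribe("meeting.wav", initial_prompt=domain_terms)  # hypothetical file
print(result["text"])
```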
arXiv Detail & Related papers (2024-10-24T01:58:11Z) - High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models [56.00939852727501]
Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations.
A non-autoregressive framework enhances controllability, and a duration diffusion model enables diversified prosodic expression.
arXiv Detail & Related papers (2023-09-27T09:27:03Z) - ContraFeat: Contrasting Deep Features for Semantic Discovery [102.4163768995288]
StyleGAN has shown strong potential for disentangled semantic control.
Existing semantic discovery methods on StyleGAN rely on manual selection of modified latent layers to obtain satisfactory manipulation results.
We propose a model that automates this process and achieves state-of-the-art semantic discovery performance.
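For orientation, semantic manipulation in StyleGAN typically amounts to shifting per-layer style codes along a direction at selected layers, as in this hedged sketch; ContraFeat's contribution is discovering the direction and layers automatically, whereas here both are placeholders.

```python
import torch
import torch.nn.functional as F

def apply_direction(w: torch.Tensor, direction: torch.Tensor,
                    strength: float, layers: slice) -> torch.Tensor:
    """w: (n_layers, 512) per-layer style codes; direction: (512,) unit vector."""
    w_edit = w.clone()
    w_edit[layers] += strength * direction   # shift only the selected layers
    return w_edit

w = torch.randn(18, 512)                             # typical StyleGAN2 w+ code
direction = F.normalize(torch.randn(512), dim=0)     # placeholder direction
w_edited = apply_direction(w, direction, strength=3.0, layers=slice(4, 8))
```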
arXiv Detail & Related papers (2022-12-14T15:22:13Z) - Disentangling Generative Factors in Natural Language with Discrete Variational Autoencoders [0.0]
We argue that continuous variables may not be ideal for modelling features of textual data, because most generative factors in text are discrete.
We propose a Variational Autoencoder based method which models language features as discrete variables and encourages independence between variables for learning disentangled representations.
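A common way to realise such discrete latent variables is the Gumbel-softmax relaxation, sketched below with an assumed encoder output shape; the paper's exact model may differ.

```python
import torch
import torch.nn.functional as F

batch, n_vars, n_cats = 32, 6, 10                         # assumed latent layout
logits = torch.randn(batch, n_vars, n_cats)               # stand-in encoder output
z = F.gumbel_softmax(logits, tau=1.0, hard=True, dim=-1)  # one-hot samples, straight-through
# KL against a uniform categorical prior; encourages balanced, independent codes
log_q = F.log_softmax(logits, dim=-1)
kl = (log_q.exp() * (log_q + torch.log(torch.tensor(float(n_cats))))).sum(-1).mean()
```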
arXiv Detail & Related papers (2021-09-15T09:10:05Z) - Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis [68.76620947298595]
Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text.
We propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody.
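The three primary acoustic correlates of prosody are conventionally taken to be F0, energy, and duration; this hedged sketch measures rough frame-level versions of the first two with librosa (the file name is hypothetical, and per-phone durations would normally come from a forced aligner).

```python
import librosa

y, sr = librosa.load("utterance.wav", sr=22050)  # hypothetical file
f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)  # fundamental frequency
energy = librosa.feature.rms(y=y)[0]             # frame-level energy
duration_s = len(y) / sr                         # utterance duration; per-phone values need an aligner
```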
arXiv Detail & Related papers (2021-06-15T18:03:48Z) - On the Lack of Robust Interpretability of Neural Text Classifiers [14.685352584216757]
We assess the robustness of interpretations of neural text classifiers based on pretrained Transformer encoders.
Both tests show surprising deviations from expected behavior, raising questions about the extent of insights that practitioners may draw from interpretations.
arXiv Detail & Related papers (2021-06-08T18:31:02Z) - Empowering Language Understanding with Counterfactual Reasoning [141.48592718583245]
We propose a Counterfactual Reasoning Model, which mimics the counterfactual thinking by learning from few counterfactual samples.
In particular, we devise a generation module to generate representative counterfactual samples for each factual sample, and a retrospective module to retrospect the model prediction by comparing the counterfactual and factual samples.
arXiv Detail & Related papers (2021-06-06T06:36:52Z) - Statistical Context-Dependent Units Boundary Correction for Corpus-based Unit-Selection Text-to-Speech [1.4337588659482519]
We present an innovative technique for speaker adaptation in order to improve the accuracy of segmentation with application to unit-selection Text-To-Speech (TTS) systems.
Unlike conventional techniques for speaker adaptation, we aim to use only context dependent characteristics extrapolated with linguistic analysis techniques.
arXiv Detail & Related papers (2020-03-05T12:42:13Z) - Generating diverse and natural text-to-speech samples using a quantized fine-grained VAE and auto-regressive prosody prior [53.69310441063162]
This paper proposes a sequential prior in a discrete latent space which can generate more natural-sounding samples.
We evaluate the approach using listening tests, objective metrics of automatic speech recognition (ASR) performance, and measurements of prosody attributes.
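At the core of such a quantized fine-grained VAE is a nearest-neighbour codebook lookup, whose index sequence the autoregressive prior can then model; shapes and names in this sketch are illustrative assumptions.

```python
import torch

def quantize(z: torch.Tensor, codebook: torch.Tensor):
    """z: (batch, frames, dim) fine-grained latents; codebook: (K, dim)."""
    dists = torch.cdist(z, codebook.unsqueeze(0).expand(z.size(0), -1, -1))
    codes = dists.argmin(dim=-1)        # (batch, frames) discrete indices
    return codebook[codes], codes       # quantised latents + index sequence

z_q, codes = quantize(torch.randn(8, 120, 64), torch.randn(256, 64))
```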
arXiv Detail & Related papers (2020-02-06T12:35:50Z)