Related papers: ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis

URL: http://arxiv.org/abs/2601.03632v1
Date: Wed, 07 Jan 2026 06:23:23 GMT
Title: ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis
Authors: Haitao Li, Chunxiang Jin, Chenglin Li, Wenhao Guan, Zhengxing Huang, Xie Chen,
Abstract summary: Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference.<n>We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS.
Score: 35.41874154907003
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. As a result, synthesizing speech with a desired style often requires carefully selecting reference audio, which is impractical when only limited or mismatched references are available. While recent controllable TTS methods attempt to address this issue, they typically rely on absolute style targets and discrete textual prompts, and therefore do not support continuous and reference-relative style control. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS. Our key insight is that effective style control requires first reducing the model's implicit dependence on reference style before introducing explicit control mechanisms. To this end, we introduce Decoupled Classifier-Free Guidance (DCFG), which independently controls text and reference guidance, reducing reliance on reference style while preserving text fidelity. On top of this, we apply style-specific LoRAs together with Orthogonal LoRA Fusion to enable continuous and disentangled multi-attribute control, and introduce a Timbre Consistency Optimization module to mitigate timbre drift caused by weakened reference guidance. Experiments show that ReStyle-TTS enables user-friendly, continuous, and relative control over pitch, energy, and multiple emotions while maintaining intelligibility and speaker timbre, and performs robustly in challenging mismatched reference-target style scenarios.

Related papers

HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis [17.743822016045446]
Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes.<n>We propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts.
arXiv Detail & Related papers (2025-09-30T06:31:12Z)
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for Zero-Shot Speech Synthesis [56.25862714128288]
This paper introduces textitMegaTTS 3, a zero-shot text-to-speech (TTS) system featuring an innovative sparse alignment algorithm.<n>Specifically, we provide sparse alignment boundaries to MegaTTS 3 to reduce the difficulty of alignment without limiting the search space.<n>Experiments demonstrate that MegaTTS 3 achieves state-of-the-art zero-shot TTS speech quality and supports highly flexible control over accent intensity.
arXiv Detail & Related papers (2025-02-26T08:22:00Z)
ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control [50.27383290553548]
ControlSpeech is a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style.<n>We show that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability.
arXiv Detail & Related papers (2024-06-03T11:15:16Z)
Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
Expressive text-to-speech (TTS) aims to synthesize speeches with human-like tones, moods, or even artistic attributes. Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts. We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
arXiv Detail & Related papers (2023-11-02T14:20:37Z)
Pruning Self-Attention for Zero-Shot Multi-Speaker Text-to-Speech [26.533600745910437]
We propose an effective pruning method for a transformer known as sparse attention, to improve the TTS model's generalization abilities. We also propose a new differentiable pruning method that allows the model to automatically learn the thresholds.
arXiv Detail & Related papers (2023-08-28T21:25:05Z)
Controllable speech synthesis by learning discrete phoneme-level prosodic representations [53.926969174260705]
We present a novel method for phoneme-level prosody control of F0 and duration using intuitive discrete labels. We propose an unsupervised prosodic clustering process which is used to discretize phoneme-level F0 and duration features from a multispeaker speech dataset.
arXiv Detail & Related papers (2022-11-29T15:43:36Z)
Prosodic Clustering for Phoneme-level Prosody Control in End-to-End Speech Synthesis [49.6007376399981]
We present a method for controlling the prosody at the phoneme level in an autoregressive attention-based text-to-speech system. The proposed method retains the high quality of generated speech, while allowing phoneme-level control of F0 and duration. By replacing the F0 cluster centroids with musical notes, the model can also provide control over the note and octave within the range of the speaker.
arXiv Detail & Related papers (2021-11-19T12:10:16Z)
Fine-grained style control in Transformer-based Text-to-speech Synthesis [78.92428622630861]
We present a novel architecture to realize fine-grained style control on the Transformer-based text-to-speech synthesis (TransformerTTS) We model the speaking style by extracting a time sequence of local style tokens (LST) from the reference speech. Experiments show that with fine-grained style control, our system performs better in terms of naturalness, intelligibility, and style transferability.
arXiv Detail & Related papers (2021-10-12T19:50:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.