HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis
- URL: http://arxiv.org/abs/2509.25842v1
- Date: Tue, 30 Sep 2025 06:31:12 GMT
- Title: HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis
- Authors: Ziyu Zhang, Hanzhao Li, Jingbin Hu, Wenhao Li, Lei Xie,
- Abstract summary: Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes. We propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts.
- Score: 17.743822016045446
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may hinder the full potential of controllable TTS systems. In this study, we use t-SNE analysis to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and subsequently subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding prediction approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.
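The abstract describes two components that are easy to sketch in code: a two-stage predictor that first infers a coarse (timbre-level) embedding from the text prompt and then refines it into a fine-grained style embedding, and a contrastive objective that aligns text and audio style embeddings. The snippet below is a minimal PyTorch sketch of that idea under stated assumptions; the module names, dimensions, conditioning scheme, and loss weighting are illustrative guesses, not the paper's actual implementation.

```python
# Minimal sketch of a two-stage hierarchical style embedding predictor with a
# contrastive text-audio alignment loss, in the spirit of the HiStyle abstract.
# All names, dimensions, and loss weights below are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalStylePredictor(nn.Module):
    def __init__(self, prompt_dim=768, coarse_dim=256, style_dim=256):
        super().__init__()
        # Stage 1: predict a coarse (timbre-level) embedding from the text prompt.
        self.coarse_head = nn.Sequential(
            nn.Linear(prompt_dim, 512), nn.ReLU(), nn.Linear(512, coarse_dim)
        )
        # Stage 2: refine into a fine-grained style embedding, conditioned on
        # both the prompt and the stage-1 output (hierarchical prediction).
        self.fine_head = nn.Sequential(
            nn.Linear(prompt_dim + coarse_dim, 512), nn.ReLU(), nn.Linear(512, style_dim)
        )

    def forward(self, prompt_emb):
        coarse = self.coarse_head(prompt_emb)                        # timbre-level cluster
        fine = self.fine_head(torch.cat([prompt_emb, coarse], -1))   # style-level refinement
        return coarse, fine


def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE loss: pulls matching text/audio style embeddings
    together and pushes apart the other pairs in the batch."""
    text_emb = F.normalize(text_emb, dim=-1)
    audio_emb = F.normalize(audio_emb, dim=-1)
    logits = text_emb @ audio_emb.t() / temperature
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


if __name__ == "__main__":
    predictor = HierarchicalStylePredictor()
    prompt_emb = torch.randn(8, 768)    # e.g. pooled text-prompt encoder output
    audio_style = torch.randn(8, 256)   # reference style embedding from the TTS model
    coarse, fine = predictor(prompt_emb)
    loss = F.mse_loss(fine, audio_style) + contrastive_loss(fine, audio_style)
    print(coarse.shape, fine.shape, loss.item())
```

The coarse-then-fine ordering mirrors the clustering pattern the paper reports (embeddings group by timbre first, then by style attributes); the exact way HiStyle conditions the second stage may differ.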
Related papers
- ReStyle-TTS: Relative and Continuous Style Control for Zero-Shot Speech Synthesis [35.41874154907003]
Zero-shot text-to-speech models can clone a speaker's timbre from a short reference audio, but they also strongly inherit the speaking style present in the reference. We propose ReStyle-TTS, a framework that enables continuous and reference-relative style control in zero-shot TTS.
arXiv Detail & Related papers (2026-01-07T06:23:23Z) - VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions [66.93932684284695]
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style. We present VStyle, a benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness.
arXiv Detail & Related papers (2025-09-09T14:28:58Z) - AutoStyle-TTS: Retrieval-Augmented Generation based Automatic Style Matching Text-to-Speech Synthesis [19.141058309358424]
This study proposes a text-to-speech (TTS) framework based on Retrieval-Augmented Generation (RAG) technology. We have constructed a speech style knowledge database containing high-quality speech samples in various contexts. This scheme uses embeddings, extracted by Llama, PER-LLM-Embedder, and Moka, to match against samples in the knowledge database, selecting the most appropriate speech style for synthesis.
arXiv Detail & Related papers (2025-04-14T15:18:59Z) - StyleSpeech: Parameter-efficient Fine Tuning for Pre-trained Controllable Text-to-Speech [13.713209707407712]
StyleSpeech is a novel Text-to-Speech (TTS) system that enhances the naturalness and accuracy of synthesized speech.
Building upon existing TTS technologies, StyleSpeech incorporates a unique Style Decorator structure that enables deep learning models to simultaneously learn style and phoneme features.
LoRA allows efficient adaptation of style features in pre-trained models; a generic sketch of the LoRA mechanism appears after this list.
arXiv Detail & Related papers (2024-08-27T00:37:07Z) - ControlSpeech: Towards Simultaneous and Independent Zero-shot Speaker Cloning and Zero-shot Language Style Control [50.27383290553548]
ControlSpeech is a text-to-speech (TTS) system capable of fully cloning the speaker's voice and enabling arbitrary control and adjustment of speaking style. We show that ControlSpeech exhibits comparable or state-of-the-art (SOTA) performance in terms of controllability, timbre similarity, audio quality, robustness, and generalizability.
arXiv Detail & Related papers (2024-06-03T11:15:16Z) - StyleSpeech: Self-supervised Style Enhancing with VQ-VAE-based Pre-training for Expressive Audiobook Speech Synthesis [63.019962126807116]
The expressive quality of synthesized speech for audiobooks is limited by generalized model architectures and unbalanced style distributions.
We propose a self-supervised style enhancing method with VQ-VAE-based pre-training for expressive audiobook speech synthesis.
arXiv Detail & Related papers (2023-12-19T14:13:26Z) - Expressive TTS Driven by Natural Language Prompts Using Few Human Annotations [12.891344121936902]
Expressive text-to-speech (TTS) aims to synthesize speech with human-like tones, moods, or even artistic attributes.
Recent advancements in TTS empower users with the ability to directly control synthesis style through natural language prompts.
We present FreeStyleTTS (FS-TTS), a controllable expressive TTS model with minimal human annotations.
arXiv Detail & Related papers (2023-11-02T14:20:37Z) - ParaGuide: Guided Diffusion Paraphrasers for Plug-and-Play Textual Style Transfer [57.6482608202409]
Textual style transfer is the task of transforming stylistic properties of text while preserving meaning.
We introduce a novel diffusion-based framework for general-purpose style transfer that can be flexibly adapted to arbitrary target styles.
We validate the method on the Enron Email Corpus, with both human and automatic evaluations, and find that it outperforms strong baselines on formality, sentiment, and even authorship style transfer.
arXiv Detail & Related papers (2023-08-29T17:36:02Z) - TextrolSpeech: A Text Style Control Speech Corpus With Codec Language Text-to-Speech Models [51.529485094900934]
We propose TextrolSpeech, which is the first large-scale speech emotion dataset annotated with rich text attributes.
We introduce a multi-stage prompt programming approach that effectively utilizes the GPT model for generating natural style descriptions in large volumes.
To address the need for generating audio with greater style diversity, we propose an efficient architecture called Salle.
arXiv Detail & Related papers (2023-08-28T09:06:32Z) - GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents [3.229105662984031]
GestureDiffuCLIP is a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control.
Our system learns a latent diffusion model to generate high-quality gestures and infuses the CLIP representations of style into the generator.
Our system can be extended to allow fine-grained style control of individual body parts.
arXiv Detail & Related papers (2023-03-26T03:35:46Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z)
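The StyleSpeech entry above mentions LoRA for parameter-efficient adaptation of style features. As a point of reference, the snippet below is a minimal, generic sketch of the LoRA mechanism: a frozen pre-trained linear layer plus a small trainable low-rank update. It illustrates the technique in general; the rank, scaling, and choice of adapted layer are assumptions, not that paper's configuration.

```python
# Generic LoRA sketch: frozen base weights plus a trainable low-rank correction.
# Layer names, rank, and scaling below are illustrative assumptions.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pre-trained weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path + scaled low-rank update; only lora_a/lora_b are trained.
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


if __name__ == "__main__":
    pretrained = nn.Linear(256, 256)           # stands in for a style-related projection
    adapted = LoRALinear(pretrained, rank=8)
    x = torch.randn(4, 256)
    print(adapted(x).shape)                    # torch.Size([4, 256])
    trainable = sum(p.numel() for p in adapted.parameters() if p.requires_grad)
    print("trainable params:", trainable)      # far fewer than full fine-tuning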
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.