Fugu-MT Paper Translation (Abstract): HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

Paper summary: HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis

arxiv url: http://arxiv.org/abs/2509.25842v1
Date: Tue, 30 Sep 2025 06:31:12 GMT
Status: translation complete
Last updated in system: 2025-10-01 14:45:00.04172
Title: HiStyle: Hierarchical Style Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis
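Title (reference translation): HiStyle: A Hierarchical Embedding Predictor for Text-Prompt-Guided Controllable Speech Synthesis
Authors: Ziyu Zhang, Hanzhao Li, Jingbin Hu, Wenhao Li, Lei Xie
Abstract summary: Controllable speech synthesis refers to precise control of speaking style by manipulating specific prosodic and paralinguistic attributes. The paper proposes HiStyle, a two-stage embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts.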
Reference score (independently computed attention measure): 17.743822016045446
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Controllable speech synthesis refers to the precise control of speaking style by manipulating specific prosodic and paralinguistic attributes, such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, particularly large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have increasingly transitioned from label-based control to natural language description-based control, which is typically implemented by predicting global style embeddings from textual prompts. However, this straightforward prediction overlooks the underlying distribution of the style embeddings, which may hinder the full potential of controllable TTS systems. In this study, we use t-SNE analysis to visualize and analyze the global style embedding distribution of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and subsequently subdivide into finer clusters based on style attributes. Based on this observation, we propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and further incorporate contrastive learning to help align the text and audio embedding spaces. Additionally, we propose a style annotation strategy that leverages the complementary strengths of statistical methodologies and human auditory preferences to generate more accurate and perceptually consistent textual prompts for style control. Comprehensive experiments demonstrate that when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding predicting approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.
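Abstract (reference translation): Controllable speech synthesis means precisely controlling speaking style by manipulating specific prosodic and paralinguistic attributes such as gender, volume, speech rate, pitch, and pitch fluctuation. With the integration of advanced generative models, in particular large language models (LLMs) and diffusion models, controllable text-to-speech (TTS) systems have moved from label-based control to control based on natural language descriptions, typically implemented by predicting global style embeddings from textual prompts. This direct prediction, however, overlooks the underlying distribution of the style embeddings, which may keep controllable TTS systems from reaching their full potential. In this study, t-SNE analysis is used to visualize and analyze the global style embedding distributions of various mainstream TTS systems, revealing a clear hierarchical clustering pattern: embeddings first cluster by timbre and then subdivide into finer clusters by style attributes. Based on this observation, the authors propose HiStyle, a two-stage style embedding predictor that hierarchically predicts style embeddings conditioned on textual prompts, and incorporate contrastive learning to help align the text and audio embedding spaces. They also propose a style annotation strategy that combines the complementary strengths of statistical methods and human auditory preferences to produce more accurate and perceptually consistent textual prompts. Comprehensive experiments show that, when applied to the base TTS model, HiStyle achieves significantly better style controllability than alternative style embedding prediction approaches while preserving high speech quality in terms of naturalness and intelligibility. Audio samples are available at https://anonymous.4open.science/w/HiStyle-2517/.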
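The abstract mentions contrastive learning for aligning the text and audio embedding spaces but does not state the loss. As a rough illustration only, the sketch below uses a symmetric InfoNCE-style loss over paired text and audio embeddings. The choice of InfoNCE, the function names, and the temperature value are all assumptions made for this example and are not taken from the paper.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    """Numerically stable log-softmax of a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(text_emb, audio_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not the paper's stated loss).

    text_emb[i] and audio_emb[i] are treated as a matching pair; every other
    combination in the batch is treated as a negative.
    """
    t = [l2_normalize(v) for v in text_emb]
    a = [l2_normalize(v) for v in audio_emb]
    n = len(t)
    # logits[i][j]: temperature-scaled cosine similarity of text i and audio j
    logits = [
        [sum(x * y for x, y in zip(t[i], a[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Text-to-audio direction: each row should peak on its diagonal entry.
    text_to_audio = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Audio-to-text direction: same computation over the columns.
    columns = [list(col) for col in zip(*logits)]
    audio_to_text = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (text_to_audio + audio_to_text)

# Toy batch: correctly paired embeddings versus the same embeddings swapped.
aligned = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
```

In this toy batch the correctly paired embeddings give a much lower loss than the swapped ones, which is the behavior a text-audio alignment objective of this kind is meant to reward.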

Related papers