Fugu-MT Paper Translation (Abstract): MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

Paper overview: MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

arxiv url: http://arxiv.org/abs/2511.12074v2
Date: Wed, 19 Nov 2025 14:50:05 GMT
Status: Translation complete
Last updated in system: 2025-11-20 13:41:21.093104
Title: MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement
Title (reference translation): MF-Speech: Realizing Fine-Grained and Compositional Control in Speech Generation through Factor Disentanglement
Authors: Xinyue Yu, Youqing Fang, Pingyu Wu, Guoyang Ye, Wenbo Zhou, Weiming Zhang, Song Xiao
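Abstract summary: This paper proposes a new framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder adopts a multi-objective optimization strategy to decompose the original speech signal into highly pure representations of content, timbre, and emotion. MF-SpeechGenerator acts as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization.
Reference score (independently computed attention score): 31.756885606945847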
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we have proposed a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
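Abstract (reference translation): Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but progress has long been held back by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we proposed a new framework called MF-Speech, made up of two core components, MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder adopts a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. MF-SpeechGenerator then acts as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments show that on the very difficult multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). In addition, the learned discrete factors show strong transferability, indicating significant potential as a general-purpose speech representation.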
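For readers unfamiliar with the objective metrics quoted in the abstract, the following is a minimal sketch of how a word error rate and an embedding cosine similarity are conventionally computed. It is not the paper's evaluation code: the function names `word_error_rate` and `cosine_similarity` are hypothetical, whitespace tokenization is a simplification, and reading SECS as a cosine similarity between speaker embeddings is an assumption, since the abstract does not expand the acronym.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Assumption: words are split on whitespace with no text normalization;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("vectors must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Toy 2-dimensional vectors standing in for speaker embeddings.
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

Under this convention a lower WER means more intelligible output and a higher similarity means the generated voice is closer to the target, which matches the direction of the comparisons reported in the abstract.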


Related papers