Fugu-MT Paper Translation (Abstract): UniVoice: A Unified Model for Speech and Singing Voice Generation

Paper abstract: UniVoice: A Unified Model for Speech and Singing Voice Generation

arxiv url: http://arxiv.org/abs/2606.05852v1
Date: Thu, 04 Jun 2026 08:27:17 GMT
Status: Translation complete
Last updated in system: 2026-06-05 22:39:44.646832
Title: UniVoice: A Unified Model for Speech and Singing Voice Generation
Authors: Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen,
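Abstract summary: We propose UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token. UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%).
Reference score (independently computed attention score): 11.813888122703302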
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
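
The conditioning scheme described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the `<null_melody>` string, the per-frame note expansion, and the linear interpolation path used for the flow matching target are all assumptions made for illustration. The abstract only states that the condition is factorized into content, melody, and timbre, that speech uses a learned null melody token in place of MIDI notes, and that training uses conditional flow matching.

```python
# Hypothetical placeholder for the learned null melody token (name is an assumption).
NULL_MELODY = "<null_melody>"


def build_condition(content, timbre, midi_notes=None):
    """Assemble a factorized condition: content, melody, timbre.

    Singing: midi_notes is a list of (pitch, duration_in_frames) pairs.
    Speech:  midi_notes is None, so the melody slot holds only the null token.
    """
    if midi_notes is None:
        melody = [NULL_MELODY]
    else:
        # Expand each note to one symbol per frame (an assumed representation).
        melody = [f"note_{pitch}" for pitch, frames in midi_notes for _ in range(frames)]
    return {"content": list(content), "melody": melody, "timbre": timbre}


def cfm_pair(x0, x1, t):
    """Point on an assumed linear path x_t = (1 - t) * x0 + t * x1,
    together with its constant target velocity x1 - x0."""
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    velocity = [b - a for a, b in zip(x0, x1)]
    return x_t, velocity


def cfm_loss(predict_velocity, x0, x1, t, cond):
    """Mean squared error between the predicted and the target velocity."""
    x_t, velocity = cfm_pair(x0, x1, t)
    pred = predict_velocity(x_t, t, cond)
    return sum((p - q) ** 2 for p, q in zip(pred, velocity)) / len(velocity)


# Speech: no MIDI input, so the melody condition falls back to the null token.
speech_cond = build_condition(["h", "e", "l", "o"], timbre="speaker_a")
# Singing: two notes, held for 2 frames and 1 frame respectively.
singing_cond = build_condition(["l", "a"], timbre="speaker_a", midi_notes=[(60, 2), (62, 1)])
```

In a real system `predict_velocity` would be the shared DiT backbone and the conditions would be encoder outputs rather than strings; the sketch only shows where the null token replaces the melody input and what the regression target looks like under the assumed path.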


Related papers