Fugu-MT Paper Translation (Abstract): DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

Paper overview: DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech

arxiv url: http://arxiv.org/abs/2509.09631v2
Date: Fri, 12 Sep 2025 01:59:18 GMT
Status: Translation complete
Last updated in system: 2025-09-15 12:05:48.676758
Title: DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Text-To-Speech
Title (reference translation): DiFlow-TTS: Discrete Flow Matching with Factorized Speech Tokens for Low-Latency Zero-Shot Speech Synthesis
Authors: Ngoc-Son Nguyen, Hieu-Nghia Huynh-Nguyen, Thanh V. T. Tran, Truong-Son Hy, Van Nguyen
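Abstract summary: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. The paper introduces DiFlow-TTS, which, to the best of the authors' knowledge, is the first model to explore purely discrete flow matching for speech synthesis.
Reference score (independently computed attention metric): 8.537791317883576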
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, requiring not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations have been widely adopted for speech synthesis, and recent works have begun to explore diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely Discrete Flow Matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content, along with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results demonstrate that DiFlow-TTS achieves promising performance in several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.
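Abstract (reference translation): Zero-shot Text-to-Speech (TTS) aims to synthesize high-quality speech that mimics the voice of an unseen speaker using only a short reference sample, which requires not only speaker adaptation but also accurate modeling of prosodic attributes. Recent approaches based on language models, diffusion, and flow matching have shown promising results in zero-shot TTS, but still suffer from slow inference and repetition artifacts. Discrete codec representations are widely adopted for speech synthesis, and recent work has begun to examine diffusion models in purely discrete settings, suggesting the potential of discrete generative modeling for speech synthesis. However, existing flow-matching methods typically embed these discrete tokens into a continuous space and apply continuous flow matching, which may not fully leverage the advantages of discrete representations. To address these challenges, we introduce DiFlow-TTS, which, to the best of our knowledge, is the first model to explore purely discrete flow matching for speech synthesis. DiFlow-TTS explicitly models factorized speech attributes within a compact and unified architecture. It leverages in-context learning by conditioning on textual content together with prosodic and acoustic attributes extracted from a reference speech, enabling effective attribute cloning in a zero-shot setting. In addition, the model employs a factorized flow prediction mechanism with distinct heads for prosody and acoustic details, allowing it to learn aspect-specific distributions. Experimental results show that DiFlow-TTS achieves promising performance on several key metrics, including naturalness, prosody, preservation of speaker style, and energy control. It also maintains a compact model size and achieves low-latency inference, generating speech up to 25.8 times faster than the latest existing baselines.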
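
The abstract names two ideas, discrete flow matching and separate prediction heads for prosody and acoustic detail, without giving the sampler. The sketch below is only an illustration of how those ideas can fit together, not the authors' implementation. It assumes a masked source state, a linear schedule kappa(t) = t, and an Euler sampler in which each position jumps to a sample from the predicted clean-token distribution with probability h / (1 - t). The names `MASK`, `sample_factorized`, and `one_hot_head` are hypothetical, and the toy heads stand in for a trained network.

```python
import random

MASK = -1  # hypothetical id for the all-masked source state


def sample_factorized(length, steps, heads, seed=0):
    """Euler sampler for a masked discrete flow with schedule kappa(t) = t.

    `heads` maps a stream name (e.g. "prosody", "acoustic") to a function
    head(state, t) -> list of per-position probability vectors over token ids.
    This is an assumed interface, not one taken from the paper.
    """
    rng = random.Random(seed)
    state = {name: [MASK] * length for name in heads}
    for i in range(steps):
        t = i / steps
        # h / (1 - t) with h = 1 / steps simplifies to 1 / (steps - i),
        # which is exactly 1.0 on the final step, so no MASK survives.
        jump_prob = 1.0 / (steps - i)
        new_state = {}
        for name, head in heads.items():
            dists = head(state, t)  # each head sees all streams
            stream = []
            for tok, dist in zip(state[name], dists):
                if rng.random() < jump_prob:
                    tok = rng.choices(range(len(dist)), weights=dist, k=1)[0]
                stream.append(tok)
            new_state[name] = stream
        state = new_state
    return state


def one_hot_head(target, vocab_size):
    """Toy stand-in for a trained head: always predicts `target`."""
    def head(state, t):
        return [[1.0 if v == tok else 0.0 for v in range(vocab_size)]
                for tok in target]
    return head


heads = {
    "prosody": one_hot_head([2, 0, 1, 1], vocab_size=3),
    "acoustic": one_hot_head([4, 4, 0, 3], vocab_size=5),
}
result = sample_factorized(length=4, steps=8, heads=heads)
print(result)  # {'prosody': [2, 0, 1, 1], 'acoustic': [4, 4, 0, 3]}
```

Because the toy heads are one-hot and the final step always jumps, the output is deterministic. With a real model, each head would return a learned distribution conditioned on the text and the reference speech, and fewer sampling steps would trade quality for latency.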

Related papers