Fugu-MT Paper Translation (Abstract): Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

Paper overview: Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech

arxiv url: http://arxiv.org/abs/2510.02848v1
Date: Fri, 03 Oct 2025 09:36:55 GMT
Status: Translation complete
Last updated in system: 2025-10-06 16:35:52.337654
Title: Flamed-TTS: Flow Matching Attention-Free Models for Efficient Generating and Dynamic Pacing Zero-shot Text-to-Speech
Title (reference translation): Flamed-TTS: Flow Matching Attention-Free Models for Efficient Zero-Shot Text-to-Speech
Authors: Hieu-Nghia Huynh-Nguyen, Huynh Nguyen Dang, Ngoc-Son Nguyen, Van Nguyen,
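Abstract summary: Flamed-TTS is a novel zero-shot text-to-speech framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. Experiments show that Flamed-TTS surpasses state-of-the-art models in intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace.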
Reference score (independently calculated attention score): 2.5964779217812057
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Zero-shot Text-to-Speech (TTS) has recently advanced significantly, enabling models to synthesize speech from text using short, limited-context prompts. These prompts serve as voice exemplars, allowing the model to mimic speaker identity, prosody, and other traits without extensive speaker-specific data. Although recent approaches incorporating language models, diffusion, and flow matching have proven their effectiveness in zero-shot TTS, they still encounter challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, along with slow inference and substantial computational overhead. Moreover, temporal diversity, which is crucial for enhancing the naturalness of synthesized speech, remains largely underexplored. To address these challenges, we propose Flamed-TTS, a novel zero-shot TTS framework that emphasizes low computational cost, low latency, and high speech fidelity alongside rich temporal diversity. To achieve this, we reformulate the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results demonstrate that Flamed-TTS surpasses state-of-the-art models in terms of intelligibility, naturalness, speaker similarity, acoustic characteristics preservation, and dynamic pace. Notably, Flamed-TTS achieves the best WER of 4% compared to the leading zero-shot TTS baselines, while maintaining low latency in inference and high fidelity in generated speech. Code and audio samples are available at our demo page https://flamed-tts.github.io.
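Abstract (reference translation): Zero-shot Text-to-Speech (TTS) has advanced considerably in recent years, making it possible to synthesize speech from text using short prompts with limited context. These prompts act as voice samples, letting the model imitate speaker identity, prosody, and other characteristics without extensive speaker-specific data. Recent approaches that incorporate language models, diffusion, and flow matching have proven effective for zero-shot TTS, but they still face challenges such as unreliable synthesis caused by token repetition or unexpected content transfer, as well as slow inference and substantial computational overhead. In addition, temporal diversity, which is important for making synthesized speech sound natural, has been left largely unexplored. To address these challenges, we propose Flamed-TTS, a new zero-shot TTS framework that combines low computational cost, low latency, and high speech fidelity with rich temporal diversity. To achieve this, we restructure the flow matching training paradigm and incorporate both discrete and continuous representations corresponding to different attributes of speech. Experimental results show that Flamed-TTS outperforms state-of-the-art models in intelligibility, naturalness, speaker similarity, preservation of acoustic characteristics, and dynamic pace. In particular, Flamed-TTS achieves a WER of 4%, the best among the leading zero-shot TTS baselines, while maintaining low inference latency and high fidelity in the generated speech. Code and audio samples are available on our demo page at https://flamed-tts.github.io.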

Related papers