Fugu-MT paper translation (abstract): Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

Paper summary: Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead

arxiv url: http://arxiv.org/abs/2606.21882v1
Date: Sat, 20 Jun 2026 04:47:45 GMT
Status: translation complete
Last updated in system: 2026-06-26 02:27:16.134888
Title: Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead
Title (reference translation): T5 Text-to-Speech Synthesis with Limited Lookahead
Authors: Muyang Du, Jason Roche, Junjie Lai
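Abstract summary: This paper proposes S5-TTS, a streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis. S5-TTS begins generating speech immediately after receiving the first few words, substantially reducing end-to-end response latency. Experiments show that S5-TTS achieves quality comparable to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical AI systems.
Reference score (independently computed attention score): 4.740962650068887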
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Streaming text-to-speech synthesis in cascaded LLM-TTS systems still faces latency challenges as most TTS models require full context before initiating generation. We present S5-TTS, a streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech immediately after receiving the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention that preserves intelligibility and speaker similarity, and employ interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves comparable quality to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency for practical conversational AI systems.
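Abstract (reference translation): Streaming text-to-speech synthesis in LLM-TTS systems still faces latency problems, because most TTS models require the full context before they start generating. We propose S5-TTS, a streaming variant of T5-TTS that enables low-latency, word-by-word incremental speech synthesis through encoder-decoder language modeling and monotonic alignment learning. S5-TTS begins generating speech immediately after receiving the first few words, substantially reducing end-to-end response latency. To maintain quality under limited lookahead, we introduce a lookahead-causal masking mechanism with Conv-based auxiliary attention, and use interleaved multi-source distillation to further restore naturalness. Experiments show that S5-TTS achieves quality comparable to full-context T5-TTS, supports zero-shot synthesis with high speaker similarity, and significantly reduces end-to-end latency in practical conversational AI systems.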
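
Illustrative note: the abstract names a lookahead-causal masking mechanism but does not specify it. The sketch below shows only the generic idea of an attention mask with a bounded look-ahead window, under the assumption that position i may attend to all earlier positions, itself, and at most `lookahead` future positions. The function name `lookahead_causal_mask` and both parameters are hypothetical and are not taken from the paper.

```python
def lookahead_causal_mask(num_positions: int, lookahead: int) -> list[list[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption (not from the paper): position i sees every earlier position,
    itself, and at most `lookahead` future positions.
    """
    if num_positions < 0 or lookahead < 0:
        raise ValueError("num_positions and lookahead must be non-negative")
    return [
        [j <= i + lookahead for j in range(num_positions)]
        for i in range(num_positions)
    ]


if __name__ == "__main__":
    # Print the mask for 5 positions with a look-ahead of 1:
    # "x" marks an allowed attention link, "." a blocked one.
    for row in lookahead_causal_mask(num_positions=5, lookahead=1):
        print("".join("x" if allowed else "." for allowed in row))
```

With `lookahead=0` this reduces to an ordinary causal mask, and a `lookahead` of at least `num_positions - 1` gives full-context attention, which is the trade-off between latency and available context that the abstract describes.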


Related papers