Fugu-MT paper translation (abstract): PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

Paper overview: PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis

arxiv url: http://arxiv.org/abs/2605.27258v2
Date: Wed, 27 May 2026 06:28:23 GMT
Status: Translation complete
Last updated in system: 2026-05-28 17:38:55.164577
Title: PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis
Authors: Bowen Li, Shaotong Guo, Zhen Wang, Yang Xiang, Mingli Jin, Yihang Lin, Jiahui Zhao, Weibo Xiong, Dongrui Zhang, Keming Chen, Yunze Gao, Zeyang Lin, Yuze Zhou, Yue Liu
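Abstract summary: This report presents PilotTTS, a lightweight autoregressive text-to-speech system. PilotTTS achieves competitive performance through a minimalist architecture and rigorous data engineering. It is trained on only 200K hours of data processed entirely with open-source tools.
Reference score (independently calculated attention score): 16.983092337406138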
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Building state-of-the-art text-to-speech (TTS) systems typically demands millions of hours of proprietary data and complex multi-stage architectures, creating substantial barriers for resource-constrained research teams. In this report, we present PilotTTS, a lightweight autoregressive TTS system that achieves competitive performance through minimalist architecture and rigorous data engineering. PilotTTS is trained on only 200K hours of data processed entirely with open-source tools. Specifically, our contributions are: (1) a reproducible multi-stage data processing pipeline covering quality assessment, label annotation, and filtering, and (2) a compact model architecture that employs Q-Former-based conditioning to decouple speaker identity from speaking style via cross-sample paired training. Within a unified framework, PilotTTS supports zero-shot voice cloning, emotion synthesis (11 categories), paralinguistic synthesis (4 categories), and Chinese dialect synthesis (14 dialects). On the Seed-TTS Eval benchmark, PilotTTS achieves the lowest WER of 1.50% on test-en, a CER of 0.87% on test-zh, and the highest speaker similarity on both test sets (0.862 and 0.815), outperforming systems trained on significantly larger datasets. We release the complete data pipeline recipe, pretrained weights, and code at https://github.com/AMAPVOICE/PilotTTS.
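The abstract reports intelligibility as word error rate (WER) on test-en and character error rate (CER) on test-zh. The sketch below shows how these two metrics are conventionally defined: the edit distance between a reference text and a hypothesis (typically an ASR transcript of the synthesized audio), divided by the reference length. It is a minimal illustration, not the benchmark's scoring code; the function names, the whitespace tokenization, and the lack of text normalization are assumptions made here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (unit cost per edit)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length.

    Spaces are dropped first (an assumption of this sketch), which suits
    text such as Chinese that has no word delimiters.
    """
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pairs, for illustration only.
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
    print(cer("今天天气很好", "今天天汽很好"))
```

Each example pair differs by one substitution in a six-token reference, so both rates come out to one in six. A reported figure such as 1.50% WER would come from applying the same ratio across a whole test set, with whatever normalization and aggregation the benchmark prescribes.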
