Fugu-MT Paper Translation (Abstract): ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

Paper overview: ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

arxiv url: http://arxiv.org/abs/2510.10774v1
Date: Sun, 12 Oct 2025 19:33:11 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:30.101107
Title: ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis
Authors: Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery
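Abstract summary: Existing Persian speech datasets are typically smaller than their English counterparts. ParsVoice is the largest Persian speech corpus designed specifically for text-to-speech synthesis. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora.
Reference score (independently computed attention score): 3.763275651955603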
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The Persian language, despite being spoken by over 100 million people worldwide, remains severely underrepresented in high-quality speech corpora, particularly for text-to-speech (TTS) synthesis applications. Existing Persian speech datasets are typically smaller than their English counterparts, which creates a key limitation for developing Persian speech technologies. We address this gap by introducing ParsVoice, the largest Persian speech corpus designed specifically for TTS applications. We created an automated pipeline that transforms raw audiobook content into TTS-ready data, incorporating components such as a BERT-based sentence completion detector, a binary search boundary optimization method for precise audio-text alignment, and multi-dimensional quality assessment frameworks tailored to Persian. The pipeline processes 2,000 audiobooks, yielding 3,526 hours of clean speech, which was further filtered into a 1,804-hour high-quality subset suitable for TTS, featuring more than 470 speakers. ParsVoice is the largest high-quality Persian speech dataset, offering speaker diversity and audio quality comparable to major English corpora. The complete dataset has been made publicly available to accelerate the development of Persian speech technologies and to serve as a template for other low-resource languages. The ParsVoice dataset is publicly available at https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.
