Fugu-MT paper translation (abstract): Tagarela - A Portuguese speech dataset from podcasts

Paper overview: Tagarela - A Portuguese speech dataset from podcasts

arxiv url: http://arxiv.org/abs/2603.15326v1
Date: Mon, 16 Mar 2026 14:18:55 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.423517
Title: Tagarela - A Portuguese speech dataset from podcasts
Authors: Frederico Santos de Oliveira, Lucas Rafael Stefanel Gris, Alef Iury Siqueira Ferreira, Augusto Seben da Rosa, Alexandre Costa Ferro Filho, Edresson Casanova, Christopher Dane Shulby, Rafael Teixeira Sousa, Diogo Fernandes Costa Silva, Anderson da Silva Soares, Arlindo Rodrigues Galvão Filho
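Abstract summary: This paper presents TAGARELA, a new dataset composed of over 8,972 hours of podcast audio. To ensure data quality, the corpus was passed through an audio pre-processing pipeline and then transcribed using a mixed strategy. To validate the effectiveness of this new resource, the authors present ASR and TTS models trained exclusively on the dataset and evaluate their performance.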
Reference score (independently calculated attention level): 32.28056892535881
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Despite significant advances in speech processing, Portuguese remains under-resourced due to the scarcity of public, large-scale, and high-quality datasets. To address this gap, we present a new dataset, named TAGARELA, composed of over 8,972 hours of podcast audio, specifically curated for training automatic speech recognition (ASR) and text-to-speech (TTS) models. Notably, its scale rivals English's GigaSpeech (10kh), enabling state-of-the-art Portuguese models. To ensure data quality, the corpus was subjected to an audio pre-processing pipeline and subsequently transcribed using a mixed strategy: we applied ASR models that were previously trained on high-fidelity transcriptions generated by proprietary APIs, ensuring a high level of initial accuracy. Finally, to validate the effectiveness of this new resource, we present ASR and TTS models trained exclusively on our dataset and evaluate their performance, demonstrating its potential to drive the development of more robust and natural speech technologies for Portuguese. The dataset is released publicly, available at https://freds0.github.io/TAGARELA/, to foster the development of robust speech technologies.
