Fugu-MT paper translation (abstract): ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

Paper overview: ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection

arxiv url: http://arxiv.org/abs/2509.22808v1
Date: Fri, 26 Sep 2025 18:11:20 GMT
Status: translation complete
Last updated in system: 2025-09-30 22:32:18.883729
Title: ArFake: A Multi-Dialect Benchmark and Baselines for Arabic Spoof-Speech Detection
Authors: Mohamed Maged, Alhassan Ehab, Ali Mekky, Besher Hassan, Shady Shehata,
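Abstract summary: We introduce the first multi-dialect Arabic spoofed speech dataset. The results show that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus.
Reference score (independently computed attention score): 2.5962590697722447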
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: With the rise of generative text-to-speech (TTS) models, distinguishing between real and synthetic speech has become challenging, especially for Arabic, which has received limited research attention. Most spoof detection efforts have focused on English, leaving a significant gap for Arabic and its many dialects. In this work, we introduce the first multi-dialect Arabic spoofed speech dataset. To evaluate the difficulty of the synthesized audio from each model and determine which produces the most challenging samples, we built an evaluation pipeline whose results guide the construction of the final dataset, either by merging audio from multiple models or by selecting the best-performing model. The pipeline trains classifiers using modern embedding-based methods combined with classifier heads, classical machine learning algorithms applied to MFCC features, and the RawNet2 architecture. It further computes the Mean Opinion Score (MOS) from human ratings and processes both the original and synthesized datasets through an Automatic Speech Recognition (ASR) model to measure the Word Error Rate (WER). Our results demonstrate that FishSpeech outperforms other TTS models in Arabic voice cloning on the Casablanca corpus, producing more realistic and challenging synthetic speech samples. However, relying on a single TTS model for dataset creation may limit generalizability.
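The abstract mentions measuring Word Error Rate by passing original and synthesized audio through an ASR model, but it does not name the ASR model or the scoring tool. As a rough illustration only, the standard WER definition (word-level edit distance divided by the number of reference words) could be sketched as follows. The function name `word_error_rate` and the whitespace tokenization are assumptions made for this example, not details taken from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Standard WER: (substitutions + deletions + insertions) / reference length.

    Hypothetical helper for illustration; tokenization is plain whitespace
    splitting, which is an assumption and may not suit Arabic text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

Comparing the WER of ASR transcripts for real recordings against the WER for their cloned counterparts gives one intelligibility signal per TTS model; a smaller gap suggests the synthetic speech is closer to the original in intelligibility.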


Related papers