Fugu-MT Paper Translation (Abstract): PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Paper overview: PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

arxiv url: http://arxiv.org/abs/2606.01016v1
Date: Sun, 31 May 2026 05:13:32 GMT
Status: Translation complete
Last updated in system: 2026-06-02 21:34:29.07512
Title: PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
Title (reference translation): PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across More Than 100 Languages and Dialects
Authors: Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu, Lu Fan, Zhi Li, You He
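Abstract summary: PolySpeech-100 is a large-scale benchmark designed to assess native-level speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech.
Reference score (independently calculated attention score): 29.32197370490759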
License: http://creativecommons.org/licenses/by/4.0/
Abstract: While End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are rapidly evolving, their evaluation methodologies remain limited to the era of simple transcription. Existing benchmarks suffer from three critical limitations: a pronounced bias towards high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and a neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a massive-scale benchmark designed to assess 'native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, allowing us to cover 19 distinct Chinese dialects and over 80 low-resource languages. Extensive evaluation of 22 state-of-the-art models (including Gemini-3, GPT-Audio, and Qwen2.5-Omni) yields pivotal insights. First, we demonstrate that open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, we reveal a significant performance gap: while commercial models maintain robustness, open-source models suffer catastrophic degradation on low-resource languages. Finally, counter-intuitively, we observe that under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.
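Abstract (reference translation): End-to-End (E2E) Speech-Large Language Models (Speech-LLMs) are evolving rapidly, but their evaluation methods remain confined to the era of simple transcription. Existing benchmarks have three critical limitations: a pronounced bias toward high-resource languages, a focus on low-level recognition (ASR) rather than semantic reasoning, and neglect of regional dialects. To bridge this gap, we introduce PolySpeech-100, a large-scale benchmark designed to assess 'native-level' speech comprehension across 110 linguistic variants. We employ a novel hybrid construction pipeline that augments gold-standard human recordings with instruction-driven synthetic speech, which lets us cover 19 distinct Chinese dialects and over 80 low-resource languages. A large-scale evaluation of 22 state-of-the-art models, including Gemini-3, GPT-Audio, and Qwen2.5-Omni, yields key insights. First, open-source E2E models outperform Cascade (ASR+LLM) systems on heavy dialects, proving that direct audio processing preserves critical paralinguistic cues and prosodic features (e.g., intonation, stress) that are often lost in standard transcription. Second, there is a significant performance gap: commercial models remain robust, whereas open-source models suffer catastrophic degradation on low-resource languages. Finally, under standard zero-shot settings, Chain-of-Thought prompting frequently degrades speech understanding performance for most evaluated models, revealing a potential modality alignment gap in current architectures. PolySpeech-100 establishes a rigorous standard for the next generation of inclusive, omni-capable Speech-LLMs. The data, demo, and code are publicly available at https://github.com/YoungSeng/PolySpeech-100.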

Related papers