Fugu-MT Paper Translation (Abstract): Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

Paper overview: Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models

arxiv url: http://arxiv.org/abs/2510.12116v1
Date: Tue, 14 Oct 2025 03:34:38 GMT
Status: Translation complete
Last updated in system: 2025-10-15 19:02:32.176143
Title: Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models
Title (reference translation): Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Language Models
Authors: Bajian Xiang, Shuaijiang Zhao, Tingwei Guo, Wei Zou
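Abstract summary: End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities. The study analyzes both coarse- and fine-grained text and speech representations, and finds that representation similarity is strongly correlated with the modality gap.
Reference score (attention score computed independently by this site): 12.263637152835713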
License: http://creativecommons.org/licenses/by/4.0/
Abstract: End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities, yet consistently fall short of traditional pipeline systems on semantic understanding benchmarks. In this work, we reveal through systematic experimentation that although LSLMs lose some text input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced, which we refer to as the modality gap. To understand this gap, we analyze both coarse- and fine-grained text and speech representations. At the coarse-grained level, representations of speech and text in deeper layers are found to be increasingly aligned in direction (cosine similarity), while concurrently diverging in magnitude (Euclidean distance). We further find that representation similarity is strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, we introduce the Alignment Path Score to quantify token-level alignment quality, which exhibits stronger correlation with the modality gap. Building on these insights, we design targeted interventions on critical tokens through angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. Our study provides the first systematic empirical analysis of the modality gap and alignment mechanisms in LSLMs, offering both theoretical and methodological guidance for future optimization.
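Abstract (reference translation): End-to-end Large Speech Language Models (LSLMs) demonstrate conversational generation abilities, but consistently fall short of traditional pipeline systems on semantic understanding benchmarks. Through systematic experiments, this work shows that although LSLMs lose some text-input performance after speech-text alignment training, the performance gap between speech and text inputs is more pronounced; the authors call this the modality gap. To understand the gap, both coarse-grained and fine-grained text and speech representations are analyzed. At the coarse-grained level, speech and text representations in deeper layers become increasingly aligned in direction (cosine similarity) while at the same time diverging in magnitude (Euclidean distance). Representation similarity is also found to be strongly correlated with the modality gap. At the fine-grained level, a spontaneous token-level alignment pattern between text and speech representations is observed. Based on this, the Alignment Path Score is introduced to quantify token-level alignment quality, and it shows a stronger correlation with the modality gap. Building on these findings, targeted interventions on critical tokens are designed using angle projection and length normalization. These strategies demonstrate the potential to improve correctness for speech inputs. The study provides a systematic analysis of the modality gap and alignment mechanisms in LSLMs, offering theoretical and methodological guidance for future optimization.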
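The coarse-grained finding (representations aligned in direction but diverging in magnitude) can be illustrated with a small sketch. This is not the authors' code: the vectors are toy values, and `length_normalize` is a hypothetical reading of "length normalization" (rescaling a speech vector to the norm of its text counterpart), since the abstract does not specify the procedure.

```python
import math


def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def cosine_similarity(a, b):
    """Direction agreement: 1.0 means the vectors point the same way."""
    return sum(x * y for x, y in zip(a, b)) / (norm(a) * norm(b))


def euclidean_distance(a, b):
    """Straight-line distance; sensitive to differences in magnitude."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def length_normalize(speech, text):
    """Hypothetical intervention: rescale `speech` to the L2 norm of `text`."""
    scale = norm(text) / norm(speech)
    return [x * scale for x in speech]


# Toy vectors (assumed, not from the paper): same direction, different length.
text_vec = [1.0, 2.0, 2.0]
speech_vec = [3.0 * x for x in text_vec]

cos = cosine_similarity(speech_vec, text_vec)
dist = euclidean_distance(speech_vec, text_vec)
dist_after = euclidean_distance(length_normalize(speech_vec, text_vec), text_vec)

print(f"cosine similarity:      {cos:.3f}")
print(f"Euclidean distance:     {dist:.3f}")
print(f"distance after rescale: {dist_after:.3f}")
```

In this toy case the cosine similarity is maximal even though the Euclidean distance is large, which is why the two measures can move in opposite directions across layers; rescaling the speech vector removes the magnitude difference without changing its direction.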
