Fugu-MT Paper Translation (Summary): Closing the Gap Between Text and Speech Understanding in LLMs

Paper summary: Closing the Gap Between Text and Speech Understanding in LLMs

arxiv url: http://arxiv.org/abs/2510.13632v1
Date: Wed, 15 Oct 2025 14:57:16 GMT
Status: translation complete
Last updated in system: 2025-10-16 20:13:28.722966
Title: Closing the Gap Between Text and Speech Understanding in LLMs
Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh
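Abstract summary: Large language models (LLMs) can be adapted to extend their text capabilities to speech inputs. These speech-adapted LLMs consistently underperform their text-based counterparts. The paper introduces SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation).
Reference score (independently computed attention score): 28.538793793887223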
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
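The abstract defines the text-speech understanding gap and names cross-modal distillation, but gives no formulas. The sketch below is one common way to express both ideas, not the authors' implementation. It assumes that the text LLM acts as a teacher on the transcript, that the speech-adapted LLM acts as a student on the spoken input, and that the student is trained to match the teacher's next-token distribution through a KL divergence. All function names (`understanding_gap`, `cross_modal_distillation_loss`) and the toy numbers are hypothetical.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def cross_modal_distillation_loss(teacher_logits, student_logits):
    """Mean KL(teacher || student) over aligned token positions.

    Assumption: teacher_logits come from the text LLM reading the transcript,
    student_logits from the speech-adapted LLM hearing the same utterance,
    and both are already aligned position by position.
    """
    if len(teacher_logits) != len(student_logits):
        raise ValueError("teacher and student must cover the same positions")
    per_position = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(per_position) / len(per_position)


def understanding_gap(text_accuracy, speech_accuracy):
    """Drop from the text LLM on text to the speech-adapted LLM on speech."""
    return text_accuracy - speech_accuracy


if __name__ == "__main__":
    # Toy logits over a 3-token vocabulary at 2 positions (made-up values).
    teacher = [[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]]
    student = [[1.5, 0.7, -0.5], [0.0, 1.0, 0.8]]
    print("distillation loss:", cross_modal_distillation_loss(teacher, student))
    print("gap:", understanding_gap(text_accuracy=0.70, speech_accuracy=0.62))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is why this style of objective targets cross-modal misalignment directly. Whether SALAD uses this exact divergence, direction, or alignment scheme is not stated in the abstract.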

Related papers

Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models [12.263637152835713]
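End-to-end Large Speech Language Models (LSLMs) have shown markedly improved conversational generation abilities. The study analyzes both coarse-grained and fine-grained text and speech representations. It finds that representation similarity correlates strongly with the modality gap.
Paper reference translation (metadata) (2025-10-14T03:34:38Z)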
MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
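MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. The work establishes a new paradigm for expressive and efficient end-to-end spoken interaction.
Paper reference translation (metadata) (2025-10-01T04:32:37Z)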
Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data [5.118833405217628]
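Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks, but their fine-tuning remains underexplored. The paper shows how fine-tuning schemes such as text-only, direct mixing, and curriculum learning affect spoken language understanding (SLU). In cross-lingual SLU, combining source-language speech data with target-language text and a minimal amount of target-language speech data enables effective adaptation.
Paper reference translation (metadata) (2025-09-18T19:54:08Z)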
ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
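The paper proposes ProsodyLM, a simple tokenization scheme suited to learning prosody. It finds that ProsodyLM can learn a surprisingly diverse range of prosody processing capabilities through pre-training alone.
Paper reference translation (metadata) (2025-07-27T00:59:01Z)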
TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment [15.899112804399193]
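TESU-LLM is a novel framework that enables training speech-capable language models using only text data. Its key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. Despite being trained only on text, TESU-LLM achieves strong performance on various speech-related benchmarks.
Paper reference translation (metadata) (2025-06-01T09:27:55Z)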
Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation [20.415410280412697]
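This work proposes the Adaptive Inner Speech-Text Alignment (AI-STA) method, which bridges the modality gap by explicitly aligning speech and text representations at selected layers within a large language model (LLM). Experimental results on speech translation tasks show that AI-STA significantly improves the translation performance of large speech-text models (LSMs) over previous state-of-the-art methods.
Paper reference translation (metadata) (2025-03-13T09:54:35Z)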
DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
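Recent end-to-end speech language models (SLMs) have extended the capabilities of large language models (LLMs). The paper proposes a simple yet effective automatic process for generating speech-text pair data. The resulting model shows general capability on speech-related tasks without requiring speech instruction-tuning data.
Paper reference translation (metadata) (2024-09-30T07:01:21Z)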
Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
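The paper studies the use of context information generated by generative large language models (LLMs). It proposes an approach for distilling the generated information during fine-tuning of self-supervised speech models. The approach is evaluated on automatic speech recognition, named entity recognition, and sentiment analysis using the SLUE and Libri-light benchmarks.
Paper reference translation (metadata) (2023-12-15T15:46:02Z)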
BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
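Modality alignment between speech and text remains an open problem. This paper proposes the BLSP approach, which bootstraps language-speech pre-training via behavior alignment of continuation writing. The authors demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
Paper reference translation (metadata) (2023-09-02T11:46:05Z)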

The related papers list is generated automatically from the titles and abstracts of papers on this site.
