Closing the Gap Between Text and Speech Understanding in LLMs
- URL: http://arxiv.org/abs/2510.13632v1
- Date: Wed, 15 Oct 2025 14:57:16 GMT
- Title: Closing the Gap Between Text and Speech Understanding in LLMs
- Authors: Santiago Cuervo, Skyler Seto, Maureen de Seyssel, Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly, Zakaria Aldeneh
- Abstract summary: Large Language Models can be adapted to extend their text capabilities to speech inputs. These speech-adapted LLMs consistently underperform their text-based counterparts. We introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation.
- Score: 28.538793793887223
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) can be adapted to extend their text capabilities to speech inputs. However, these speech-adapted LLMs consistently underperform their text-based counterparts--and even cascaded pipelines--on language understanding tasks. We term this shortfall the text-speech understanding gap: the performance drop observed when a speech-adapted LLM processes spoken inputs relative to when the original text-based LLM processes the equivalent text. Recent approaches to narrowing this gap either rely on large-scale speech synthesis of text corpora, which is costly and heavily dependent on synthetic data, or on large-scale proprietary speech datasets, which are not reproducible. As a result, there remains a need for more data-efficient alternatives for closing the text-speech understanding gap. In this work, we analyze the gap as driven by two factors: (i) forgetting of text capabilities during adaptation, and (ii) cross-modal misalignment between speech and text. Based on this analysis, we introduce SALAD--Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation--which combines cross-modal distillation with targeted synthetic data to improve alignment while mitigating forgetting. Applied to 3B and 7B LLMs, SALAD achieves competitive performance with a strong open-weight model across broad-domain benchmarks in knowledge, language understanding, and reasoning, while training on over an order of magnitude less speech data from public corpora.
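The abstract names cross-modal distillation as SALAD's core ingredient but gives no implementation details. A minimal sketch of one common form of that idea, a frozen text LLM acting as teacher for the speech-adapted student on paired transcript/speech inputs, follows; the function name, tensor shapes, and the assumption that positions are already aligned are all illustrative, not the paper's method.

```python
import torch
import torch.nn.functional as F

def cross_modal_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the text teacher's and the speech student's
    next-token distributions at (assumed) aligned positions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature**2

# Toy usage with random logits standing in for real model outputs.
student = torch.randn(2, 16, 32000, requires_grad=True)  # (batch, seq, vocab)
teacher = torch.randn(2, 16, 32000)                      # frozen text LLM
cross_modal_distillation_loss(student, teacher).backward()
```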
Related papers
- Understanding the Modality Gap: An Empirical Study on the Speech-Text Alignment Mechanism of Large Speech Language Models [12.263637152835713]
End-to-end Large Speech Language Models (LSLMs) have demonstrated impressive conversational generation abilities. We analyze both coarse- and fine-grained text and speech representations. We find that representation similarity is strongly correlated with the modality gap.
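The summary does not specify how representation similarity is measured; a simple coarse-grained variant, cosine similarity between mean-pooled hidden states of a paired utterance and transcript, could be computed as follows (the pooling choice and names are assumptions):

```python
import torch
import torch.nn.functional as F

def modality_similarity(speech_hidden, text_hidden):
    """Cosine similarity between mean-pooled speech and text hidden states
    of the same utterance; one coarse measure of cross-modal alignment.
    speech_hidden: (speech_len, dim); text_hidden: (text_len, dim)."""
    s = F.normalize(speech_hidden.mean(dim=0), dim=-1)
    t = F.normalize(text_hidden.mean(dim=0), dim=-1)
    return torch.dot(s, t).item()

print(modality_similarity(torch.randn(50, 768), torch.randn(12, 768)))
```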
arXiv Detail & Related papers (2025-10-14T03:34:38Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- Exploring Fine-Tuning of Large Audio Language Models for Spoken Language Understanding under Limited Speech data [5.118833405217628]
Large Audio Language Models (LALMs) have emerged as powerful tools for speech-related tasks but remain underexplored for fine-tuning. We show how different fine-tuning schemes, including text-only, direct mixing, and curriculum learning, affect spoken language understanding (SLU). In cross-lingual SLU, combining source-language speech data with target-language text and minimal target-language speech data enables effective adaptation.
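As a toy illustration of the curriculum idea mentioned above (the paper's actual schedule is not given here), a sampler might ramp the probability of drawing speech examples over training:

```python
import random

def curriculum_sample(step, total_steps, text_pool, speech_pool):
    """Toy curriculum: start mostly on text and linearly shift toward speech
    over the first half of training (the ramp and pools are assumptions)."""
    speech_prob = min(1.0, step / (0.5 * total_steps))
    pool = speech_pool if random.random() < speech_prob else text_pool
    return random.choice(pool)

text_pool = ["text-only SLU example"]
speech_pool = ["paired speech SLU example"]
batch = [curriculum_sample(step, 1000, text_pool, speech_pool) for step in range(8)]
```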
arXiv Detail & Related papers (2025-09-18T19:54:08Z)
- ProsodyLM: Uncovering the Emerging Prosody Processing Capabilities in Speech Language Models [70.56468982313834]
We propose ProsodyLM, which introduces a simple tokenization scheme amenable to learning prosody. We find that ProsodyLM can learn surprisingly diverse emerging prosody processing capabilities through pre-training alone.
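The summary does not spell out the tokenization scheme; one hedged guess at what "amenable to learning prosody" could look like is interleaving each word with quantized pitch and duration tags, as in this illustrative sketch (bin sizes and tag names are assumptions):

```python
def prosody_tokenize(words, pitches_hz, durations_s):
    """Illustrative scheme: follow each word with quantized pitch and duration
    tags so a language model can read prosody from plain token sequences."""
    tokens = []
    for word, pitch, dur in zip(words, pitches_hz, durations_s):
        tokens.append(word)
        tokens.append(f"<pitch_{min(int(pitch // 40), 7)}>")  # 8 pitch bins of 40 Hz
        tokens.append(f"<dur_{min(int(dur * 10), 9)}>")       # 100 ms duration bins
    return tokens

print(prosody_tokenize(["hello", "world"], [180.0, 120.0], [0.35, 0.50]))
# ['hello', '<pitch_4>', '<dur_3>', 'world', '<pitch_3>', '<dur_5>']
```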
arXiv Detail & Related papers (2025-07-27T00:59:01Z)
- TESU-LLM: Training Speech-LLMs Without Speech via Unified Encoder Alignment [15.899112804399193]
We present TESU-LLM, a novel framework that enables training speech-capable language models using only text data. Our key insight is to leverage a unified encoder that maps semantically equivalent text and speech inputs to a shared latent space. Despite being trained exclusively on text, TESU-LLM achieves strong performance on various speech-related benchmarks.
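A minimal sketch of the unified-encoder idea: two modality-specific projections into one shared latent space, so that an LLM trained only on text latents can accept speech latents at inference. The architecture, dimensions, and module names below are illustrative, not the paper's design.

```python
import torch
import torch.nn as nn

class UnifiedEncoder(nn.Module):
    """Toy stand-in for an encoder projecting text or speech features into a
    single shared latent space (architecture and sizes are illustrative)."""
    def __init__(self, text_dim=512, speech_dim=256, latent_dim=768):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)
        self.speech_proj = nn.Linear(speech_dim, latent_dim)

    def forward(self, features, modality):
        proj = self.text_proj if modality == "text" else self.speech_proj
        return proj(features)

enc = UnifiedEncoder()
text_latent = enc(torch.randn(4, 10, 512), "text")    # (batch, seq, latent)
speech_latent = enc(torch.randn(4, 40, 256), "speech")
# If the two outputs live in the same space, an LLM trained only on
# text latents can consume speech latents at inference time.
```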
arXiv Detail & Related papers (2025-06-01T09:27:55Z)
- From Alignment to Advancement: Bootstrapping Audio-Language Alignment with Synthetic Data [55.2480439325792]
Audio-aware large language models (ALLMs) have recently made great strides in understanding and processing audio inputs. These models are typically adapted from text-based large language models (LLMs) through additional training on audio-related tasks. We propose a data generation framework that produces contrastive-like training data, designed to enhance ALLMs' ability to differentiate between present and absent sounds.
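One hedged way to realize "contrastive-like training data" for present versus absent sounds is to pair each positive question about a captioned event with a negative question about an event not in the caption; the question template and event vocabulary below are assumptions, not the paper's format.

```python
import random

def make_contrastive_pair(caption_events, all_events):
    """Pair a 'present' question about a captioned sound event with an
    'absent' question about an event not in the caption."""
    present = random.choice(caption_events)
    absent = random.choice([e for e in all_events if e not in caption_events])
    return [
        {"question": f"Is there a {present} in the audio?", "answer": "yes"},
        {"question": f"Is there a {absent} in the audio?", "answer": "no"},
    ]

print(make_contrastive_pair(["dog barking"], ["dog barking", "siren", "rain"]))
```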
arXiv Detail & Related papers (2025-05-26T16:08:41Z)
- Adaptive Inner Speech-Text Alignment for LLM-based Speech Translation [20.415410280412697]
We propose an Adaptive Inner Speech-Text Alignment (AI-STA) method to bridge the modality gap by explicitly aligning speech and text representations at selected layers within large language models (LLMs). Experimental results on speech translation tasks demonstrate that AI-STA significantly improves the translation performance of large speech-text models (LSMs), outperforming previous state-of-the-art approaches.
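The exact AI-STA objective is not reproduced here; a generic layer-selective alignment loss of the kind the summary describes, pooling speech and text hidden states at chosen layers and penalizing their cosine distance, might look like this sketch (layer indices and mean pooling are assumptions):

```python
import torch
import torch.nn.functional as F

def inner_alignment_loss(speech_states, text_states, layers=(8, 16, 24)):
    """Pool speech and text hidden states at chosen layers and penalize
    their cosine distance.
    speech_states, text_states: dicts of layer index -> (seq, dim) tensor."""
    losses = []
    for layer in layers:
        s = speech_states[layer].mean(dim=0)
        t = text_states[layer].mean(dim=0)
        losses.append(1.0 - F.cosine_similarity(s, t, dim=0))
    return torch.stack(losses).mean()

speech = {l: torch.randn(20, 768) for l in (8, 16, 24)}
text = {l: torch.randn(7, 768) for l in (8, 16, 24)}
print(inner_alignment_loss(speech, text))
```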
arXiv Detail & Related papers (2025-03-13T09:54:35Z)
- DeSTA2: Developing Instruction-Following Speech Language Model Without Speech Instruction-Tuning Data [84.01401439030265]
Recent end-to-end speech language models (SLMs) have expanded upon the capabilities of large language models (LLMs). We present a simple yet effective automatic process for creating speech-text pair data. Our model demonstrates general capabilities for speech-related tasks without the need for speech instruction-tuning data.
arXiv Detail & Related papers (2024-09-30T07:01:21Z)
- Towards ASR Robust Spoken Language Understanding Through In-Context Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
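A word confusion network can be flattened into prompt text so the LLM sees ASR alternatives with their posteriors rather than only the top hypothesis; the serialization format below is illustrative, not the paper's exact prompt.

```python
def wcn_to_prompt(confusion_network):
    """Serialize a word confusion network into prompt text.
    confusion_network: list of slots; each slot is a list of (word, prob)."""
    slots = []
    for slot in confusion_network:
        alternatives = " | ".join(f"{word} ({prob:.2f})" for word, prob in slot)
        slots.append(f"[{alternatives}]")
    return " ".join(slots)

wcn = [
    [("flights", 0.70), ("lights", 0.30)],
    [("to", 0.95), ("two", 0.05)],
    [("boston", 0.80), ("austin", 0.20)],
]
print(wcn_to_prompt(wcn))
# [flights (0.70) | lights (0.30)] [to (0.95) | two (0.05)] [boston (0.80) | austin (0.20)]
```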
arXiv Detail & Related papers (2024-01-05T17:58:10Z)
- Generative Context-aware Fine-tuning of Self-supervised Speech Models [54.389711404209415]
We study the use of context information generated by large language models (LLMs).
We propose an approach to distill the generated information during fine-tuning of self-supervised speech models.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks: automatic speech recognition, named entity recognition, and sentiment analysis.
arXiv Detail & Related papers (2023-12-15T15:46:02Z)
- BLSP: Bootstrapping Language-Speech Pre-training via Behavior Alignment of Continuation Writing [35.31866559807704]
Modality alignment between speech and text remains an open problem.
We propose the BLSP approach that bootstraps Language-Speech Pre-training via behavior alignment of continuation writing.
We demonstrate that this straightforward process can extend the capabilities of LLMs to speech, enabling speech recognition, speech translation, spoken language understanding, and speech conversation, even in zero-shot cross-lingual scenarios.
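A rough sketch of the behavior-alignment objective: given speech features as the prefix, the speech adapter is trained so the LLM reproduces the continuation the text LLM wrote for the transcript. The shapes and names below are assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def behavior_alignment_loss(llm_logits, continuation_ids):
    """With speech as the prefix, train the speech adapter so the LLM
    reproduces the continuation written for the paired transcript.
    llm_logits: (seq, vocab) over continuation positions;
    continuation_ids: (seq,) target token ids."""
    return F.cross_entropy(llm_logits, continuation_ids)

logits = torch.randn(12, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (12,))
behavior_alignment_loss(logits, targets).backward()
```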
arXiv Detail & Related papers (2023-09-02T11:46:05Z)