Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
- URL: http://arxiv.org/abs/2511.11139v1
- Date: Fri, 14 Nov 2025 10:15:16 GMT
- Title: Speech-Aware Long Context Pruning and Integration for Contextualized Automatic Speech Recognition
- Authors: Yiming Rong, Yixin Zhang, Ziyi Wang, Deyang Jiang, Yunlong Zhao, Haoran Wu, Shiyu Zhou, Bo Xu
- Abstract summary: We propose a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets.
- Score: 34.35034351903119
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Automatic speech recognition (ASR) systems have achieved remarkable performance in common conditions but often struggle to leverage long-context information in contextualized scenarios that require domain-specific knowledge, such as conference presentations. This challenge arises primarily due to constrained model context windows and the sparsity of relevant information within extensive contextual noise. To solve this, we propose the SAP$^{2}$ method, a novel framework that dynamically prunes and integrates relevant contextual keywords in two stages. Specifically, each stage leverages our proposed Speech-Driven Attention-based Pooling mechanism, enabling efficient compression of context embeddings while preserving speech-salient information. Experimental results demonstrate state-of-the-art performance of SAP$^{2}$ on the SlideSpeech and LibriSpeech datasets, achieving word error rates (WER) of 7.71% and 1.12%, respectively. On SlideSpeech, our method notably reduces biased keyword error rates (B-WER) by 41.1% compared to non-contextual baselines. SAP$^{2}$ also exhibits robust scalability, consistently maintaining performance under extensive contextual input conditions on both datasets.
Related papers
- Covo-Audio Technical Report [61.09708870154148]
Covo-Audio, a 7B-end LALM, directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Covo-Audio-Chat, a dialogue-oriented variant, demonstrates strong spoken conversational abilities.
arXiv Detail & Related papers (2026-02-10T14:31:11Z)
- Closing the Gap Between Text and Speech Understanding in LLMs [28.538793793887223]
Large Language Models can be adapted to extend their text capabilities to speech inputs. These speech-adapted LLMs consistently underperform their text-based counterparts. We introduce SALAD (Sample-efficient Alignment with Learning through Active selection and cross-modal Distillation).
arXiv Detail & Related papers (2025-10-15T14:57:16Z)
- MOSS-Speech: Towards True Speech-to-Speech Models Without Text Guidance [66.74042564585942]
MOSS-Speech is a true speech-to-speech large language model that directly understands and generates speech without relying on text guidance. Our work establishes a new paradigm for expressive and efficient end-to-end speech interaction.
arXiv Detail & Related papers (2025-10-01T04:32:37Z)
- Incorporating Contextual Paralinguistic Understanding in Large Speech-Language Models [19.864555505996112]
We propose two approaches to incorporate contextual paralinguistic information into model training. Our implicit method boosts performance (LLM-judged) by 38.41% on a human-annotated QA benchmark, reaching 46.02% when combined with the explicit approach.
arXiv Detail & Related papers (2025-08-10T10:03:30Z)
- Contextual Speech Extraction: Leveraging Textual History as an Implicit Cue for Target Speech Extraction [50.630431647192054]
This paper investigates a novel approach to Target Speech Extraction (TSE) that relies solely on textual context to extract the target speech. We present three Contextual Speech Extraction (CSE) models and analyze their performance on three datasets.
arXiv Detail & Related papers (2025-03-11T18:26:10Z)
- Double Mixture: Towards Continual Event Detection from Speech [60.33088725100812]
Speech event detection is crucial for multimedia retrieval, involving the tagging of both semantic and acoustic events.
This paper tackles two primary challenges in speech event detection: the continual integration of new events without forgetting previous ones, and the disentanglement of semantic from acoustic events.
We propose a novel method, 'Double Mixture,' which merges speech expertise with robust memory mechanisms to enhance adaptability and prevent forgetting.
arXiv Detail & Related papers (2024-04-20T06:32:00Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks [88.4408774253634]
Spoken language understanding (SLU) tasks have been studied for many decades in the speech research community.
There are not nearly as many SLU task benchmarks, and many of the existing ones use data that is not freely available to all researchers.
Recent work has begun to introduce such benchmarks for several tasks.
arXiv Detail & Related papers (2022-12-20T18:39:59Z)
- Leveraging Acoustic Contextual Representation by Audio-textual Cross-modal Learning for Conversational ASR [25.75615870266786]
We propose an audio-textual cross-modal representation extractor to learn contextual representations directly from preceding speech.
The effectiveness of the proposed approach is validated on several Mandarin conversation corpora.
arXiv Detail & Related papers (2022-07-03T13:32:24Z)
- Two-stage Textual Knowledge Distillation for End-to-End Spoken Language Understanding [18.275646344620387]
This work proposes a two-stage textual knowledge distillation method that matches utterance-level representations and predicted logits of two modalities during pre-training and fine-tuning.
We push the state of the art on the Fluent Speech Commands dataset, achieving 99.7% test accuracy in the full dataset setting and 99.5% in the 10% subset setting.
arXiv Detail & Related papers (2020-10-25T12:36:05Z)
This list is automatically generated from the titles and abstracts of the papers in this site.