SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition
- URL: http://arxiv.org/abs/2601.20890v1
- Date: Wed, 28 Jan 2026 04:50:04 GMT
- Title: SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition
- Authors: Manali Sharma, Riya Naik, Buvaneshwari G,
- Abstract summary: Single-word Automatic Speech Recognition is a challenging task due to the lack of linguistic context.<n>This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection.<n>We evaluate the framework on the Google Speech Commands dataset and a real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions.
- Score: 0.8921166277011348
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Single-word Automatic Speech Recognition (ASR) is a challenging task due to the lack of linguistic context and sensitivity to noise, pronunciation variation, and channel artifacts, especially in low-resource, communication-critical domains such as healthcare and emergency response. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. The system combines denoising and normalization with a hybrid ASR front end (Whisper + Vosk) and a verification layer designed to handle out-of-vocabulary words and degraded audio. The verification layer supports multiple matching strategies, including embedding similarity, edit distance, and LLM-based matching with optional contextual guidance. We evaluate the framework on the Google Speech Commands dataset and a curated real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions. Results show that while the hybrid ASR front end performs well on clean audio, the verification layer significantly improves accuracy on noisy and compressed channels. Context-guided and LLM-based matching yield the largest gains, demonstrating that lightweight verification and context mechanisms can substantially improve single-word ASR robustness without sacrificing latency required for real-time telephony applications.
Related papers
- Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage [66.67531241554546]
End-to-end speech-in speech-out dialogue systems are emerging as a powerful alternative to traditional ASR-LLM-TTS pipelines.<n>We introduce the first approach to extend tool use directly into speech-in speech-out systems.<n>We propose Streaming Retrieval-Augmented Generation (Streaming RAG), a novel framework that reduces user-perceived latency by predicting tool queries in parallel with user speech.
arXiv Detail & Related papers (2025-10-02T14:18:20Z) - Index-MSR: A high-efficiency multimodal fusion framework for speech recognition [7.677016652056559]
Index-MSR is an efficient multimodal speech recognition framework.<n>MFD effectively incorporates text-related information from videos into the speech recognition.<n>We show that Index-MSR achieves sota accuracy, with substitution errors reduced by 20,50%.
arXiv Detail & Related papers (2025-09-26T03:47:15Z) - WavRAG: Audio-Integrated Retrieval Augmented Generation for Spoken Dialogue Models [49.725968706743586]
WavRAG is the first retrieval augmented generation framework with native, end-to-end audio support.<n>We propose the WavRetriever to facilitate the retrieval from a text-audio hybrid knowledge base.<n>In comparison to state-of-the-art ASR-Text RAG pipelines, WavRAG achieves comparable retrieval performance while delivering a 10x acceleration.
arXiv Detail & Related papers (2025-02-20T16:54:07Z) - Towards ASR Robust Spoken Language Understanding Through In-Context
Learning With Word Confusion Networks [68.79880423713597]
We introduce a method that utilizes the ASR system's lattice output instead of relying solely on the top hypothesis.
Our in-context learning experiments, covering spoken question answering and intent classification, underline the LLM's resilience to noisy speech transcripts.
arXiv Detail & Related papers (2024-01-05T17:58:10Z) - Speech enhancement with frequency domain auto-regressive modeling [34.55703785405481]
Speech applications in far-field real world settings often deal with signals that are corrupted by reverberation.
We propose a unified framework of speech dereverberation for improving the speech quality and the automatic speech recognition (ASR) performance.
arXiv Detail & Related papers (2023-09-24T03:25:51Z) - Exploring the Integration of Speech Separation and Recognition with
Self-Supervised Learning Representation [83.36685075570232]
This work provides an insightful investigation of speech separation in reverberant and noisy-reverberant scenarios as an ASR front-end.
We explore multi-channel separation methods, mask-based beamforming and complex spectral mapping, as well as the best features to use in the ASR back-end model.
A proposed integration using TF-GridNet-based complex spectral mapping and WavLM-based SSLR achieves a 2.5% word error rate in reverberant WHAMR! test set.
arXiv Detail & Related papers (2023-07-23T05:39:39Z) - Topic Identification For Spontaneous Speech: Enriching Audio Features
With Embedded Linguistic Information [10.698093106994804]
Traditional topic identification solutions from audio rely on an automatic speech recognition system (ASR) to produce transcripts.
We compare audio-only and hybrid techniques of jointly utilising text and audio features.
The models evaluated on spontaneous Finnish speech demonstrate that purely audio-based solutions are a viable option when ASR components are not available.
arXiv Detail & Related papers (2023-07-21T09:30:46Z) - Audio-visual End-to-end Multi-channel Speech Separation, Dereverberation
and Recognition [52.11964238935099]
An audio-visual multi-channel speech separation, dereverberation and recognition approach is proposed in this paper.
Video input is consistently demonstrated in mask-based MVDR speech separation, DNN-WPE or spectral mapping (SpecM) based speech dereverberation front-end.
Experiments were conducted on the mixture overlapped and reverberant speech data constructed using simulation or replay of the Oxford LRS2 dataset.
arXiv Detail & Related papers (2023-07-06T10:50:46Z) - Leveraging Modality-specific Representations for Audio-visual Speech
Recognition via Reinforcement Learning [25.743503223389784]
We propose a reinforcement learning (RL) based framework called MSRL.
We customize a reward function directly related to task-specific metrics.
Experimental results on the LRS3 dataset show that the proposed method achieves state-of-the-art in both clean and various noisy conditions.
arXiv Detail & Related papers (2022-12-10T14:01:54Z) - Deliberation Model for On-Device Spoken Language Understanding [69.5587671262691]
We propose a novel deliberation-based approach to end-to-end (E2E) spoken language understanding (SLU)
We show that our approach can significantly reduce the degradation when moving from natural speech to synthetic speech training.
arXiv Detail & Related papers (2022-04-04T23:48:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.