Related papers: TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree

URL: http://arxiv.org/abs/2508.07014v2
Date: Tue, 12 Aug 2025 14:25:57 GMT
Title: TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
Authors: Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg
Abstract summary: This paper proposes a universal context-biasing framework for Automatic Speech Recognition (ASR). The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding. The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches.
Score: 17.16475665648591
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches have limitations associated with the necessity of additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
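
To make the idea of a boosting tree in shallow fusion more concrete, here is a minimal CPU sketch in Python. It is not the NeMo implementation and not the GPU data structure from the paper: the class `BoostingTrie`, the function `greedy_decode`, the `boost` and `weight` parameters, and the toy token ids are all hypothetical names chosen for illustration. It assumes each row of `log_probs` is a next-token log-probability vector, ignores CTC blanks and Transducer specifics, and does not revoke bonuses when a partial phrase match fails.

```python
import math
from typing import Dict, List, Sequence, Tuple


class BoostingTrie:
    """Toy prefix tree over token ids (illustrative only, CPU-based)."""

    def __init__(self, phrases: Sequence[Sequence[int]], boost: float = 1.0):
        self.children: List[Dict[int, int]] = [{}]  # node 0 is the root
        self.boost = boost
        for phrase in phrases:
            node = 0
            for tok in phrase:
                nxt = self.children[node].get(tok)
                if nxt is None:
                    nxt = len(self.children)
                    self.children.append({})
                    self.children[node][tok] = nxt
                node = nxt

    def step(self, node: int, tok: int) -> Tuple[int, float]:
        """Return (next_node, bonus) for emitting `tok` from `node`."""
        nxt = self.children[node].get(tok)
        if nxt is None:
            # No continuation: retry the token as the start of a phrase.
            nxt = self.children[0].get(tok, 0)
        bonus = self.boost if nxt != 0 else 0.0
        return nxt, bonus


def greedy_decode(
    log_probs: Sequence[Sequence[float]],
    trie: BoostingTrie,
    weight: float = 1.0,
) -> List[int]:
    """Greedy decoding with shallow fusion: model score + weighted bonus."""
    node, output = 0, []
    for frame in log_probs:
        best_tok, best_score, best_node = -1, -math.inf, 0
        for tok, lp in enumerate(frame):
            nxt, bonus = trie.step(node, tok)
            score = lp + weight * bonus
            if score > best_score:
                best_tok, best_score, best_node = tok, score, nxt
        output.append(best_tok)
        node = best_node
    return output


# Two decoding steps over a 4-token vocabulary.
frames = [
    [math.log(p) for p in (0.10, 0.70, 0.10, 0.10)],
    [math.log(p) for p in (0.10, 0.10, 0.35, 0.45)],
]
plain = greedy_decode(frames, BoostingTrie([]))                  # [1, 3]
biased = greedy_decode(frames, BoostingTrie([[1, 2]], boost=1.0))  # [1, 2]
```

In this toy run the model alone prefers token 3 at the second step, while the bonus for continuing the key phrase `[1, 2]` tips the choice to token 2. A production system would operate on subword units, handle the decoder-specific alignment, and batch the tree lookups on the GPU, none of which is shown here.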

Related papers

SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition [0.8921166277011348]
Single-word Automatic Speech Recognition is a challenging task due to the lack of linguistic context. This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection. We evaluate the framework on the Google Speech Commands dataset and a real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions.
arXiv Detail & Related papers (2026-01-28T04:50:04Z)
Pushing the Limits of Beam Search Decoding for Transducer-based ASR models [18.41716157723428]
Beam search significantly slows down Transducers due to repeated evaluations of key network components. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++.
arXiv Detail & Related papers (2025-05-30T19:42:48Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter. The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates. The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning. We propose a new approach called context-aware fine-tuning. We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z)
Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields. Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance. We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z)
Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components. Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio. We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolution and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z)
End-to-end contextual asr based on posterior distribution adaptation for hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of a traditional speech recognition system into a single model. Although this simplifies the ASR system, it introduces a drawback for contextual ASR: the E2E model performs worse on utterances containing infrequent proper nouns. We propose to add a contextual bias attention (CBA) module to the attention-based encoder-decoder (AED) model to improve its ability to recognize contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z)
Tree-constrained Pointer Generator for End-to-end Contextual Speech Recognition [16.160767678589895]
TCPGen is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models. TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut to facilitate recognising biasing words during decoding.
arXiv Detail & Related papers (2021-09-01T21:41:59Z)

This list is automatically generated from the titles and abstracts of the papers on this site.