TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
- URL: http://arxiv.org/abs/2508.07014v2
- Date: Tue, 12 Aug 2025 14:25:57 GMT
- Title: TurboBias: Universal ASR Context-Biasing powered by GPU-accelerated Phrase-Boosting Tree
- Authors: Andrei Andrusenko, Vladimir Bataev, Lilit Grigoryan, Vitaly Lavrukhin, Boris Ginsburg,
- Abstract summary: This paper proposes a universal context-biasing framework for Automatic Speech Recognition (ASR)<n>The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding.<n>The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches.
- Score: 17.16475665648591
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recognizing specific key phrases is an essential task for contextualized Automatic Speech Recognition (ASR). However, most existing context-biasing approaches have limitations associated with the necessity of additional model training, significantly slow down the decoding process, or constrain the choice of the ASR system type. This paper proposes a universal ASR context-biasing framework that supports all major types: CTC, Transducers, and Attention Encoder-Decoder models. The framework is based on a GPU-accelerated word boosting tree, which enables it to be used in shallow fusion mode for greedy and beam search decoding without noticeable speed degradation, even with a vast number of key phrases (up to 20K items). The obtained results showed high efficiency of the proposed method, surpassing the considered open-source context-biasing approaches in accuracy and decoding speed. Our context-biasing framework is open-sourced as a part of the NeMo toolkit.
Related papers
- SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition [0.8921166277011348]
Single-word Automatic Speech Recognition is a challenging task due to the lack of linguistic context.<n>This paper reviews recent deep learning approaches and proposes a modular framework for robust single-word detection.<n>We evaluate the framework on the Google Speech Commands dataset and a real-world dataset collected from telephony and messaging platforms under bandwidth-limited conditions.
arXiv Detail & Related papers (2026-01-28T04:50:04Z) - Pushing the Limits of Beam Search Decoding for Transducer-based ASR models [18.41716157723428]
beam search significantly slows down Transducers due to repeated evaluations of key network components.<n>This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++.
arXiv Detail & Related papers (2025-05-30T19:42:48Z) - NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference.<n>Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead.<n>The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
arXiv Detail & Related papers (2024-06-11T09:37:52Z) - Context-aware Fine-tuning of Self-supervised Speech Models [56.95389222319555]
We study the use of context, i.e., surrounding segments, during fine-tuning.
We propose a new approach called context-aware fine-tuning.
We evaluate the proposed approach using the SLUE and Libri-light benchmarks for several downstream tasks.
arXiv Detail & Related papers (2022-12-16T15:46:15Z) - Learning Commonsense-aware Moment-Text Alignment for Fast Video Temporal
Grounding [78.71529237748018]
Grounding temporal video segments described in natural language queries effectively and efficiently is a crucial capability needed in vision-and-language fields.
Most existing approaches adopt elaborately designed cross-modal interaction modules to improve the grounding performance.
We propose a commonsense-aware cross-modal alignment framework, which incorporates commonsense-guided visual and text representations into a complementary common space.
arXiv Detail & Related papers (2022-04-04T13:07:05Z) - Neural Vocoder is All You Need for Speech Super-resolution [56.84715616516612]
Speech super-resolution (SR) is a task to increase speech sampling rate by generating high-frequency components.
Existing speech SR methods are trained in constrained experimental settings, such as a fixed upsampling ratio.
We propose a neural vocoder based speech super-resolution method (NVSR) that can handle a variety of input resolution and upsampling ratios.
arXiv Detail & Related papers (2022-03-28T17:51:00Z) - End-to-end contextual asr based on posterior distribution adaptation for
hybrid ctc/attention system [61.148549738631814]
End-to-end (E2E) speech recognition architectures assemble all components of traditional speech recognition system into a single model.
Although it simplifies ASR system, it introduces contextual ASR drawback: the E2E model has worse performance on utterances containing infrequent proper nouns.
We propose to add a contextual bias attention (CBA) module to attention based encoder decoder (AED) model to improve its ability of recognizing the contextual phrases.
arXiv Detail & Related papers (2022-02-18T03:26:02Z) - Tree-constrained Pointer Generator for End-to-end Contextual Speech
Recognition [16.160767678589895]
TCPGen is proposed that incorporates such knowledge as a list of biasing words into both attention-based encoder-decoder and transducer end-to-end ASR models.
TCPGen structures the biasing words into an efficient prefix tree to serve as its symbolic input and creates a neural shortcut to facilitate recognising biasing words during decoding.
arXiv Detail & Related papers (2021-09-01T21:41:59Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.