GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech
Recognition
- URL: http://arxiv.org/abs/2311.04996v1
- Date: Wed, 8 Nov 2023 19:57:10 GMT
- Title: GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech
Recognition
- Authors: Daniel Galvez and Tim Kaldewey
- Abstract summary: Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines.
We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam decoder compatible with current CTC models.
It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition.
- Score: 1.2680687621338012
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While Connectionist Temporal Classification (CTC) models deliver
state-of-the-art accuracy in automated speech recognition (ASR) pipelines,
their performance has been limited by CPU-based beam search decoding. We
introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam search
decoder compatible with current CTC models. It increases pipeline throughput
and decreases latency, supports streaming inference, and also supports advanced
features like utterance-specific word boosting via on-the-fly composition. We
provide pre-built DLPack-based Python bindings for ease of use with
Python-based machine learning frameworks at
https://github.com/nvidia-riva/riva-asrlib-decoder. We evaluated our decoder
for offline and online scenarios, demonstrating that it is the fastest beam
search decoder for CTC models. In the offline scenario it achieves up to 7
times more throughput than the current state-of-the-art CPU decoder, and in
the online streaming scenario it achieves nearly 8 times lower latency, with
the same or better word error rate.
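To make the decoding approach concrete, below is a minimal, illustrative sketch of time-synchronous beam search over a toy WFST driven by CTC log-posteriors. It is not the paper's GPU implementation and not the riva-asrlib-decoder API; the graph, labels, beam width, and scores are invented for the example.

import math

# Toy WFST: state -> [(input label, output word, next state, arc cost)].
# Arc costs live in the tropical semiring, so they are subtracted from the
# accumulated log-probability score. Label 0 is the CTC blank, modeled here
# as staying in the same WFST state.
WFST = {
    0: [(1, "hi", 1, 0.0)],
    1: [(2, "there", 2, 0.5)],
    2: [],  # final state, no outgoing arcs
}
FINAL_STATES = {2}
BLANK = 0

def beam_search(log_probs, beam=5.0):
    """log_probs: T x V matrix of CTC log-posteriors, one row per frame."""
    hyps = {(0, ()): 0.0}  # (WFST state, emitted words) -> best score
    for frame in log_probs:
        new_hyps = {}
        for (state, words), score in hyps.items():
            candidates = [(state, words, score + frame[BLANK])]
            for ilabel, word, nstate, cost in WFST[state]:
                candidates.append(
                    (nstate, words + (word,), score + frame[ilabel] - cost))
            for nstate, nwords, nscore in candidates:
                key = (nstate, nwords)
                if nscore > new_hyps.get(key, -math.inf):
                    new_hyps[key] = nscore  # keep the best-scoring path
        best = max(new_hyps.values())
        hyps = {k: v for k, v in new_hyps.items() if v >= best - beam}
    finals = {k: v for k, v in hyps.items() if k[0] in FINAL_STATES}
    return max(finals.items(), key=lambda kv: kv[1]) if finals else None

# Three frames of posteriors over {blank, token 1, token 2}.
frames = [[math.log(p) for p in row] for row in
          [[0.1, 0.8, 0.1], [0.2, 0.1, 0.7], [0.9, 0.05, 0.05]]]
print(beam_search(frames))  # -> ((2, ('hi', 'there')), best score)

Conceptually, the utterance-specific word boosting mentioned in the abstract would compose a small boosting WFST with this decoding graph on the fly, lowering the cost of arcs that emit boosted words.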
Related papers
- Shallow Cross-Encoders for Low-Latency Retrieval [69.06104373460597]
Cross-Encoders based on large transformer models (such as BERT or T5) are computationally expensive and allow for scoring only a small number of documents within a reasonably small latency window.
We show that weaker shallow transformer models (i.e., transformers with a limited number of layers) actually perform better than full-scale models when constrained to these practical low-latency settings.
arXiv Detail & Related papers (2024-03-29T15:07:21Z)
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
A hybrid CTC/RNNT architecture uses a shared encoder with both a CTC and an RNNT decoder to boost accuracy and save computation; a toy sketch of the activation-caching idea follows.
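A rough sketch of that caching idea, under invented shapes and a stand-in layer computation (this is not the FastConformer code):

import numpy as np

# Each layer keeps a cache of its most recent activations so that the next
# chunk can attend to (or convolve over) left context without recomputing it.
class CachedLayer:
    def __init__(self, context=4):
        self.context = context  # how many past frames to carry over
        self.cache = None

    def forward(self, chunk):
        x = chunk if self.cache is None else np.concatenate([self.cache, chunk])
        self.cache = x[-self.context:]  # update cache for the next chunk
        out = x * 2.0                   # stand-in for the real layer computation
        return out[-len(chunk):]        # emit only this chunk's frames

layer = CachedLayer(context=4)
stream = np.arange(12, dtype=np.float32).reshape(-1, 1)
outs = [layer.forward(stream[i:i + 3]) for i in range(0, 12, 3)]
print(np.concatenate(outs).ravel())  # identical to processing the full stream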
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- Towards Real-Time Neural Video Codec for Cross-Platform Application Using Calibration Information [17.141950680993617]
Cross-platform computational errors resulting from floating point operations can lead to inaccurate decoding of the bitstream.
The high computational complexity of the encoding and decoding process poses a challenge in achieving real-time performance.
A real-time cross-platform neural video codec is capable of efficiently decoding 720P video bitstreams from other encoding platforms on a consumer-grade GPU.
arXiv Detail & Related papers (2023-09-20T13:01:15Z)
- Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and of pixels within each frame.
We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct the missing information.
Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only 5% accuracy loss.
arXiv Detail & Related papers (2023-05-22T07:47:27Z)
- Fast and parallel decoding for transducer [25.510837666148024]
We introduce a constrained version of transducer loss to learn strictly monotonic alignments between the sequences.
We also improve the standard greedy search and beam search algorithms by limiting the number of symbols that can be emitted per time step.
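A minimal sketch of greedy transducer decoding with such a per-frame emission cap (the joiner below is a stand-in, not the paper's model):

import numpy as np

BLANK = 0

def greedy_decode(enc_frames, joiner, max_symbols=2):
    hyp = []
    for frame in enc_frames:
        emitted = 0
        while emitted < max_symbols:   # cap symbols emitted per time step
            tok = int(np.argmax(joiner(frame, hyp)))
            if tok == BLANK:
                break                  # blank advances to the next frame
            hyp.append(tok)
            emitted += 1
    return hyp

# Stand-in joiner over 3 symbols: emits token 1 once, then blanks forever.
def toy_joiner(frame, hyp):
    logits = np.zeros(3)
    logits[1 if (frame > 0.5 and not hyp) else BLANK] = 1.0
    return logits

print(greedy_decode(np.array([0.9, 0.1, 0.2]), toy_joiner))  # -> [1]

Without the cap, a degenerate joiner that never emits blank could loop forever on one frame; bounding the work per frame is also what makes batched, parallel decoding straightforward.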
arXiv Detail & Related papers (2022-10-31T07:46:10Z)
- Blank Collapse: Compressing CTC emission for the faster decoding [0.30108936184913293]
We propose a method to reduce the amount of computation, resulting in faster beam search decoding.
With this method, we can get up to 78% faster decoding speed than ordinary beam search decoding.
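A simplified sketch of that idea, dropping frames whose blank probability exceeds an assumed threshold before beam search runs (the paper's actual criterion may differ):

import numpy as np

def blank_collapse(probs, blank_id=0, threshold=0.999):
    # Frames dominated by blank barely affect the search result, so skip them.
    return probs[probs[:, blank_id] < threshold]

probs = np.array([
    [0.9995, 0.0003, 0.0002],  # blank-dominated: dropped
    [0.10,   0.85,   0.05],    # kept
    [0.9999, 0.0001, 0.0000],  # blank-dominated: dropped
    [0.20,   0.10,   0.70],    # kept
])
print(blank_collapse(probs).shape)  # (2, 3): beam search sees fewer frames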
arXiv Detail & Related papers (2022-10-31T02:12:51Z)
- PARTIME: Scalable and Parallel Processing Over Time with Deep Neural Networks [68.96484488899901]
We present PARTIME, a library designed to speed up neural networks whenever data is continuously streamed over time.
PARTIME starts processing each data sample at the time in which it becomes available from the stream.
Experiments empirically compare PARTIME with classic non-parallel neural computations in online learning.
arXiv Detail & Related papers (2022-10-17T14:49:14Z)
- Streaming parallel transducer beam search with fast-slow cascaded encoders [23.416682253435837]
Streaming and non-streaming ASR for RNN Transducers can be unified by cascading causal and non-causal encoders.
We propose a novel parallel time-synchronous beam search algorithm for transducers that decodes from fast-slow encoders.
arXiv Detail & Related papers (2022-03-29T17:29:39Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often have a large number of parameters and incur heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z)
- Learning for Video Compression with Recurrent Auto-Encoder and Recurrent Probability Model [164.7489982837475]
This paper proposes a Recurrent Learned Video Compression (RLVC) approach with a Recurrent Auto-Encoder (RAE) and a Recurrent Probability Model (RPM).
The RAE employs recurrent cells in both the encoder and decoder to exploit the temporal correlation among video frames.
Our approach achieves state-of-the-art learned video compression performance in terms of both PSNR and MS-SSIM.
arXiv Detail & Related papers (2020-06-24T08:46:33Z)