Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition
- URL: http://arxiv.org/abs/2501.03257v1
- Date: Wed, 01 Jan 2025 12:20:07 GMT
- Title: Breaking Through the Spike: Spike Window Decoding for Accelerated and Precise Automatic Speech Recognition
- Authors: Wei Zhang, Tian-Hao Zhang, Chao Luo, Hui Zhou, Chao Yang, Xinyuan Qian, Xu-Cheng Yin
- Abstract summary: The Spike Window Decoding algorithm greatly improves the inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output.
Our method achieves SOTA recognition accuracy with significantly accelerated decoding speed, demonstrated on both AISHELL-1 and large-scale In-House datasets.
- Score: 26.665132884613477
- Abstract: Recently, end-to-end automatic speech recognition has become the mainstream approach in both industry and academia. To optimize system performance in specific scenarios, the Weighted Finite-State Transducer (WFST) is extensively used to integrate acoustic and language models, leveraging its capacity to implicitly fuse language models within static graphs, thereby ensuring robust recognition while also facilitating rapid error correction. However, WFST necessitates a frame-by-frame search of CTC posterior probabilities through autoregression, which significantly hampers inference speed. In this work, we thoroughly investigate the spike property of CTC outputs and further propose the conjecture that frames adjacent to non-blank spikes carry semantic information beneficial to the model. Building on this, we propose the Spike Window Decoding algorithm, which greatly improves the inference speed by making the number of frames decoded in WFST linearly related to the number of spiking frames in the CTC output, while guaranteeing the recognition performance. Our method achieves SOTA recognition accuracy with significantly accelerated decoding speed, demonstrated on both AISHELL-1 and large-scale In-House datasets, establishing a pioneering approach for integrating CTC output with WFST.
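The frame-selection idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the spike criterion or the window size, so everything here is an assumption made for illustration: the hypothetical function `spike_window_frames` treats a frame as a spike when its most probable symbol is not the blank, and keeps `window` neighbouring frames on each side. Only the kept frames would then be passed to the WFST search, so their count is at most `(2 * window + 1)` times the number of spikes.

```python
def spike_window_frames(posteriors, blank_id=0, window=1):
    """Return sorted indices of frames to keep for decoding.

    posteriors: list of per-frame probability lists (frames x symbols).
    blank_id:   index of the CTC blank symbol (assumed to be 0 here).
    window:     neighbouring frames kept on each side of a spike
                (an assumed parameter, not taken from the paper).
    """
    num_frames = len(posteriors)
    keep = set()
    for t, frame in enumerate(posteriors):
        # Assumed spike criterion: the top-scoring symbol is non-blank.
        best = max(range(len(frame)), key=frame.__getitem__)
        if best != blank_id:
            lo = max(0, t - window)
            hi = min(num_frames - 1, t + window)
            keep.update(range(lo, hi + 1))
    return sorted(keep)


# Toy CTC output: 8 frames, 3 symbols, symbol 0 is the blank.
toy = [[0.9, 0.05, 0.05] for _ in range(8)]
toy[2] = [0.1, 0.8, 0.1]  # spike for symbol 1
toy[6] = [0.1, 0.1, 0.8]  # spike for symbol 2

kept = spike_window_frames(toy, blank_id=0, window=1)
print(kept)  # [1, 2, 3, 5, 6, 7]
```

In this toy case two spikes keep six of the eight frames; on real CTC output, where blank frames dominate, the reduction would typically be larger.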
Related papers
- Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition [20.052245837954175]
We propose an efficient and accurate streaming speech recognition model based on the FastConformer architecture.
We introduce an activation caching mechanism to enable the non-autoregressive encoder to operate autoregressively during inference.
The model uses a hybrid CTC/RNNT architecture, in which a shared encoder feeds both a CTC decoder and an RNNT decoder, to boost accuracy and save computation.
arXiv Detail & Related papers (2023-12-27T21:04:26Z)
- Key Frame Mechanism For Efficient Conformer Based End-to-end Speech Recognition [9.803556181225193]
Conformer as a backbone network for end-to-end automatic speech recognition has achieved state-of-the-art performance.
However, the Conformer-based model encounters an issue with the self-attention mechanism.
We introduce the key frame-based self-attention (KFSA) mechanism, a novel method to reduce the computation of the self-attention mechanism using key frames.
arXiv Detail & Related papers (2023-10-23T13:55:49Z)
- Dynamic Frame Interpolation in Wavelet Domain [57.25341639095404]
Video frame interpolation is an important low-level computer vision task, which can increase frame rate for a more fluent visual experience.
Existing methods have achieved great success by employing advanced motion models and synthesis networks.
WaveletVFI can reduce computation by up to 40% while maintaining similar accuracy, making it more efficient than other state-of-the-art methods.
arXiv Detail & Related papers (2023-09-07T06:41:15Z)
- Spatiotemporal Attention-based Semantic Compression for Real-time Video Recognition [117.98023585449808]
We propose a spatiotemporal attention-based autoencoder (STAE) architecture to evaluate the importance of frames and pixels in each frame.
We develop a lightweight decoder that leverages a combined 3D-2D CNN to reconstruct missing information.
Experimental results show that ViT_STAE can compress the video dataset HMDB51 by 104x with only 5% accuracy loss.
arXiv Detail & Related papers (2023-05-22T07:47:27Z)
- An Investigation of Enhancing CTC Model for Triggered Attention-based Streaming ASR [19.668440671541546]
An attempt is made to combine Mask-CTC and the triggered attention mechanism to construct a streaming end-to-end automatic speech recognition (ASR) system.
The proposed method achieves higher accuracy with lower latency than the conventional triggered attention-based streaming ASR system.
arXiv Detail & Related papers (2021-10-20T06:44:58Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast multi-decoder (MD) model that generates hidden intermediates (HI) by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding speed than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- WNARS: WFST based Non-autoregressive Streaming End-to-End Speech Recognition [59.975078145303605]
We propose a novel framework, namely WNARS, using hybrid CTC-attention AED models and weighted finite-state transducers.
On the AISHELL-1 task, our WNARS achieves a character error rate of 5.22% with 640ms latency, which, to the best of our knowledge, is state-of-the-art performance for online ASR.
arXiv Detail & Related papers (2021-04-08T07:56:03Z)
- Alignment Knowledge Distillation for Online Streaming Attention-based Speech Recognition [46.69852287267763]
This article describes an efficient training method for online streaming attention-based encoder-decoder (AED) automatic speech recognition (ASR) systems.
The proposed method significantly reduces recognition errors and emission latency simultaneously.
The best MoChA system shows performance comparable to that of RNN-transducer (RNN-T).
arXiv Detail & Related papers (2021-02-28T08:17:38Z)
- Improved Mask-CTC for Non-Autoregressive End-to-End ASR [49.192579824582694]
Mask-CTC is a recently proposed end-to-end ASR system based on mask-predict with connectionist temporal classification (CTC).
We propose to enhance the network architecture by employing a recently proposed architecture called Conformer.
Next, we propose new training and decoding methods by introducing an auxiliary objective to predict the length of a partial target sequence.
arXiv Detail & Related papers (2020-10-26T01:22:35Z)
- Boosting Continuous Sign Language Recognition via Cross Modality Augmentation [135.30357113518127]
Continuous sign language recognition deals with unaligned video-text pairs.
We propose a novel architecture with cross modality augmentation.
The proposed framework can be easily extended to other existing CTC based continuous SLR architectures.
arXiv Detail & Related papers (2020-10-11T15:07:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.