Blank Collapse: Compressing CTC emission for the faster decoding
- URL: http://arxiv.org/abs/2210.17017v2
- Date: Tue, 27 Jun 2023 00:39:38 GMT
- Title: Blank Collapse: Compressing CTC emission for the faster decoding
- Authors: Minkyu Jung, Ohhyeok Kwon, Seunghyun Seo, Soonshin Seo
- Abstract summary: We propose a method that reduces the amount of computation in CTC beam search, resulting in faster decoding.
With this method, we obtain up to 78% faster decoding than ordinary beam search decoding.
- Score: 0.30108936184913293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Connectionist Temporal Classification (CTC) model is a very efficient
method for modeling sequences, especially speech data. To use a CTC model for an
Automatic Speech Recognition (ASR) task, beam search decoding with an external
language model such as an n-gram LM is necessary to obtain reasonable results.
In this paper we analyze the blank label in CTC beam search in depth and propose
a very simple method that reduces the amount of computation, resulting in faster
beam search decoding. With this method, we obtain up to 78% faster decoding than
ordinary beam search decoding with very little loss of accuracy on the
LibriSpeech datasets. We show this method is effective not only practically, by
experiments, but also theoretically, by mathematical reasoning. We also observe
that this reduction is more pronounced when the accuracy of the model is higher.
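
The core operation is simple enough to sketch. Below is a minimal illustration, in the spirit of the abstract, of dropping blank-dominated frames from a CTC emission before beam search; the function name, the keep-endpoints rule, and the 0.999 threshold are illustrative choices for this sketch, not the paper's exact collapse criterion.

```python
import numpy as np

def blank_collapse(log_probs: np.ndarray,
                   blank_id: int = 0,
                   threshold: float = 0.999) -> np.ndarray:
    """Toy sketch of blank collapse for a CTC emission.

    log_probs: (T, V) per-frame log-probabilities from a CTC model.
    Frames whose blank probability exceeds `threshold` contribute almost
    nothing to beam search; within each run of such frames we keep only
    the endpoints and drop the interior, shrinking T before the
    (expensive) beam search runs.
    """
    blanky = np.exp(log_probs[:, blank_id]) > threshold
    keep = np.ones(len(blanky), dtype=bool)
    t = 0
    while t < len(blanky):
        if blanky[t]:
            end = t
            while end + 1 < len(blanky) and blanky[end + 1]:
                end += 1
            keep[t + 1:end] = False  # drop interior frames of the run
            t = end + 1
        else:
            t += 1
    return log_probs[keep]
```

The shortened emission is then handed to an ordinary beam search decoder (hypothetically, `beam_search(blank_collapse(emission), lm)`), which iterates over far fewer frames; since blank-dominated frames barely change hypothesis scores, the accuracy loss stays small.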
Related papers
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
The paper introduces Positional Integrity Encoding (PIE).
Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
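
As a rough illustration of matching CTC log-probabilities against a set of context phrases, the toy spotter below greedily aligns each phrase starting at every frame, letting blank frames sit between phrase tokens; the alignment rule and the `score_floor` parameter are simplifications invented for this sketch, not the paper's word spotter or its context-graph representation.

```python
import numpy as np

def spot_phrases(log_probs: np.ndarray, phrases, blank_id=0, score_floor=-2.0):
    """Toy context-phrase spotter over a (T, V) CTC log-prob matrix.

    phrases: list of token-id sequences to bias toward.  Returns
    (phrase_index, start_frame, end_frame, avg_logprob) hits whose
    average per-frame log-prob clears `score_floor`.
    """
    T = log_probs.shape[0]
    hits = []
    for p_idx, phrase in enumerate(phrases):
        for start in range(T):
            t, score, ok = start, 0.0, True
            for tok in phrase:
                # consume frames where blank outscores the target token
                while t < T and log_probs[t, blank_id] > log_probs[t, tok]:
                    score += log_probs[t, blank_id]
                    t += 1
                if t == T:          # ran out of frames mid-phrase
                    ok = False
                    break
                score += log_probs[t, tok]
                t += 1
            if ok and score / (t - start) > score_floor:
                hits.append((p_idx, start, t, score / (t - start)))
    return hits
```

Detected hits can then be merged into the main CTC or transducer hypotheses as biasing candidates.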
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition [1.2680687621338012]
Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines.
We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam decoder compatible with current CTC models.
It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition.
arXiv Detail & Related papers (2023-11-08T19:57:10Z)
- A Token-Wise Beam Search Algorithm for RNN-T [3.682821163882332]
We present a beam search decoding algorithm that batches the joint network calls across a segment of time steps.
In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output.
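
A hedged sketch of what segment-batched joint calls might look like; the `joiner` callable (mapping encoder and prediction-network vectors to vocabulary logits) and the segment size are assumptions made for illustration, not the paper's implementation.

```python
import torch

def score_segment(joiner, enc_frames: torch.Tensor, dec_out: torch.Tensor,
                  segment: int = 8) -> torch.Tensor:
    """Score a run of encoder frames against one prediction-network state.

    enc_frames: (T, D_enc) encoder outputs; dec_out: (D_dec,) state of
    the current hypothesis.  Rather than one joint call per time step,
    each segment of frames is scored in a single batched call.
    """
    scores = []
    for s in range(0, enc_frames.size(0), segment):
        chunk = enc_frames[s:s + segment]                      # (seg, D_enc)
        dec = dec_out.unsqueeze(0).expand(chunk.size(0), -1)   # (seg, D_dec)
        logits = joiner(chunk, dec)                            # one batched call
        scores.append(torch.log_softmax(logits, dim=-1))
    return torch.cat(scores, dim=0)                            # (T, V)
```

Batching amortizes per-call overhead, and aggregating the resulting emission probabilities over a segment is what the summary describes as a better approximation of the most likely model output.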
arXiv Detail & Related papers (2023-02-28T07:20:49Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25×.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition [22.61761934996406]
We propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder.
With a beam width of 4, the decoding budget on LibriSpeech can be reduced by up to 20%.
The word error rate (WER) is reduced by 6% relative at the beam width of 1 and by 3% relative at the beam width of 4.
arXiv Detail & Related papers (2022-04-08T07:24:00Z)
- Cascaded Fast and Slow Models for Efficient Semantic Code Search [46.53530668938728]
We propose an efficient and accurate semantic code search framework with cascaded fast and slow models.
The proposed cascaded approach is not only efficient and scalable, but also achieves state-of-the-art results.
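
The fast/slow cascade is a common retrieval pattern and can be sketched briefly: a cheap vector similarity shortlists candidates, then an expensive scorer reranks only that shortlist. The `slow_scorer` callable, the dot-product retrieval, and the shortlist sizes below are illustrative assumptions, not the paper's exact models.

```python
import numpy as np

def cascaded_search(query, query_vec, docs, doc_vecs, slow_scorer,
                    k_fast=50, k_final=5):
    """Two-stage code search: fast shortlist, slow rerank.

    doc_vecs: (N, D) precomputed code embeddings; query_vec: (D,).
    slow_scorer(query, doc) -> float is the expensive relevance model
    (e.g., a cross-encoder) applied only to the shortlist.
    """
    sims = doc_vecs @ query_vec                 # cheap similarity, O(N*D)
    shortlist = np.argsort(-sims)[:k_fast]      # top-k by fast score
    reranked = sorted(shortlist,
                      key=lambda i: slow_scorer(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:k_final]]
```

The design keeps the expensive model off the critical path for all but k_fast candidates, which is why a cascade can scale while the final ranking still reflects the slow model's judgment.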
arXiv Detail & Related papers (2021-10-15T02:23:35Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
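
One way to picture the alignment objective: penalize per-frame disagreement between the transducer's blank posterior and an auxiliary CTC head's blank posterior, so blank frames become predictable enough to skip cheaply at inference. The squared-error form below is an illustrative stand-in, not necessarily the paper's exact regularizer.

```python
import torch

def blank_alignment_loss(trans_blank_prob: torch.Tensor,
                         ctc_blank_prob: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between two per-frame blank posteriors.

    Both inputs have shape (T,): one from the transducer path, one from
    a CTC head over the same encoder.  Driving them together aligns the
    transducer's blank positions with CTC's.
    """
    return ((trans_blank_prob - ctc_blank_prob) ** 2).mean()
```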
arXiv Detail & Related papers (2021-04-07T03:15:10Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
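
The mechanism admits a short sketch: compute the CTC loss not only at the final encoder layer but also at an intermediate one, and interpolate the two. A minimal sketch assuming a shared output projection `proj`; the intermediate layer index and the 0.3 weight are illustrative hyperparameters, not the paper's prescribed values.

```python
import torch.nn.functional as F

def intermediate_ctc_loss(layer_outputs, proj, targets, input_lens, target_lens,
                          inter_layer=6, weight=0.3):
    """CTC loss at the final layer plus an intermediate auxiliary CTC loss.

    layer_outputs: list of per-layer encoder outputs, each (T, B, D);
    proj maps hidden states to vocabulary logits.
    """
    def ctc(hidden):
        log_probs = F.log_softmax(proj(hidden), dim=-1)   # (T, B, V)
        return F.ctc_loss(log_probs, targets, input_lens, target_lens)

    return (1 - weight) * ctc(layer_outputs[-1]) \
        + weight * ctc(layer_outputs[inter_layer])
```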
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- End-to-end Sinkhorn Autoencoder with Noise Generator [10.008055997630304]
We propose a novel end-to-end Sinkhorn autoencoder with a noise generator for efficient data collection simulation.
Our method outperforms competing approaches on a challenging dataset of simulation data from the Zero Degree Calorimeters of the ALICE experiment at the LHC.
arXiv Detail & Related papers (2020-06-11T18:04:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.