Blank Collapse: Compressing CTC emission for the faster decoding
- URL: http://arxiv.org/abs/2210.17017v2
- Date: Tue, 27 Jun 2023 00:39:38 GMT
- Title: Blank Collapse: Compressing CTC emission for the faster decoding
- Authors: Minkyu Jung, Ohhyeok Kwon, Seunghyun Seo, Soonshin Seo
- Abstract summary: We propose a method that reduces the amount of computation in CTC beam search, resulting in faster decoding.
With this method, we obtain up to 78% faster decoding than ordinary beam search decoding.
- Score: 0.30108936184913293
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The Connectionist Temporal Classification (CTC) model is a very efficient
method for modeling sequences, especially speech data. To use a CTC model for an
Automatic Speech Recognition (ASR) task, beam search decoding with an external
language model such as an n-gram LM is necessary to obtain reasonable results.
In this paper we analyze the blank label in CTC beam search in depth and propose
a very simple method that reduces the amount of computation, resulting in faster
beam search decoding. With this method, we obtain up to 78% faster decoding than
ordinary beam search decoding with very little loss of accuracy on the
LibriSpeech datasets. We show this method is effective not only practically, by
experiments, but also theoretically, by mathematical reasoning. We also observe
that this reduction is more pronounced when the accuracy of the model is higher.
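
The core operation is simple enough to sketch. Below is a minimal illustration, in the spirit of the abstract, of dropping blank-dominated frames from a CTC emission before beam search; the function name, the keep-endpoints rule, and the 0.999 threshold are illustrative choices for this sketch, not the paper's exact collapse criterion.

```python
import numpy as np

def blank_collapse(log_probs: np.ndarray,
                   blank_id: int = 0,
                   threshold: float = 0.999) -> np.ndarray:
    """Toy sketch of blank collapse for a CTC emission.

    log_probs: (T, V) per-frame log-probabilities from a CTC model.
    Frames whose blank probability exceeds `threshold` contribute almost
    nothing to beam search; within each run of such frames we keep only
    the endpoints and drop the interior, shrinking T before the
    (expensive) beam search runs.
    """
    blanky = np.exp(log_probs[:, blank_id]) > threshold
    keep = np.ones(len(blanky), dtype=bool)
    t = 0
    while t < len(blanky):
        if blanky[t]:
            end = t
            while end + 1 < len(blanky) and blanky[end + 1]:
                end += 1
            keep[t + 1:end] = False  # drop interior frames of the run
            t = end + 1
        else:
            t += 1
    return log_probs[keep]
```

The shortened emission is then handed to an ordinary beam search decoder (hypothetically, `beam_search(blank_collapse(emission), lm)`), which iterates over far fewer frames; since blank-dominated frames barely change hypothesis scores, the accuracy loss stays small.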
Related papers
- Let the Code LLM Edit Itself When You Edit the Code [50.46536185784169]
The paper introduces Positional Integrity Encoding (PIE).
Results demonstrate that PIE reduces computational overhead by over 85% compared to the standard full recomputation approach.
arXiv Detail & Related papers (2024-07-03T14:34:03Z)
- Fast Context-Biasing for CTC and Transducer ASR models with CTC-based Word Spotter [57.64003871384959]
This work presents a new approach to fast context-biasing with a CTC-based Word Spotter.
The proposed method matches CTC log-probabilities against a compact context graph to detect potential context-biasing candidates.
The results demonstrate a significant acceleration of the context-biasing recognition with a simultaneous improvement in F-score and WER.
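
As a rough illustration of matching CTC log-probabilities against a set of context phrases, the toy spotter below greedily aligns each phrase starting at every frame, letting blank frames sit between phrase tokens; the alignment rule and the `score_floor` parameter are simplifications invented for this sketch, not the paper's word spotter or its context-graph representation.

```python
import numpy as np

def spot_phrases(log_probs: np.ndarray, phrases, blank_id=0, score_floor=-2.0):
    """Toy context-phrase spotter over a (T, V) CTC log-prob matrix.

    phrases: list of token-id sequences to bias toward.  Returns
    (phrase_index, start_frame, end_frame, avg_logprob) hits whose
    average per-frame log-prob clears `score_floor`.
    """
    T = log_probs.shape[0]
    hits = []
    for p_idx, phrase in enumerate(phrases):
        for start in range(T):
            t, score, ok = start, 0.0, True
            for tok in phrase:
                # consume frames where blank outscores the target token
                while t < T and log_probs[t, blank_id] > log_probs[t, tok]:
                    score += log_probs[t, blank_id]
                    t += 1
                if t == T:          # ran out of frames mid-phrase
                    ok = False
                    break
                score += log_probs[t, tok]
                t += 1
            if ok and score / (t - start) > score_floor:
                hits.append((p_idx, start, t, score / (t - start)))
    return hits
```

Detected hits can then be merged into the main CTC or transducer hypotheses as biasing candidates.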
arXiv Detail & Related papers (2024-06-11T09:37:52Z)
- GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition [1.2680687621338012]
Connectionist Temporal Classification (CTC) models deliver state-of-the-art accuracy in automated speech recognition (ASR) pipelines.
We introduce a GPU-accelerated Weighted Finite State Transducer (WFST) beam decoder compatible with current CTC models.
It increases pipeline throughput and decreases latency, supports streaming inference, and also supports advanced features like utterance-specific word boosting via on-the-fly composition.
arXiv Detail & Related papers (2023-11-08T19:57:10Z)
- A Token-Wise Beam Search Algorithm for RNN-T [3.682821163882332]
We present a beam search decoding algorithm that batches the joint network calls across a segment of time steps.
In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output.
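
A hedged sketch of what segment-batched joint calls might look like; the `joiner` callable (mapping encoder and prediction-network vectors to vocabulary logits) and the segment size are assumptions made for illustration, not the paper's implementation.

```python
import torch

def score_segment(joiner, enc_frames: torch.Tensor, dec_out: torch.Tensor,
                  segment: int = 8) -> torch.Tensor:
    """Score a run of encoder frames against one prediction-network state.

    enc_frames: (T, D_enc) encoder outputs; dec_out: (D_dec,) state of
    the current hypothesis.  Rather than one joint call per time step,
    each segment of frames is scored in a single batched call.
    """
    scores = []
    for s in range(0, enc_frames.size(0), segment):
        chunk = enc_frames[s:s + segment]                      # (seg, D_enc)
        dec = dec_out.unsqueeze(0).expand(chunk.size(0), -1)   # (seg, D_dec)
        logits = joiner(chunk, dec)                            # one batched call
        scores.append(torch.log_softmax(logits, dim=-1))
    return torch.cat(scores, dim=0)                            # (T, V)
```

Batching amortizes per-call overhead, and aggregating the resulting emission probabilities over a segment is what the summary describes as a better approximation of the most likely model output.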
arXiv Detail & Related papers (2023-02-28T07:20:49Z)
- Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25×.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
- Adding Connectionist Temporal Summarization into Conformer to Improve Its Decoder Efficiency For Speech Recognition [22.61761934996406]
We propose a novel connectionist temporal summarization (CTS) method that reduces the number of frames required for the attention decoder.
With a beam width of 4, the decoding budget on LibriSpeech can be reduced by up to 20%.
The word error rate (WER) is reduced by 6% relative at the beam width of 1 and by 3% relative at the beam width of 4.
arXiv Detail & Related papers (2022-04-08T07:24:00Z)
- Cascaded Fast and Slow Models for Efficient Semantic Code Search [46.53530668938728]
We propose an efficient and accurate semantic code search framework with cascaded fast and slow models.
The proposed cascaded approach is not only efficient and scalable, but also achieves state-of-the-art results.
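
The fast/slow cascade is a common retrieval pattern and can be sketched briefly: a cheap vector similarity shortlists candidates, then an expensive scorer reranks only that shortlist. The `slow_scorer` callable, the dot-product retrieval, and the shortlist sizes below are illustrative assumptions, not the paper's exact models.

```python
import numpy as np

def cascaded_search(query, query_vec, docs, doc_vecs, slow_scorer,
                    k_fast=50, k_final=5):
    """Two-stage code search: fast shortlist, slow rerank.

    doc_vecs: (N, D) precomputed code embeddings; query_vec: (D,).
    slow_scorer(query, doc) -> float is the expensive relevance model
    (e.g., a cross-encoder) applied only to the shortlist.
    """
    sims = doc_vecs @ query_vec                 # cheap similarity, O(N*D)
    shortlist = np.argsort(-sims)[:k_fast]      # top-k by fast score
    reranked = sorted(shortlist,
                      key=lambda i: slow_scorer(query, docs[i]),
                      reverse=True)
    return [docs[i] for i in reranked[:k_final]]
```

The design keeps the expensive model off the critical path for all but k_fast candidates, which is why a cascade can scale while the final ranking still reflects the slow model's judgment.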
arXiv Detail & Related papers (2021-10-15T02:23:35Z)
- Fast-MD: Fast Multi-Decoder End-to-End Speech Translation with Non-Autoregressive Hidden Intermediates [59.678108707409606]
We propose Fast-MD, a fast MD model that generates HI by non-autoregressive decoding based on connectionist temporal classification (CTC) outputs followed by an ASR decoder.
Fast-MD achieved about 2x and 4x faster decoding than the naïve MD model on GPU and CPU, respectively, with comparable translation quality.
arXiv Detail & Related papers (2021-09-27T05:21:30Z)
- FSR: Accelerating the Inference Process of Transducer-Based Models by Applying Fast-Skip Regularization [72.9385528828306]
A typical transducer model decodes the output sequence conditioned on the current acoustic state.
The number of blank tokens in the prediction results accounts for nearly 90% of all tokens.
We propose a method named fast-skip regularization, which tries to align the blank position predicted by a transducer with that predicted by a CTC model.
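
One way to picture the alignment objective: penalize per-frame disagreement between the transducer's blank posterior and an auxiliary CTC head's blank posterior, so blank frames become predictable enough to skip cheaply at inference. The squared-error form below is an illustrative stand-in, not necessarily the paper's exact regularizer.

```python
import torch

def blank_alignment_loss(trans_blank_prob: torch.Tensor,
                         ctc_blank_prob: torch.Tensor) -> torch.Tensor:
    """Penalize disagreement between two per-frame blank posteriors.

    Both inputs have shape (T,): one from the transducer path, one from
    a CTC head over the same encoder.  Driving them together aligns the
    transducer's blank positions with CTC's.
    """
    return ((trans_blank_prob - ctc_blank_prob) ** 2).mean()
```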
arXiv Detail & Related papers (2021-04-07T03:15:10Z)
- Intermediate Loss Regularization for CTC-based Speech Recognition [58.33721897180646]
We present a simple and efficient auxiliary loss function for automatic speech recognition (ASR) based on the connectionist temporal classification (CTC) objective.
We evaluate the proposed method on various corpora, reaching a word error rate (WER) of 9.9% on the WSJ corpus and a character error rate (CER) of 5.2% on the AISHELL-1 corpus.
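
The mechanism admits a short sketch: compute the CTC loss not only at the final encoder layer but also at an intermediate one, and interpolate the two. A minimal sketch assuming a shared output projection `proj`; the intermediate layer index and the 0.3 weight are illustrative hyperparameters, not the paper's prescribed values.

```python
import torch.nn.functional as F

def intermediate_ctc_loss(layer_outputs, proj, targets, input_lens, target_lens,
                          inter_layer=6, weight=0.3):
    """CTC loss at the final layer plus an intermediate auxiliary CTC loss.

    layer_outputs: list of per-layer encoder outputs, each (T, B, D);
    proj maps hidden states to vocabulary logits.
    """
    def ctc(hidden):
        log_probs = F.log_softmax(proj(hidden), dim=-1)   # (T, B, V)
        return F.ctc_loss(log_probs, targets, input_lens, target_lens)

    return (1 - weight) * ctc(layer_outputs[-1]) \
        + weight * ctc(layer_outputs[inter_layer])
```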
arXiv Detail & Related papers (2021-02-05T15:01:03Z)
- End-to-end Sinkhorn Autoencoder with Noise Generator [10.008055997630304]
We propose a novel end-to-end Sinkhorn autoencoder with a noise generator for efficient data collection simulation.
Our method outperforms competing approaches on a challenging dataset of simulation data from the Zero Degree Calorimeters of the ALICE experiment at the LHC.
arXiv Detail & Related papers (2020-06-11T18:04:10Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.