Pushing the Limits of Beam Search Decoding for Transducer-based ASR models
- URL: http://arxiv.org/abs/2506.00185v1
- Date: Fri, 30 May 2025 19:42:48 GMT
- Title: Pushing the Limits of Beam Search Decoding for Transducer-based ASR models
- Authors: Lilit Grigoryan, Vladimir Bataev, Andrei Andrusenko, Hainan Xu, Vitaly Lavrukhin, Boris Ginsburg
- Abstract summary: Beam search significantly slows down Transducers due to repeated evaluations of key network components. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++.
- Score: 18.41716157723428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transducer models have emerged as a promising choice for end-to-end ASR systems, offering a balanced trade-off between recognition accuracy, streaming capabilities, and inference speed in greedy decoding. However, beam search significantly slows down Transducers due to repeated evaluations of key network components, limiting practical applications. This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++. The proposed method utilizes batch operations, a tree-based hypothesis structure, novel blank scoring for enhanced shallow fusion, and CUDA graph execution for efficient GPU inference. This narrows the speed gap between beam and greedy modes to only 10-20% for the whole system, achieves a 14-30% relative improvement in WER compared to greedy decoding, and improves shallow fusion in low-resource settings by up to 11% compared to existing implementations. All the algorithms are open-sourced.
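To make the tree-based hypothesis idea concrete, below is a minimal Python sketch, an illustration under our own assumptions rather than the paper's implementation: each beam hypothesis is a node that stores only its last token, a parent index, and a cumulative score, so extending a hypothesis appends one node instead of copying the full label sequence.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class HypNode:
    """One node in the shared hypothesis tree: a label plus a link to its parent."""
    token: Optional[int]   # emitted label (None for the root / start-of-sequence)
    parent: Optional[int]  # index of the parent node, None for the root
    score: float           # cumulative log-probability of the prefix


class HypTree:
    """Arena-backed prefix tree: extending a hypothesis appends one node
    instead of copying the whole label sequence."""

    def __init__(self):
        self.nodes: List[HypNode] = [HypNode(token=None, parent=None, score=0.0)]

    def extend(self, parent_idx: int, token: int, logp: float) -> int:
        """Create a child hypothesis and return its index."""
        parent = self.nodes[parent_idx]
        self.nodes.append(HypNode(token=token, parent=parent_idx,
                                  score=parent.score + logp))
        return len(self.nodes) - 1

    def labels(self, idx: int) -> List[int]:
        """Materialize the label sequence only when needed (e.g. for the final best hypothesis)."""
        out = []
        while self.nodes[idx].token is not None:
            out.append(self.nodes[idx].token)
            idx = self.nodes[idx].parent
        return out[::-1]


# Toy usage: grow two beams that share the prefix [7].
tree = HypTree()
a = tree.extend(0, 7, -0.1)            # root -> 7
b = tree.extend(a, 3, -0.5)            # 7 -> 3
c = tree.extend(a, 9, -0.7)            # 7 -> 9 (shares the prefix with b)
print(tree.labels(b), tree.labels(c))  # [7, 3] [7, 9]
```

In an actual Transducer beam search such as ALSD++ or AES++, a structure of this kind would be combined with batched joint-network calls over all active hypotheses; full label sequences are only materialized for the final hypotheses.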
Related papers
- NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search in out-of-domain scenarios while avoiding the significant slowdown caused by beam search (a small illustrative shallow-fusion sketch is given after this list).
arXiv Detail & Related papers (2025-05-28T20:43:10Z) - Reducing Storage of Pretrained Neural Networks by Rate-Constrained Quantization and Entropy Coding [56.066799081747845]
The ever-growing size of neural networks poses serious challenges on resource-constrained devices. We propose a novel post-training compression framework that combines rate-aware quantization with entropy coding. Our method allows for very fast decoding and is compatible with arbitrary quantization grids.
arXiv Detail & Related papers (2025-05-24T15:52:49Z) - Joint Transmit and Pinching Beamforming for Pinching Antenna Systems (PASS): Optimization-Based or Learning-Based? [89.05848771674773]
A novel pinching antenna system (PASS)-enabled downlink multi-user multiple-input single-output (MISO) framework is proposed. It consists of multiple waveguides, which are equipped with numerous low-cost antennas, named pinching antennas (PAs). The positions of the PAs can be reconfigured to span both the large-scale path and space.
arXiv Detail & Related papers (2025-02-12T18:54:10Z) - Efficient Approximate Degenerate Ordered Statistics Decoding for Quantum Codes via Reliable Subset Reduction [5.625796693054094]
We introduce the concept of approximate degenerate decoding and integrate it with ordered statistics decoding (OSD). We present an ADOSD algorithm that significantly improves OSD efficiency in the code capacity noise model.
arXiv Detail & Related papers (2024-12-30T17:45:08Z) - Sparsity-Constraint Optimization via Splicing Iteration [1.3622424109977902]
We develop an algorithm named Sparsity-Constraint Optimization via sPlicing itEration (SCOPE).
SCOPE converges effectively without tuning parameters.
We apply SCOPE to solve quadratic optimization, learn sparse classifiers, and recover sparse Markov networks for binary variables.
Our open-source Python package skscope based on C++ implementation is publicly available on GitHub.
arXiv Detail & Related papers (2024-06-17T18:34:51Z) - Label-Looping: Highly Efficient Decoding for Transducers [19.091932566833265]
This paper introduces a highly efficient greedy decoding algorithm for Transducer-based speech recognition models.
Experiments show that the label-looping algorithm is up to 2.0X faster than conventional batched decoding when using batch size 32.
arXiv Detail & Related papers (2024-06-10T12:34:38Z) - Improving Dual-Encoder Training through Dynamic Indexes for Negative Mining [61.09807522366773]
We introduce an algorithm that approximates the softmax with provable bounds and that dynamically maintains the tree.
In our study on datasets with over twenty million targets, our approach cuts error by half in relation to oracle brute-force negative mining.
arXiv Detail & Related papers (2023-03-27T15:18:32Z) - A Token-Wise Beam Search Algorithm for RNN-T [3.682821163882332]
We present a decoding beam search algorithm that batches the joint network calls across a segment of time steps.
In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output.
arXiv Detail & Related papers (2023-02-28T07:20:49Z) - Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras.
Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation.
We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25 times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z) - FastFlowNet: A Lightweight Network for Fast Optical Flow Estimation [81.76975488010213]
Dense optical flow estimation plays a key role in many robotic vision tasks.
Current networks often occupy a large number of parameters and require heavy computation costs.
Our proposed FastFlowNet works in the well-known coarse-to-fine manner with the following innovations.
arXiv Detail & Related papers (2021-03-08T03:09:37Z) - BRDS: An FPGA-based LSTM Accelerator with Row-Balanced Dual-Ratio Sparsification [3.3711251611130337]
A hardware-friendly pruning algorithm for reducing energy consumption and improving the speed of Long Short-Term Memory (LSTM) neural network accelerators is presented.
Results show that the proposed accelerator could provide up to 272% higher effective GOPS/W and the perplexity error is reduced by up to 1.4% for the PTB dataset.
arXiv Detail & Related papers (2021-01-07T18:23:48Z)
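As referenced in the NGPU-LM entry above, the following is a minimal, hypothetical Python sketch of shallow fusion in greedy decoding: at each step, acoustic log-probabilities are combined with external language-model scores before the argmax, with the blank scored by the acoustic model alone. The function names, weights, and toy LM are illustrative assumptions, not the NGPU-LM implementation, and real CTC/Transducer greedy decoding includes additional rules (e.g. repeat collapsing) that this sketch omits.

```python
import numpy as np


def greedy_decode_with_shallow_fusion(log_probs, lm_score_fn, lm_weight=0.5, blank_id=0):
    """Frame-wise greedy decoding with shallow fusion of an external LM.

    log_probs:   [T, V] array of acoustic log-probabilities (toy stand-in for an ASR model).
    lm_score_fn: callable(context_tokens) -> [V] array of LM log-probabilities.
    The LM score is added only for non-blank tokens, so the blank is scored
    by the acoustic model alone (one common shallow-fusion convention).
    """
    hypothesis = []
    for frame in log_probs:
        fused = frame.copy()
        lm_scores = lm_score_fn(hypothesis)
        non_blank = np.arange(len(frame)) != blank_id
        fused[non_blank] += lm_weight * lm_scores[non_blank]
        token = int(np.argmax(fused))
        if token != blank_id:
            hypothesis.append(token)
    return hypothesis


# Toy usage with random acoustic scores and a uniform "LM".
rng = np.random.default_rng(0)
T, V = 6, 5
acoustic = np.log(rng.dirichlet(np.ones(V), size=T))
uniform_lm = lambda context: np.full(V, -np.log(V))
print(greedy_decode_with_shallow_fusion(acoustic, uniform_lm))
```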
This list is automatically generated from the titles and abstracts of the papers on this site.