Related papers: WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection

WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection

URL: http://arxiv.org/abs/2505.13765v1
Date: Mon, 19 May 2025 22:48:01 GMT
Title: WIND: Accelerated RNN-T Decoding with Windowed Inference for Non-blank Detection
Authors: Hainan Xu, Vladimir Bataev, Lilit Grigoryan, Boris Ginsburg,
Abstract summary: Windowed Inference for Non-blank Detection (WIND) is a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy.<n>We implement WIND for greedy decoding, batched greedy decoding with label-looping techniques, and also propose a novel beam-search decoding method.
Score: 17.950722198543897
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose Windowed Inference for Non-blank Detection (WIND), a novel strategy that significantly accelerates RNN-T inference without compromising model accuracy. During model inference, instead of processing frames sequentially, WIND processes multiple frames simultaneously within a window in parallel, allowing the model to quickly locate non-blank predictions during decoding, resulting in significant speed-ups. We implement WIND for greedy decoding, batched greedy decoding with label-looping techniques, and also propose a novel beam-search decoding method. Experiments on multiple datasets with different conditions show that our method, when operating in greedy modes, speeds up as much as 2.4X compared to the baseline sequential approach while maintaining identical Word Error Rate (WER) performance. Our beam-search algorithm achieves slightly better accuracy than alternative methods, with significantly improved speed. We will open-source our WIND implementation.

Related papers

A Differential Evolution Algorithm with Neighbor-hood Mutation for DOA Estimation [11.842677286643609]
Two-dimensional (2D) Multiple Signal Classification algorithm is a powerful technique for high-resolution direction-of-arrival (DOA) estimation in array signal processing.<n>We reformulate the peak-finding process as a multimodal optimization prob-lem, and propose a Differential Evolu-tion algorithm with Neighborhood Mutation (DE-NM) to efficiently lo-cate multiple spectral peaks.
arXiv Detail & Related papers (2025-07-08T14:30:01Z)
Pushing the Limits of Beam Search Decoding for Transducer-based ASR models [18.41716157723428]
beam search significantly slows down Transducers due to repeated evaluations of key network components.<n>This paper introduces a universal method to accelerate beam search for Transducers, enabling the implementation of two optimized algorithms: ALSD++ and AES++.
arXiv Detail & Related papers (2025-05-30T19:42:48Z)
An Efficient Diffusion-based Non-Autoregressive Solver for Traveling Salesman Problem [21.948190231334088]
We propose DEITSP, a diffusion model with efficient iterations tailored for Traveling Salesman Problems.<n>We introduce a one-step diffusion model that integrates the controlled discrete noise addition process with self-consistency enhancement.<n>We also design a dual-modality graph transformer to bolster the extraction and fusion of features from node and edge modalities.
arXiv Detail & Related papers (2025-01-23T15:47:04Z)
Faster WIND: Accelerating Iterative Best-of-$N$ Distillation for LLM Alignment [81.84950252537618]
This paper reveals a unified game-theoretic connection between iterative BOND and self-play alignment.<n>We establish a novel framework, WIN rate Dominance (WIND), with a series of efficient algorithms for regularized win rate dominance optimization.
arXiv Detail & Related papers (2024-10-28T04:47:39Z)
Fast Chain-of-Thought: A Glance of Future from Parallel Decoding Leads to Answers Faster [61.83949316226113]
FastCoT is a model-agnostic framework based on parallel decoding. We show that FastCoT saves inference time by nearly 20% with only a negligible performance drop compared to the regular approach.
arXiv Detail & Related papers (2023-11-14T15:56:18Z)
DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion [137.8749239614528]
We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video.
arXiv Detail & Related papers (2023-03-27T00:40:52Z)
A Token-Wise Beam Search Algorithm for RNN-T [3.682821163882332]
We present a decoding beam search algorithm that batches the joint network calls across a segment of time steps. In addition, aggregating emission probabilities over a segment may be seen as a better approximation to finding the most likely model output.
arXiv Detail & Related papers (2023-02-28T07:20:49Z)
Rapid Person Re-Identification via Sub-space Consistency Regularization [51.76876061721556]
Person Re-Identification (ReID) matches pedestrians across disjoint cameras. Existing ReID methods adopting real-value feature descriptors have achieved high accuracy, but they are low in efficiency due to the slow Euclidean distance computation. We propose a novel Sub-space Consistency Regularization (SCR) algorithm that can speed up the ReID procedure by 0.25$ times.
arXiv Detail & Related papers (2022-07-13T02:44:05Z)
SwiftLane: Towards Fast and Efficient Lane Detection [0.8972186395640678]
We propose SwiftLane: a light-weight, end-to-end deep learning based framework, coupled with the row-wise classification formulation for fast and efficient lane detection. Our method achieves an inference speed of 411 frames per second, surpassing state-of-the-art in terms of speed while achieving comparable results in terms of accuracy on the popular CULane benchmark dataset.
arXiv Detail & Related papers (2021-10-22T13:35:05Z)
Nesterov Accelerated ADMM for Fast Diffeomorphic Image Registration [63.15453821022452]
Recent developments in approaches based on deep learning have achieved sub-second runtimes for DiffIR. We propose a simple iterative scheme that functionally composes intermediate non-stationary velocity fields. We then propose a convex optimisation model that uses a regularisation term of arbitrary order to impose smoothness on these velocity fields.
arXiv Detail & Related papers (2021-09-26T19:56:45Z)
A Real-time Action Representation with Temporal Encoding and Deep Compression [115.3739774920845]
We propose a new real-time convolutional architecture, called Temporal Convolutional 3D Network (T-C3D), for action representation. T-C3D learns video action representations in a hierarchical multi-granularity manner while obtaining a high process speed. Our method achieves clear improvements on UCF101 action recognition benchmark against state-of-the-art real-time methods by 5.4% in terms of accuracy and 2 times faster in terms of inference speed with a less than 5MB storage model.
arXiv Detail & Related papers (2020-06-17T06:30:43Z)

This list is automatically generated from the titles and abstracts of the papers in this site.