Related papers: Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text

URL: http://arxiv.org/abs/2602.24245v1
Date: Fri, 27 Feb 2026 18:17:10 GMT
Title: Chunk-wise Attention Transducers for Fast and Accurate Streaming Speech-to-Text
Authors: Hainan Xu, Vladimir Bataev, Travis M. Bartley, Jagadeesh Balam,
Abstract summary: Chunk-wise Attention Transducer (CHAT) is a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
Score: 19.1160706519659
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We propose Chunk-wise Attention Transducer (CHAT), a novel extension to RNN-T models that processes audio in fixed-size chunks while employing cross-attention within each chunk. This hybrid approach maintains RNN-T's streaming capability while introducing controlled flexibility for local alignment modeling. CHAT significantly reduces the temporal dimension that RNN-T must handle, yielding substantial efficiency improvements: up to 46.2% reduction in peak training memory, up to 1.36X faster training, and up to 1.69X faster inference. Alongside these efficiency gains, CHAT achieves consistent accuracy improvements over RNN-T across multiple languages and tasks -- up to 6.3% relative WER reduction for speech recognition and up to 18.0% BLEU improvement for speech translation. The method proves particularly effective for speech translation, where RNN-T's strict monotonic alignment hurts performance. Our results demonstrate that the CHAT model offers a practical solution for deploying more capable streaming speech models without sacrificing real-time constraints.
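
The abstract says CHAT applies cross-attention within fixed-size chunks and thereby shrinks the temporal dimension the transducer has to handle. The toy sketch below shows one plausible reading of that idea and is not the authors' implementation: a single query vector (a hypothetical stand-in for a label-side state) attends over each chunk of encoder frames and produces one context vector per chunk. The function names, the use of frames as both keys and values, and the toy dimensions are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def chunk_cross_attention(query, frames, chunk_size):
    """Attend from one query vector to each fixed-size chunk of frames.

    Returns one context vector per chunk, so the time axis shrinks from
    len(frames) to ceil(len(frames) / chunk_size).
    """
    scale = 1.0 / math.sqrt(len(query))
    contexts = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        # Scaled dot-product score between the query and each frame in the chunk.
        scores = [scale * sum(q * x for q, x in zip(query, frame)) for frame in chunk]
        weights = softmax(scores)
        # Weighted sum of the chunk's frames (assumption: frames serve as keys and values).
        context = [
            sum(w * frame[d] for w, frame in zip(weights, chunk))
            for d in range(len(query))
        ]
        contexts.append(context)
    return contexts


# Toy input: 8 encoder frames of dimension 2, grouped into chunks of 4 frames.
frames = [[float(t), 1.0] for t in range(8)]
query = [0.0, 1.0]  # hypothetical label-side state
contexts = chunk_cross_attention(query, frames, chunk_size=4)
print(len(frames), "frames ->", len(contexts), "chunk contexts")
```

In this sketch attention never crosses a chunk boundary, so each context can be computed as soon as its chunk has arrived, which is the property that keeps such a scheme compatible with streaming.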

Related papers

Speech-FT: Merging Pre-trained And Fine-Tuned Speech Representation Models For Cross-Task Generalization [51.56024241398741]
Fine-tuning speech representation models can enhance performance on specific tasks but often compromises cross-task generalization ability. Existing approaches, such as regularizing weight changes during fine-tuning, may fail to maintain sufficiently high feature similarity with the pre-trained model. We propose Speech-FT, a novel two-stage fine-tuning framework designed to maintain cross-task generalization while benefiting from fine-tuning.
arXiv Detail & Related papers (2025-02-18T09:23:42Z)
Dynamic Latency for CTC-Based Streaming Automatic Speech Recognition With Emformer [0.4588028371034407]
A frame-level model using an efficient augmented memory transformer block and a dynamic latency training method is employed for streaming automatic speech recognition. With an average latency of 640ms, our model achieves a relative WER reduction of 6.4% on test-clean and 3.0% on test-other versus the truncated chunk-wise Transformer.
arXiv Detail & Related papers (2022-03-29T14:31:06Z)
Speech-enhanced and Noise-aware Networks for Robust Speech Recognition [25.279902171523233]
A noise-aware training framework based on two cascaded neural structures is proposed to jointly optimize speech enhancement and speech recognition. The two proposed systems achieve word error rates (WER) of 3.90% and 3.55%, respectively, on the Aurora-4 task. Compared with the best existing systems that use bigram and trigram language models for decoding, the proposed CNN-TDNNF-based system achieves relative WER reductions of 15.20% and 33.53%, respectively.
arXiv Detail & Related papers (2022-03-25T15:04:51Z)
On Addressing Practical Challenges for RNN-Transducer [72.72132048437751]
We adapt a well-trained RNN-T model to a new domain without collecting the audio data. We obtain word-level confidence scores by utilizing several types of features calculated during decoding. The proposed time stamping method can get less than 50ms word timing difference on average.
arXiv Detail & Related papers (2021-04-27T23:31:43Z)
Deep Time Delay Neural Network for Speech Enhancement with Full Data Learning [60.20150317299749]
This paper proposes a deep time delay neural network (TDNN) for speech enhancement with full data learning. To make full use of the training data, we propose a full data learning method for speech enhancement.
arXiv Detail & Related papers (2020-11-11T06:32:37Z)
FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible. Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models. We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
Speaker Representation Learning using Global Context Guided Channel and Time-Frequency Transformations [67.18006078950337]
We use the global context information to enhance important channels and recalibrate salient time-frequency locations. The proposed modules, together with a popular ResNet based model, are evaluated on the VoxCeleb1 dataset.
arXiv Detail & Related papers (2020-09-02T01:07:29Z)
RNN-T Models Fail to Generalize to Out-of-Domain Audio: Causes and Solutions [73.45995446500312]
We analyze the generalization properties of streaming and non-streaming recurrent neural network transducer (RNN-T) based end-to-end models. We propose two solutions: combining multiple regularization techniques during training, and using dynamic overlapping inference.
arXiv Detail & Related papers (2020-05-07T06:24:47Z)
Single Channel Speech Enhancement Using Temporal Convolutional Recurrent Neural Networks [23.88788382262305]
The temporal convolutional recurrent network (TCRN) is an end-to-end model that directly maps noisy waveforms to clean waveforms. We show that our model improves performance compared with existing convolutional recurrent networks.
arXiv Detail & Related papers (2020-02-02T04:26:50Z)

This list is automatically generated from the titles and abstracts of the papers on this site.