Related papers: Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition

URL: http://arxiv.org/abs/2008.13093v3
Date: Wed, 2 Sep 2020 23:05:17 GMT
Title: Parallel Rescoring with Transformer for Streaming On-Device Speech Recognition
Authors: Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He
Abstract summary: Two-pass model provides better speed-quality trade-offs for on-device speech recognition. The 2nd-pass model plays a key role in the quality improvement of the end-to-end model to surpass the conventional model. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequences in parallel.
Score: 36.86458309520383
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent advances of end-to-end models have outperformed conventional models through employing a two-pass model. The two-pass model provides better speed-quality trade-offs for on-device speech recognition, where a 1st-pass model generates hypotheses in a streaming fashion, and a 2nd-pass model re-scores the hypotheses with full audio sequence context. The 2nd-pass model plays a key role in the quality improvement of the end-to-end model to surpass the conventional model. One main challenge of the two-pass model is the computation latency introduced by the 2nd-pass model. Specifically, the original design of the two-pass model uses LSTMs for the 2nd-pass model, which are subject to long latency as they are constrained by the recurrent nature and have to run inference sequentially. In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequences in parallel and can therefore utilize the on-device computation resources more efficiently. Compared with an LSTM-based baseline, our proposed Transformer rescorer achieves more than 50% latency reduction with quality improvement.
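The two-pass flow described in the abstract can be illustrated with a minimal sketch: a 1st pass supplies an n-best list of hypotheses with scores, and a 2nd pass re-scores each hypothesis before the best one is chosen. The names `Hypothesis`, `rescore`, and `toy_second_pass`, the linear interpolation, and the default `weight` of 0.5 are all assumptions made for illustration; the abstract does not specify how the two scores are combined, and the toy scorer stands in for a real LSTM or Transformer rescorer.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Hypothesis:
    tokens: List[str]        # candidate transcript from the 1st pass
    first_pass_score: float  # log-probability assigned by the streaming 1st pass


def rescore(
    hypotheses: Sequence[Hypothesis],
    second_pass_score: Callable[[List[str]], float],
    weight: float = 0.5,  # assumed interpolation weight, not taken from the paper
) -> Hypothesis:
    """Return the hypothesis with the highest interpolated score."""
    if not hypotheses:
        raise ValueError("need at least one hypothesis")

    def combined(h: Hypothesis) -> float:
        # Linear interpolation of the two log-scores (an illustrative choice).
        return (1.0 - weight) * h.first_pass_score + weight * second_pass_score(h.tokens)

    return max(hypotheses, key=combined)


def toy_second_pass(tokens: List[str]) -> float:
    # Hypothetical stand-in for a neural rescorer: favors transcripts with "speech".
    return 0.0 if "speech" in tokens else -5.0


nbest = [
    Hypothesis(["recognize", "peach"], -1.0),
    Hypothesis(["recognize", "speech"], -1.2),
]
best = rescore(nbest, toy_second_pass)
print(" ".join(best.tokens))
```

Each hypothesis is scored independently of the others here, so the scoring calls could in principle be batched. The paper's latency argument is about parallelism within a hypothesis sequence as well, which this toy scorer does not model.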

Related papers

M2Rec: Multi-scale Mamba for Efficient Sequential Recommendation [35.508076394809784]
M2Rec is a novel sequential recommendation framework that integrates multi-scale Mamba with Fourier analysis, Large Language Models, and adaptive gating. Experiments demonstrate that M2Rec achieves state-of-the-art performance, improving Hit Rate@10 by 3.2% over existing Mamba-based models.
arXiv Detail & Related papers (2025-05-07T14:14:29Z)
Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models [7.928003786376716]
We propose novel architectures for convolutional recurrent neural networks. We improve note-state sequence modeling by using a pitchwise LSTM. We show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset.
arXiv Detail & Related papers (2024-04-10T08:06:15Z)
STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences. We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode [1.7704011486040847]
Non-streaming models provide better performance as they look at the entire audio context. We show that the Transformer model offers an acceptable word error rate (WER) with the lowest latency requirements. We highlight the importance of a CNN front-end with the Transformer architecture to achieve comparable WER.
arXiv Detail & Related papers (2022-06-26T09:12:27Z)
Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition. The t-SOT model has the advantages of less inference cost and a simpler model architecture. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
TSNAT: Two-Step Non-Autoregressive Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict the entire output tokens in at least one step. To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT). The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient E2E SE model, termed WaveCRN. In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU). In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer. We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model. The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses. A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [46.34788932277904]
We improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition. To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks. We further improve the training strategy with sequence-level teacher-student learning.
arXiv Detail & Related papers (2020-03-17T00:52:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.