Parallel Rescoring with Transformer for Streaming On-Device Speech
Recognition
- URL: http://arxiv.org/abs/2008.13093v3
- Date: Wed, 2 Sep 2020 23:05:17 GMT
- Title: Parallel Rescoring with Transformer for Streaming On-Device Speech
Recognition
- Authors: Wei Li, James Qin, Chung-Cheng Chiu, Ruoming Pang, Yanzhang He
- Abstract summary: The two-pass model provides better speed-quality trade-offs for on-device speech recognition.
The 2nd-pass model plays a key role in the quality improvement of the end-to-end model to surpass the conventional model.
In this work we explore replacing the LSTM layers in the 2nd-pass rescorer with Transformer layers, which can process the entire hypothesis sequences in parallel.
- Score: 36.86458309520383
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances of end-to-end models have outperformed conventional models
through employing a two-pass model. The two-pass model provides better
speed-quality trade-offs for on-device speech recognition, where a 1st-pass
model generates hypotheses in a streaming fashion, and a 2nd-pass model
re-scores the hypotheses with full audio sequence context. The 2nd-pass model
plays a key role in the quality improvement of the end-to-end model to surpass
the conventional model. One main challenge of the two-pass model is the
computation latency introduced by the 2nd-pass model. Specifically, the
original design of the two-pass model uses LSTMs for the 2nd-pass model, which
are subject to long latency as they are constrained by the recurrent nature and
have to run inference sequentially. In this work we explore replacing the LSTM
layers in the 2nd-pass rescorer with Transformer layers, which can process the
entire hypothesis sequences in parallel and can therefore utilize the on-device
computation resources more efficiently. Compared with an LSTM-based baseline,
our proposed Transformer rescorer achieves more than 50% latency reduction with
quality improvement.
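
The rescoring step described in the abstract can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the authors' implementation: the names `toy_scorer`, `second_pass_score`, `rescore`, the interpolation `weight`, and the probability table are all hypothetical, and the sketch scores text only, whereas the paper's rescorer also uses the full audio context. The point it illustrates is that, with the hypothesis tokens already known, each per-token term depends only on a fixed prefix and not on a hidden state carried from the previous step, which is what allows a Transformer to evaluate all positions in parallel.

```python
import math
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a scorer maps (prefix tokens, next token) to a log-probability,
# and a hypothesis is a (first-pass log score, tokens) pair.
TokenScorer = Callable[[Sequence[str], str], float]
Hypothesis = Tuple[float, List[str]]

# Made-up token probabilities standing in for a trained 2nd-pass model.
TOY_PROBS = {"recognize": 0.5, "speech": 0.5, "wreck": 0.1,
             "a": 0.5, "nice": 0.2, "beach": 0.1}


def toy_scorer(prefix: Sequence[str], token: str) -> float:
    """Toy stand-in that ignores the prefix; a real rescorer would condition on it."""
    return math.log(TOY_PROBS.get(token, 0.01))


def second_pass_score(tokens: Sequence[str], scorer: TokenScorer) -> float:
    """Log-likelihood of a complete hypothesis.

    Every term uses only tokens that are already known, so the terms are
    independent of each other and could be computed in parallel. They are
    summed in a plain loop here for clarity.
    """
    return sum(scorer(tokens[:i], tokens[i]) for i in range(len(tokens)))


def rescore(nbest: Sequence[Hypothesis], scorer: TokenScorer,
            weight: float = 0.5) -> Hypothesis:
    """Pick the hypothesis with the best interpolated 1st-pass and 2nd-pass score."""
    def combined(hyp: Hypothesis) -> float:
        first_pass, tokens = hyp
        return (1.0 - weight) * first_pass + weight * second_pass_score(tokens, scorer)
    return max(nbest, key=combined)


# Two made-up 1st-pass hypotheses; the 1st pass alone slightly prefers the second one.
NBEST: List[Hypothesis] = [
    (-1.0, ["recognize", "speech"]),
    (-0.9, ["wreck", "a", "nice", "beach"]),
]

best_score, best_tokens = rescore(NBEST, toy_scorer)
```

With these made-up numbers the 2nd-pass score overturns the 1st-pass ranking and `best_tokens` is `["recognize", "speech"]`. Setting `weight=0.0` disables the 2nd pass and returns the 1st-pass favourite instead.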
Related papers
- Towards Efficient and Real-Time Piano Transcription Using Neural Autoregressive Models [7.928003786376716]
We propose novel architectures for convolutional recurrent neural networks.
We improve note-state sequence modeling by using a pitchwise LSTM.
We show that the proposed models are comparable to state-of-the-art models in terms of note accuracy on the MAESTRO dataset.
arXiv Detail & Related papers (2024-04-10T08:06:15Z)
- STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition [50.064502884594376]
We study the problem of human action recognition using motion capture (MoCap) sequences.
We propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences.
The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models.
arXiv Detail & Related papers (2023-03-31T16:19:27Z)
- On Comparison of Encoders for Attention based End to End Speech Recognition in Standalone and Rescoring Mode [1.7704011486040847]
Non-streaming models provide better performance as they look at the entire audio context.
We show that the Transformer model offers acceptable WER with the lowest latency requirements.
We highlight the importance of a CNN front-end with the Transformer architecture to achieve comparable word error rates (WER).
arXiv Detail & Related papers (2022-06-26T09:12:27Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
t-SOT is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- TSNAT: Two-Step Non-Autoregressvie Transformer Models for Speech Recognition [69.68154370877615]
The non-autoregressive (NAR) models can get rid of the temporal dependency between the output tokens and predict the entire output tokens in at least one step.
To address these two problems, we propose a new model named the two-step non-autoregressive transformer (TSNAT).
The results show that the TSNAT can achieve a competitive performance with the AR model and outperform many complicated NAR models.
arXiv Detail & Related papers (2021-04-04T02:34:55Z)
- WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement [31.236720440495994]
In this paper, we propose an efficient E2E SE model, termed WaveCRN.
In WaveCRN, the speech locality feature is captured by a convolutional neural network (CNN), while the temporal sequential property of the locality feature is modeled by stacked simple recurrent units (SRU).
In addition, in order to more effectively suppress the noise components in the input noisy speech, we derive a novel restricted feature masking (RFM) approach that performs enhancement on the feature maps in the hidden layers.
arXiv Detail & Related papers (2020-04-06T13:48:05Z)
- A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency [88.08721721440429]
We develop a first-pass Recurrent Neural Network Transducer (RNN-T) model and a second-pass Listen, Attend, Spell (LAS) rescorer.
We find that RNN-T+LAS offers a better WER and latency tradeoff compared to a conventional model.
arXiv Detail & Related papers (2020-03-28T05:00:33Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
- High-Accuracy and Low-Latency Speech Recognition with Two-Head Contextual Layer Trajectory LSTM Model [46.34788932277904]
We improve conventional hybrid LSTM acoustic models for high-accuracy and low-latency automatic speech recognition.
To achieve high accuracy, we use a contextual layer trajectory LSTM (cltLSTM), which decouples the temporal modeling and target classification tasks.
We further improve the training strategy with sequence-level teacher-student learning.
arXiv Detail & Related papers (2020-03-17T00:52:11Z)