Sequence-to-Sequence Learning via Attention Transfer for Incremental
Speech Recognition
- URL: http://arxiv.org/abs/2011.02127v1
- Date: Wed, 4 Nov 2020 05:06:01 GMT
- Title: Sequence-to-Sequence Learning via Attention Transfer for Incremental
Speech Recognition
- Authors: Sashi Novitasari, Andros Tjandra, Sakriani Sakti, Satoshi Nakamura
- Abstract summary: We investigate whether it is possible to employ the original architecture of attention-based ASR for ISR tasks.
We design an alternative student network that, instead of using a thinner or a shallower model, keeps the original architecture of the teacher model but with shorter sequences.
Our experiments show that by delaying the start of the recognition process by about 1.7 sec, we can achieve performance comparable to a model that waits until the end.
- Score: 25.93405777713522
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Attention-based sequence-to-sequence automatic speech recognition (ASR)
requires a significant delay to recognize long utterances because the output is
generated only after the entire input sequence has been received. Although several
studies have recently proposed mechanisms for incremental speech recognition (ISR),
they rely on different frameworks and learning algorithms that are more complicated
than the standard ASR model. One main reason is that the model needs to decide the
incremental steps and learn the transcription that aligns with the current
short speech segment. In this work, we investigate whether it is possible to
employ the original architecture of attention-based ASR for ISR tasks by
treating a full-utterance ASR as the teacher model and the ISR as the student
model. We design an alternative student network that, instead of using a
thinner or shallower model, keeps the original architecture of the teacher
model but operates on shorter sequences (fewer encoder and decoder states). Using
attention transfer, the student network learns to mimic the teacher's alignment
between the current short speech segment and the transcription. Our
experiments show that, by delaying the start of the recognition process by
about 1.7 sec, we can achieve performance comparable to a model that waits
until the end of the utterance.
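The attention-transfer idea above can be illustrated with a minimal sketch: the student is trained to match the teacher's attention distribution over encoder frames at each decoder step, typically via a divergence term added to the training loss. The function name, the KL-divergence choice, and the matrix shapes below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def attention_transfer_loss(teacher_attn, student_attn, eps=1e-8):
    """Mean per-step KL divergence KL(teacher || student) between attention
    distributions.

    Both arguments are (decoder_steps, encoder_frames) matrices whose rows
    are (approximately) normalized attention weights; rows are re-normalized
    here for numerical safety. This is an illustrative sketch, not the
    paper's exact formulation.
    """
    t = teacher_attn / (teacher_attn.sum(axis=-1, keepdims=True) + eps)
    s = student_attn / (student_attn.sum(axis=-1, keepdims=True) + eps)
    # KL(t || s) summed over frames, averaged over decoder steps.
    return float(np.sum(t * (np.log(t + eps) - np.log(s + eps))) / t.shape[0])

# Hypothetical usage: teacher and student attend over the same short
# speech segment; the loss is zero when the student matches the teacher.
teacher = np.array([[0.7, 0.2, 0.1],
                    [0.1, 0.8, 0.1]])
student = np.array([[0.4, 0.4, 0.2],
                    [0.2, 0.6, 0.2]])
loss = attention_transfer_loss(teacher, student)
```

In practice this term would be combined with the usual cross-entropy loss on the transcription, so the student learns both the labels and the teacher's alignment on each segment.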
Related papers
- Streaming Speech-to-Confusion Network Speech Recognition [19.720334657478475]
We present a novel streaming ASR architecture that outputs a confusion network while maintaining limited latency.
We show that 1-best results of our model are on par with a comparable RNN-T system.
We also show that our model outperforms a strong RNN-T baseline on a far-field voice assistant task.
arXiv Detail & Related papers (2023-06-02T20:28:14Z)
- Extended Graph Temporal Classification for Multi-Speaker End-to-End ASR [77.82653227783447]
We propose an extension of GTC to model the posteriors of both labels and label transitions by a neural network.
As an example application, we use the extended GTC (GTC-e) for the multi-speaker speech recognition task.
arXiv Detail & Related papers (2022-03-01T05:02:02Z)
- SRU++: Pioneering Fast Recurrence with Attention for Speech Recognition [49.42625022146008]
We present the advantages of applying SRU++ in ASR tasks by comparing with Conformer across multiple ASR benchmarks.
Specifically, SRU++ can surpass Conformer on long-form speech input with a large margin, based on our analysis.
arXiv Detail & Related papers (2021-10-11T19:23:50Z)
- Streaming End-to-End ASR based on Blockwise Non-Autoregressive Models [57.20432226304683]
Non-autoregressive (NAR) modeling has gained more and more attention in speech processing.
We propose a novel end-to-end streaming NAR speech recognition system.
We show that the proposed method improves online ASR recognition in low latency conditions.
arXiv Detail & Related papers (2021-07-20T11:42:26Z)
- Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition [58.69803243323346]
Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks.
However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR.
We present the dual causal/non-causal self-attention architecture, which, in contrast to restricted self-attention, prevents the overall context from growing beyond the look-ahead of a single layer.
arXiv Detail & Related papers (2021-07-02T20:56:13Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.