Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers
- URL: http://arxiv.org/abs/2104.09426v1
- Date: Mon, 19 Apr 2021 16:18:00 GMT
- Title: Advanced Long-context End-to-end Speech Recognition Using
Context-expanded Transformers
- Authors: Takaaki Hori, Niko Moritz, Chiori Hori, Jonathan Le Roux
- Abstract summary: We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
- Score: 56.56220390953412
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses end-to-end automatic speech recognition (ASR) for long
audio recordings such as lecture and conversational speeches. Most end-to-end
ASR models are designed to recognize independent utterances, but contextual
information (e.g., speaker or topic) over multiple utterances is known to be
useful for ASR. In our prior work, we proposed a context-expanded Transformer
that accepts multiple consecutive utterances at the same time and predicts an
output sequence for the last utterance, achieving 5-15% relative error
reduction from utterance-based baselines in lecture and conversational ASR
benchmarks. Although the results have shown remarkable performance gain, there
is still potential to further improve the model architecture and the decoding
process. In this paper, we extend our prior work by (1) introducing the
Conformer architecture to further improve the accuracy, (2) accelerating the
decoding process with a novel activation recycling technique, and (3) enabling
streaming decoding with triggered attention. We demonstrate that the extended
Transformer provides state-of-the-art end-to-end ASR performance, obtaining a
17.3% character error rate for the HKUST dataset and 12.0%/6.3% word error
rates for the Switchboard-300 Eval2000 CallHome/Switchboard test sets. The new
decoding method reduces decoding time by more than 50% and further enables
streaming ASR with limited accuracy degradation.
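To make the context-expanded input concrete, here is a minimal sketch (in PyTorch; all names, shapes, and the context size are illustrative assumptions, not the authors' code) of concatenating several consecutive utterances into one encoder input while the loss targets only the last utterance. The paper's activation recycling would additionally cache the activations already computed for the context portion during decoding instead of recomputing them; that caching step is omitted here.

```python
# Sketch: building a context-expanded input for long-context ASR.
# Illustrative only; names, shapes, and context size are assumptions,
# not the authors' implementation.
import torch

def build_context_expanded_input(utt_feats, utt_tokens, n_context=2):
    """Concatenate up to `n_context` preceding utterances with the current one.

    utt_feats:  list of [T_i, feat_dim] float tensors, consecutive utterances
    utt_tokens: list of 1-D long tensors, token ids per utterance
    Returns one long encoder input and targets for the last utterance only.
    """
    window = utt_feats[-(n_context + 1):]   # previous utterances + current
    enc_input = torch.cat(window, dim=0)    # single long feature sequence
    targets = utt_tokens[-1]                # loss is computed on the last utterance
    return enc_input, targets

# Toy usage: three 80-dim feature "utterances" of different lengths.
feats = [torch.randn(t, 80) for t in (120, 90, 150)]
tokens = [torch.randint(0, 100, (n,)) for n in (12, 9, 15)]
x, y = build_context_expanded_input(feats, tokens)
print(x.shape, y.shape)  # torch.Size([360, 80]) torch.Size([15])
```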
Related papers
- Using External Off-Policy Speech-To-Text Mappings in Contextual
End-To-End Automated Speech Recognition [19.489794740679024]
We investigate the potential of leveraging external knowledge, particularly through off-policy key-value stores generated with text-to-speech methods.
In our approach, audio embeddings captured from text-to-speech, along with semantic text embeddings, are used to bias ASR.
Experiments on LibriSpeech and in-house voice assistant/search datasets show that the proposed approach can reduce domain adaptation time by up to 1K GPU-hours.
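As a rough, hypothetical sketch of the key-value biasing idea this summary describes (module names and dimensions are assumptions, not the paper's implementation): decoder states attend over an external store of embedding keys and values, and the attended value biases the decoder state.

```python
# Hypothetical sketch of biasing a decoder state with an external
# key-value store; not the paper's implementation.
import torch
import torch.nn as nn

class KeyValueBias(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.query_proj = nn.Linear(d_model, d_model)

    def forward(self, dec_state, store_keys, store_values):
        # dec_state: [B, d]; store_keys / store_values: [N, d]
        q = self.query_proj(dec_state)                           # [B, d]
        scores = q @ store_keys.T / store_keys.shape[-1] ** 0.5  # [B, N]
        attn = scores.softmax(dim=-1)                            # attention over store
        bias = attn @ store_values                               # [B, d]
        return dec_state + bias                                  # biased decoder state

d = 256
kv = KeyValueBias(d)
dec = torch.randn(4, d)
keys, vals = torch.randn(1000, d), torch.randn(1000, d)
print(kv(dec, keys, vals).shape)  # torch.Size([4, 256])
```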
arXiv Detail & Related papers (2023-01-06T22:32:50Z)
- Label-Synchronous Speech-to-Text Alignment for ASR Using Forward and
Backward Transformers [49.403414751667135]
This paper proposes a novel label-synchronous speech-to-text alignment technique for automatic speech recognition (ASR).
The proposed method re-defines the speech-to-text alignment as a label-synchronous text mapping problem.
Experiments using the corpus of spontaneous Japanese (CSJ) demonstrate that the proposed method provides an accurate utterance-wise alignment.
arXiv Detail & Related papers (2021-04-21T03:05:12Z)
- Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning
with Self-Knowledge Distillation [11.52842516726486]
We propose a Transformer-based ASR model that incorporates a time-reduction layer inside the Transformer encoder layers.
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD) which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
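The time-reduction idea can be sketched as concatenating adjacent encoder frames so the sequence shortens while the feature dimension grows; the reduction factor and placement below are assumptions, and the S-KD fine-tuning stage is omitted.

```python
# Sketch of a time-reduction layer: merge adjacent frames to shorten
# the encoder sequence. Factor and placement are assumptions.
import torch

def time_reduction(x, factor=2):
    """Concatenate `factor` adjacent frames: [B, T, D] -> [B, T//factor, D*factor]."""
    B, T, D = x.shape
    T = (T // factor) * factor                 # drop trailing frames that don't fit
    return x[:, :T, :].reshape(B, T // factor, D * factor)

x = torch.randn(8, 101, 256)
print(time_reduction(x).shape)  # torch.Size([8, 50, 512])
```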
arXiv Detail & Related papers (2021-03-17T21:02:36Z)
- Data Augmentation for End-to-end Code-switching Speech Recognition [54.0507000473827]
Three novel approaches are proposed for code-switching data augmentation: audio splicing with the existing code-switching data, and TTS with new code-switching texts generated by word translation or by word insertion.
Experiments on a 200-hour Mandarin-English code-switching dataset show that each approach individually yields significant improvements in code-switching ASR.
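As a loose illustration of the word-translation variant (the lexicon and replacement probability below are toy assumptions), synthetic code-switching text can be produced by swapping source-language words for dictionary translations.

```python
# Toy sketch of word-translation augmentation for code-switching text.
# The lexicon and probability are illustrative assumptions.
import random

LEXICON = {"今天": "today", "天气": "weather", "很好": "great"}  # toy dictionary

def word_translation_augment(tokens, p=0.3):
    """Randomly replace source-language words with dictionary translations
    to synthesize code-switching text (the word-translation variant)."""
    return [LEXICON[w] if w in LEXICON and random.random() < p else w
            for w in tokens]

random.seed(0)
print(word_translation_augment(["今天", "天气", "很好"], p=0.9))
```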
arXiv Detail & Related papers (2020-11-04T07:12:44Z)
- Sequence-to-Sequence Learning via Attention Transfer for Incremental
Speech Recognition [25.93405777713522]
We investigate whether it is possible to employ the original architecture of attention-based ASR for incremental speech recognition (ISR) tasks.
We design an alternative student network that, instead of using a thinner or a shallower model, keeps the original architecture of the teacher model but with shorter sequences.
Our experiments show that, by delaying the start of the recognition process by about 1.7 seconds, we can achieve performance comparable to a system that waits until the end of the utterance.
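A minimal sketch of an attention-transfer distillation loss, assuming teacher and student attention maps share the same shape (in the paper the student keeps the teacher's architecture but consumes shorter sequences, which this sketch does not model):

```python
# Sketch of attention transfer: match student attention maps to the
# teacher's. Shapes are assumed equal here for simplicity.
import torch
import torch.nn.functional as F

def attention_transfer_loss(student_attn, teacher_attn):
    """MSE between student and teacher attention distributions.
    Both tensors: [B, heads, T_dec, T_enc]."""
    return F.mse_loss(student_attn, teacher_attn)

s = torch.rand(2, 4, 10, 50).softmax(dim=-1)
t = torch.rand(2, 4, 10, 50).softmax(dim=-1)
print(attention_transfer_loss(s, t).item())
```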
arXiv Detail & Related papers (2020-11-04T05:06:01Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission
Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
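FastEmit itself operates on per-sequence transducer probabilities via a gradient-level modification; the following is only a simplified, generic latency penalty (an expected-emission-time term over per-frame emission posteriors), not the FastEmit algorithm, to illustrate what latency regularization optimizes.

```python
# Simplified stand-in for latency regularization: penalize the expected
# frame index at which tokens are emitted. NOT the FastEmit algorithm.
import torch

def expected_delay_penalty(emit_logprob):
    """emit_logprob: [B, T] log-probability of emitting (non-blank) per frame."""
    p = emit_logprob.exp()
    p = p / p.sum(dim=1, keepdim=True)           # normalize over time
    t = torch.arange(p.shape[1], dtype=p.dtype)  # frame indices
    return (p * t).sum(dim=1).mean()             # mean expected emission frame

logp = torch.log_softmax(torch.randn(4, 100), dim=1)
print(expected_delay_penalty(logp).item())
```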
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
- Adapting End-to-End Speech Recognition for Readable Subtitles [15.525314212209562]
In some use cases such as subtitling, verbatim transcription would reduce output readability given limited screen size and reading time.
We first investigate a cascaded system, where an unsupervised compression model is used to post-edit the transcribed speech.
Experiments show that, with far less data than needed to train a model from scratch, we can adapt a Transformer-based ASR model to incorporate both transcription and compression capabilities.
arXiv Detail & Related papers (2020-05-25T14:42:26Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by
Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
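The augmentation pipeline reduces to mixing real and TTS-synthesized examples into one training set; the ratio and data layout below are assumptions for illustration.

```python
# Hypothetical sketch: combine real ASR examples with TTS-synthesized ones.
# Each item is (audio_path, transcript); ratio and sampling are assumptions.
import random

def mix_real_and_synthetic(real_data, synth_data, synth_ratio=0.5,
                           rng=random.Random(0)):
    n_synth = int(len(real_data) * synth_ratio)
    mixed = real_data + rng.sample(synth_data, min(n_synth, len(synth_data)))
    rng.shuffle(mixed)
    return mixed

real = [(f"real_{i}.wav", f"text {i}") for i in range(100)]
synth = [(f"tts_{i}.wav", f"extra text {i}") for i in range(200)]
print(len(mix_real_and_synthetic(real, synth)))  # 150
```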
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming transformer architecture achieves 2.8% and 7.2% WER for the "clean" and "other" test data of LibriSpeech.
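Time-restricted self-attention can be sketched as a banded attention mask that limits each frame to a fixed amount of left and right context; the window sizes here are illustrative, not those of the paper.

```python
# Sketch of a time-restricted (banded) self-attention mask.
# Window sizes are illustrative assumptions.
import torch

def time_restricted_mask(T, left=20, right=4):
    """Boolean mask [T, T]: True where attention is allowed."""
    idx = torch.arange(T)
    rel = idx[None, :] - idx[:, None]      # key index minus query index
    return (rel >= -left) & (rel <= right)

print(time_restricted_mask(8, left=2, right=1).int())
```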
arXiv Detail & Related papers (2020-01-08T18:58:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.