Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning
with Self-Knowledge Distillation
- URL: http://arxiv.org/abs/2103.09903v1
- Date: Wed, 17 Mar 2021 21:02:36 GMT
- Title: Transformer-based ASR Incorporating Time-reduction Layer and Fine-tuning
with Self-Knowledge Distillation
- Authors: Md Akmal Haidar, Chao Xing, Mehdi Rezagholizadeh
- Abstract summary: We propose a Transformer-based ASR model with a time-reduction layer incorporated inside the Transformer encoder layers.
We also introduce a fine-tuning approach for pre-trained ASR models using self-knowledge distillation (S-KD), which further improves the performance of our ASR model.
With language model (LM) fusion, we achieve new state-of-the-art word error rate (WER) results for Transformer-based ASR models.
- Score: 11.52842516726486
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: End-to-end automatic speech recognition (ASR), unlike conventional ASR, has
no dedicated modules to learn semantic representations from the speech encoder.
Moreover, the high frame rate of the speech representation prevents the model
from learning semantic representations properly, so models built on a
lower-frame-rate speech encoder achieve better performance. For
Transformer-based ASR, a lower frame rate is important not only for learning
better semantic representations but also for reducing computational
complexity, since the self-attention mechanism has O(n^2) complexity in both
training and inference.
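To make the quadratic cost concrete, here is a back-of-the-envelope
calculation (an illustration, not a figure from the paper) of how a
frame-rate reduction factor r scales self-attention cost:

```latex
% Self-attention over n frames costs C(n) = O(n^2).
% Reducing the frame rate by a factor r shortens the sequence to n/r, so
\[
  \frac{C(n)}{C(n/r)} \approx \frac{n^2}{(n/r)^2} = r^2 ,
\]
% e.g. a single 2x time-reduction step cuts attention cost by about 4x.
```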
In this paper, we propose a Transformer-based ASR model with a time-reduction
layer: in addition to the traditional sub-sampling applied to the input
features, we incorporate a time-reduction layer inside the Transformer encoder
layers to further reduce the frame rate. This lowers the computational cost of
self-attention in both training and inference while improving performance.
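A minimal sketch of one common way to implement such a layer, concatenating
adjacent frames and projecting back to the model dimension (the paper's exact
layer design may differ):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TimeReductionLayer(nn.Module):
    """Reduce the frame rate by merging groups of adjacent frames.

    A generic sketch, not the paper's exact design: `reduction` adjacent
    frames are concatenated and projected back to `d_model`.
    """
    def __init__(self, d_model: int, reduction: int = 2):
        super().__init__()
        self.reduction = reduction
        self.proj = nn.Linear(d_model * reduction, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        b, t, d = x.shape
        # Pad the time axis so it divides evenly by the reduction factor.
        pad = (-t) % self.reduction
        if pad:
            x = F.pad(x, (0, 0, 0, pad))
            t += pad
        # Merge each group of `reduction` frames into one wider frame.
        x = x.reshape(b, t // self.reduction, d * self.reduction)
        return self.proj(x)  # (batch, time // reduction, d_model)
```

Placed between encoder layers, this shortens the sequence seen by every
subsequent self-attention block, which is where the quadratic savings come
from.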
Moreover, we introduce a fine-tuning approach for pre-trained ASR models using
self-knowledge distillation (S-KD), which further improves the performance of
our ASR model.
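A hypothetical sketch of one S-KD fine-tuning step, in which a frozen snapshot
of the pre-trained model serves as its own teacher (the paper's exact S-KD
recipe may differ; `model(feats)` returning per-token logits is an
assumption):

```python
import torch
import torch.nn.functional as F

def skd_fine_tune_step(model, teacher, batch, optimizer,
                       alpha=0.5, temperature=2.0):
    """One fine-tuning step with self-knowledge distillation (S-KD).

    `teacher` is typically a frozen snapshot of the pre-trained model,
    e.g. copy.deepcopy(model).eval(). A sketch, not the paper's recipe.
    """
    feats, targets = batch                 # acoustic features, token labels
    logits = model(feats)                  # (batch, time, vocab)
    with torch.no_grad():
        teacher_logits = teacher(feats)    # same shape, no gradients

    # Hard-label cross-entropy against the reference transcription.
    ce = F.cross_entropy(logits.transpose(1, 2), targets)

    # Soft-label KL divergence against the teacher's tempered distribution.
    kd = F.kl_div(
        F.log_softmax(logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    loss = (1 - alpha) * ce + alpha * kd   # blend the two objectives
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```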
Experiments on the LibriSpeech datasets show that our proposed methods
outperform all other Transformer-based ASR systems. Furthermore, with language
model (LM) fusion, we achieve new state-of-the-art word error rate (WER)
results for Transformer-based ASR models with just 30 million parameters
trained without any external data.
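For illustration, shallow fusion is one common form of LM fusion; a minimal
sketch of a single decoding step under that assumption (the abstract does not
specify the exact fusion scheme):

```python
import torch

def shallow_fusion_step(asr_log_probs: torch.Tensor,
                        lm_log_probs: torch.Tensor,
                        lm_weight: float = 0.3) -> torch.Tensor:
    """Fuse ASR and LM next-token scores for one beam-search step.

    asr_log_probs, lm_log_probs: (beam, vocab) log-probabilities from the
    ASR decoder and the external LM. `lm_weight` is tuned on dev data.
    """
    return asr_log_probs + lm_weight * lm_log_probs

# During beam search, expand hypotheses with the fused scores, e.g.:
#   fused = shallow_fusion_step(asr_lp, lm_lp)
#   scores, tokens = fused.topk(beam_size, dim=-1)
```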
Related papers
- Are Transformers in Pre-trained LM A Good ASR Encoder? An Empirical Study [52.91899050612153]
This work examines transformers within pre-trained language models (PLMs) when repurposed as encoders for automatic speech recognition (ASR).
Our findings reveal a notable improvement in Character Error Rate (CER) and Word Error Rate (WER) across diverse ASR tasks when transformers from pre-trained LMs are incorporated.
This underscores the potential of leveraging the semantic prowess embedded within pre-trained transformers to advance ASR systems' capabilities.
arXiv Detail & Related papers (2024-09-26T11:31:18Z)
- A Lexical-aware Non-autoregressive Transformer-based ASR Model [9.500518278458905]
We propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder.
LA-NAT aims to make the ASR model aware of lexical information, so the resulting model is expected to achieve better results by leveraging the learned linguistic knowledge.
arXiv Detail & Related papers (2023-05-18T09:50:47Z)
- Streaming Multi-Talker ASR with Token-Level Serialized Output Training [53.11450530896623]
Token-level serialized output training (t-SOT) is a novel framework for streaming multi-talker automatic speech recognition.
The t-SOT model has the advantages of less inference cost and a simpler model architecture.
For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost.
arXiv Detail & Related papers (2022-02-02T01:27:21Z)
- Attention-based Multi-hypothesis Fusion for Speech Summarization [83.04957603852571]
Speech summarization can be achieved by combining automatic speech recognition (ASR) and text summarization (TS).
ASR errors directly affect the quality of the output summary in the cascade approach.
We propose a cascade speech summarization model that is robust to ASR errors and that exploits multiple hypotheses generated by ASR to attenuate the effect of ASR errors on the summary.
arXiv Detail & Related papers (2021-11-16T03:00:29Z)
- Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition [31.2558640840697]
We propose a cross-modal transformer-based neural correction model that refines the output of an automatic speech recognition system.
Experiments on Japanese natural-language ASR tasks demonstrated that our proposed model achieves better ASR performance than conventional neural correction models.
arXiv Detail & Related papers (2021-07-04T07:58:31Z)
- N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses [0.0]
Spoken Language Understanding (SLU) parses speech into semantic structures like dialog acts and slots.
We show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime.
arXiv Detail & Related papers (2021-06-11T17:29:00Z)
- Advanced Long-context End-to-end Speech Recognition Using Context-expanded Transformers [56.56220390953412]
We extend our prior work by introducing the Conformer architecture to further improve the accuracy.
We demonstrate that the extended Transformer provides state-of-the-art end-to-end ASR performance.
arXiv Detail & Related papers (2021-04-19T16:18:00Z)
- Non-autoregressive Transformer-based End-to-end ASR using BERT [13.07939371864781]
This paper presents a transformer-based end-to-end automatic speech recognition (ASR) model based on BERT.
A series of experiments conducted on the AISHELL-1 dataset demonstrates competitive or superior results.
arXiv Detail & Related papers (2021-04-10T16:22:17Z)
- Conv-Transformer Transducer: Low Latency, Low Frame Rate, Streamable End-to-End Speech Recognition [8.046120977786702]
The Transformer has achieved competitive performance against state-of-the-art end-to-end models in automatic speech recognition (ASR).
The original Transformer, with encoder-decoder architecture, is only suitable for offline ASR.
We show that this architecture, named Conv-Transformer Transducer, achieves competitive performance on LibriSpeech dataset (3.6% WER on test-clean) without external language models.
arXiv Detail & Related papers (2020-08-13T08:20:02Z)
- Pretraining Techniques for Sequence-to-Sequence Voice Conversion [57.65753150356411]
Sequence-to-sequence (seq2seq) voice conversion (VC) models are attractive owing to their ability to convert prosody.
We propose to transfer knowledge from other speech processing tasks where large-scale corpora are easily available, typically text-to-speech (TTS) and automatic speech recognition (ASR).
We argue that VC models with such pretrained ASR or TTS model parameters can generate effective hidden representations for high-fidelity, highly intelligible converted speech.
arXiv Detail & Related papers (2020-08-07T11:02:07Z)
- Streaming automatic speech recognition with the transformer model [59.58318952000571]
We propose a transformer-based end-to-end ASR system for streaming ASR.
We apply time-restricted self-attention for the encoder and triggered attention for the encoder-decoder attention mechanism.
Our proposed streaming Transformer architecture achieves 2.8% and 7.2% WER on the "clean" and "other" test sets of LibriSpeech, respectively.
arXiv Detail & Related papers (2020-01-08T18:58:02Z)