Improving the Training Recipe for a Robust Conformer-based Hybrid Model
- URL: http://arxiv.org/abs/2206.12955v1
- Date: Sun, 26 Jun 2022 20:01:08 GMT
- Title: Improving the Training Recipe for a Robust Conformer-based Hybrid Model
- Authors: Mohammad Zeineldeen, Jingjing Xu, Christoph Lüscher, Ralf Schlüter, and Hermann Ney
- Abstract summary: We investigate various methods for speaker adaptive training (SAT) based on feature-space approaches for a conformer-based acoustic model (AM).
We propose a method, called Weighted-Simple-Add, which adds weighted speaker information vectors to the input of the multi-head self-attention module of the conformer AM.
We also extend and improve our earlier conformer-based hybrid training recipe, achieving an 11% relative improvement in word-error-rate (WER) on Hub5'00 with Switchboard 300h training.
- Score: 46.78701739177677
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Speaker adaptation is important to build robust automatic speech recognition
(ASR) systems. In this work, we investigate various methods for speaker
adaptive training (SAT) based on feature-space approaches for a conformer-based
acoustic model (AM) on the Switchboard 300h dataset. We propose a method,
called Weighted-Simple-Add, which adds weighted speaker information vectors to
the input of the multi-head self-attention module of the conformer AM. Using
this method for SAT, we achieve relative WER improvements of 3.5% and 4.5% on
the CallHome parts of Hub5'00 and Hub5'01, respectively. Moreover, we build on
our previous work, in which we proposed a novel and competitive training recipe
for a conformer-based hybrid AM. We extend and improve this recipe, achieving
an 11% relative improvement in word-error-rate (WER) on Hub5'00 with
Switchboard 300h training. We also make the recipe more efficient by reducing
the total number of parameters by 34% relative.
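To make the Weighted-Simple-Add idea concrete, below is a minimal PyTorch sketch that adds a weighted, projected speaker vector to the features entering a conformer block's multi-head self-attention (MHSA) module. The class name, dimensions, and the single learnable scalar weight are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WeightedSimpleAdd(nn.Module):
    """Add a weighted speaker vector to the MHSA input (illustrative)."""

    def __init__(self, model_dim: int, speaker_dim: int):
        super().__init__()
        # Project the speaker information vector (e.g., an i-vector)
        # into the conformer's model dimension.
        self.proj = nn.Linear(speaker_dim, model_dim)
        # Single learnable scalar weight; the paper's exact weighting
        # scheme may differ (assumption).
        self.alpha = nn.Parameter(torch.tensor(0.1))

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x:   (batch, time, model_dim), features entering MHSA
        # spk: (batch, speaker_dim), one vector per utterance/speaker
        spk_emb = self.proj(spk).unsqueeze(1)  # (batch, 1, model_dim)
        return x + self.alpha * spk_emb        # broadcast over time

# Usage with dummy tensors:
x = torch.randn(4, 100, 256)             # acoustic features
spk = torch.randn(4, 100)                # 100-dim speaker vectors
out = WeightedSimpleAdd(256, 100)(x, spk)
```

In this sketch, the learned weight lets the model control how strongly speaker information influences self-attention, rather than adding the projected vector unscaled.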
Related papers
- MLCA-AVSR: Multi-Layer Cross Attention Fusion based Audio-Visual Speech Recognition [62.89464258519723]
We propose a multi-layer cross-attention fusion based AVSR approach that promotes the representation of each modality by fusing them at different levels of the audio/visual encoders.
Our proposed approach surpasses the first-place system, establishing a new SOTA cpCER of 29.13% on this dataset.
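As a rough, hedged sketch of what cross-attention fusion between the modalities can look like (assuming equal audio/visual feature dimensions; MLCA-AVSR fuses at multiple encoder layers and its exact module may differ):

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """Audio features attend to visual features (illustrative)."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # query = audio, key/value = visual; both (batch, time, dim)
        fused, _ = self.attn(audio, visual, visual)
        return self.norm(audio + fused)  # residual connection + layer norm
```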
arXiv Detail & Related papers (2024-01-07T08:59:32Z)
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts on integrating neural speaker embeddings into a conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- Conformer-based Hybrid ASR System for Switchboard Dataset [99.88988282353206]
We present and evaluate a competitive conformer-based hybrid model training recipe.
We study different training aspects and methods to improve the word-error-rate and to increase the training speed.
We conduct experiments on the Switchboard 300h dataset, and our conformer-based hybrid model achieves competitive results.
arXiv Detail & Related papers (2021-11-05T12:03:18Z)
- A Unified Speaker Adaptation Approach for ASR [37.76683818356052]
We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation.
For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers.
For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
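The paper's pruning method is not detailed in this summary; as a generic, hedged illustration, the sketch below applies the well-known cubic gradual-pruning schedule of Zhu & Gupta (2017), masking the smallest-magnitude adaptation weights step by step:

```python
import torch

def prune_mask(weight: torch.Tensor, step: int, total_steps: int,
               final_sparsity: float = 0.9) -> torch.Tensor:
    """Binary mask that gradually zeroes small-magnitude weights."""
    # Cubic schedule: sparsity ramps from 0 to final_sparsity over
    # total_steps (Zhu & Gupta, 2017).
    frac = min(step / total_steps, 1.0)
    sparsity = final_sparsity * (1.0 - (1.0 - frac) ** 3)
    k = int(weight.numel() * sparsity)
    if k == 0:
        return torch.ones_like(weight)
    # Threshold at the k-th smallest magnitude; keep larger weights.
    threshold = weight.abs().flatten().kthvalue(k).values
    return (weight.abs() > threshold).float()
```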
arXiv Detail & Related papers (2021-10-16T10:48:52Z)
- On the limit of English conversational speech recognition [28.395662280898787]
We show that a single-headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition.
We reduce the recognition errors of our LSTM system on Switchboard-300 by 4% relative.
We report 5.9% and 11.5% WER on the SWB and CHM parts of Hub5'00 with very simple LSTM models.
arXiv Detail & Related papers (2021-05-03T16:32:38Z)
- Bayesian Learning for Deep Neural Network Adaptation [57.70991105736059]
A key task for speech recognition systems is to reduce the mismatch between training and evaluation data that is often attributable to speaker differences.
Model-based speaker adaptation approaches often require sufficient amounts of target speaker data to ensure robustness.
This paper proposes a full Bayesian learning based DNN speaker adaptation framework to model speaker-dependent (SD) parameter uncertainty.
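As a minimal, hedged illustration of modeling SD parameter uncertainty, the sketch below keeps a variational mean and log-variance for a per-dimension scaling parameter and samples it with the reparameterization trick; this shows the general Bayesian idea, not the paper's specific framework:

```python
import torch
import torch.nn as nn

class BayesianScale(nn.Module):
    """Speaker-dependent scaling with a variational posterior (illustrative)."""

    def __init__(self, dim: int):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(dim))                # posterior mean
        self.log_var = nn.Parameter(torch.full((dim,), -5.0))   # posterior log-variance

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Reparameterization trick: sample the SD parameter rather than
        # using a point estimate, so its uncertainty is modeled.
        eps = torch.randn_like(self.mu)
        scale = self.mu + eps * torch.exp(0.5 * self.log_var)
        return x * (1.0 + scale)  # per-dimension scaling of activations
```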
arXiv Detail & Related papers (2020-12-14T12:30:41Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on the LibriSpeech train-clean-100 set, with a WER of 4.3% on test-clean and 13.5% on test-other.
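At a high level, and assuming the corpus is a list of (audio, transcript) pairs, the loop can be sketched as follows; every function name here is a hypothetical placeholder, not the paper's code:

```python
def augment_with_tts(asr_corpus, train_tts, synthesize, train_asr):
    """Mix real and TTS-synthesized utterances for ASR training (sketch)."""
    tts_model = train_tts(asr_corpus)                 # TTS trained on the ASR data
    synthetic = [(synthesize(tts_model, text), text)  # re-synthesize the same
                 for _, text in asr_corpus]           # transcripts as new audio
    return train_asr(list(asr_corpus) + synthetic)    # real + synthetic pairs
```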
arXiv Detail & Related papers (2020-05-14T17:24:57Z)