A Conformer Based Acoustic Model for Robust Automatic Speech Recognition
- URL: http://arxiv.org/abs/2203.00725v1
- Date: Tue, 1 Mar 2022 20:17:31 GMT
- Title: A Conformer Based Acoustic Model for Robust Automatic Speech Recognition
- Authors: Yufeng Yang, Peidong Wang, DeLiang Wang
- Abstract summary: The proposed model builds on a state-of-the-art recognition system using a bi-directional long short-term memory (BLSTM) model with utterance-wise dropout and iterative speaker adaptation.
The Conformer encoder uses a convolution-augmented attention mechanism for acoustic modeling.
The proposed system is evaluated on the monaural ASR task of the CHiME-4 corpus.
- Score: 63.242128956046024
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This study addresses robust automatic speech recognition (ASR) by introducing
a Conformer-based acoustic model. The proposed model builds on a
state-of-the-art recognition system using a bi-directional long short-term
memory (BLSTM) model with utterance-wise dropout and iterative speaker
adaptation, but employs a Conformer encoder instead of the BLSTM network. The
Conformer encoder uses a convolution-augmented attention mechanism for acoustic
modeling. The proposed system is evaluated on the monaural ASR task of the
CHiME-4 corpus. Coupled with utterance-wise normalization and speaker
adaptation, our model achieves a $6.25\%$ word error rate, outperforming the
previous best system by $8.4\%$ relative. In addition, the proposed
Conformer-based model is $18.3\%$ smaller in model size and reduces training
time by $88.5\%$.
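To make the architecture concrete, below is a minimal PyTorch sketch of a single Conformer block in the standard macaron layout (half-step feed-forward, self-attention, convolution module, half-step feed-forward). This is a generic reconstruction, not the authors' code: it omits the relative positional encoding of the original Conformer, and the dimensions, kernel size, and dropout rates are illustrative assumptions.

```python
# Minimal sketch of one Conformer block (macaron FFN -> MHSA -> conv -> FFN).
# Generic reconstruction of the structure named in the abstract; it is NOT the
# paper's code and omits relative positional attention for brevity.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pointwise1 = nn.Conv1d(dim, 2 * dim, 1)          # expand for GLU
        self.glu = nn.GLU(dim=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()                                  # "swish"
        self.pointwise2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                 # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)  # -> (batch, dim, time) for conv
        y = self.glu(self.pointwise1(y))
        y = self.act(self.bn(self.depthwise(y)))
        y = self.pointwise2(y).transpose(1, 2)
        return x + y                      # residual connection

class ConformerBlock(nn.Module):
    def __init__(self, dim=256, heads=4, ff_mult=4, dropout=0.1):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                                 nn.SiLU(), nn.Dropout(dropout),
                                 nn.Linear(ff_mult * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, dropout=dropout,
                                          batch_first=True)
        self.conv = ConvModule(dim)
        self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ff_mult * dim),
                                 nn.SiLU(), nn.Dropout(dropout),
                                 nn.Linear(ff_mult * dim, dim))
        self.final_norm = nn.LayerNorm(dim)

    def forward(self, x):                  # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)          # macaron half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]
        x = self.conv(x)                   # conv module adds its own residual
        x = x + 0.5 * self.ff2(x)
        return self.final_norm(x)

frames = torch.randn(2, 100, 256)          # (batch, frames, features)
print(ConformerBlock()(frames).shape)      # torch.Size([2, 100, 256])
```

In an acoustic model, a stack of such blocks would replace the BLSTM layers while the rest of the recognition pipeline (features, normalization, speaker adaptation) stays unchanged.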
Related papers
- Enhancing Quantised End-to-End ASR Models via Personalisation [12.971231464928806]
We propose a novel personalisation strategy for quantised models (PQM).
PQM uses a 4-bit NormalFloat quantisation (NF4) approach for model quantisation and low-rank adaptation (LoRA) for speaker adaptive training (SAT).
Experiments were performed on the LibriSpeech and TED-LIUM 3 corpora.
arXiv Detail & Related papers (2023-09-17T02:35:21Z)
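As a rough illustration of the LoRA side of the PQM entry above, here is a generic PyTorch sketch that adds trainable low-rank factors to a frozen linear layer. The rank, scaling, and initialization are assumptions, and the NF4 quantisation step (normally handled by a 4-bit quantisation library) is not shown.

```python
# Generic LoRA sketch for speaker adaptation of a frozen (e.g. quantised)
# linear layer, in the spirit of PQM's LoRA-based SAT; not the paper's code,
# and the NF4 quantisation of the base weights is omitted here.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the base weights
            p.requires_grad = False
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base path plus low-rank update B @ A, trained per speaker.
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(256, 256))
out = layer(torch.randn(4, 256))              # only A and B receive gradients
```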
- Continual Learning for On-Device Speech Recognition using Disentangled Conformers [54.32320258055716]
We introduce a continual learning benchmark for speaker-specific domain adaptation derived from LibriVox audiobooks.
We propose a novel compute-efficient continual learning algorithm called DisentangledCL.
Our experiments show that the DisConformer models significantly outperform baselines on general ASR.
arXiv Detail & Related papers (2022-12-02T18:58:51Z)
- Rapid Connectionist Speaker Adaptation [3.00476084358666]
We present SVCnet, a system for modelling speaker variability.
Neural networks specialized for each speech sound produce low-dimensional models of acoustic variation.
A training procedure is described which minimizes the dependence of this model on which sounds have been uttered.
arXiv Detail & Related papers (2022-11-15T00:15:11Z)
- Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings [53.11450530896623]
This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what."
Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion.
The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
arXiv Detail & Related papers (2022-03-30T21:42:00Z)
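The t-SOT idea in the entry above can be illustrated with a toy serialization: tokens from overlapping speakers are merged into a single stream in emission-time order, with a special channel-change token marking switches. This is a schematic sketch only; the token names and timestamps below are invented.

```python
# Toy illustration of t-SOT-style target serialization: merge per-speaker
# token streams by emission time, inserting a <cc> (channel change) token
# whenever the active "virtual channel" switches. All data here is invented.
CC = "<cc>"

def serialize(streams):
    """streams: list of [(time, token), ...], one list per virtual channel."""
    events = sorted((t, ch, tok) for ch, s in enumerate(streams) for t, tok in s)
    out, prev_ch = [], None
    for _, ch, tok in events:
        if prev_ch is not None and ch != prev_ch:
            out.append(CC)                    # mark the channel switch
        out.append(tok)
        prev_ch = ch
    return out

spk0 = [(0.0, "hello"), (0.4, "world")]
spk1 = [(0.2, "good"), (0.5, "morning")]
print(serialize([spk0, spk1]))
# ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```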
- Exploiting Cross Domain Acoustic-to-articulatory Inverted Features For Disordered Speech Recognition [57.15942628305797]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems for normal speech.
This paper presents a cross-domain acoustic-to-articulatory (A2A) inversion approach that utilizes the parallel acoustic-articulatory data of the 15-hour TORGO corpus in model training.
The model is then cross-domain adapted to the 102.7-hour UASpeech corpus to produce articulatory features.
arXiv Detail & Related papers (2022-03-19T08:47:18Z)
- Mitigating Closed-model Adversarial Examples with Bayesian Neural Modeling for Enhanced End-to-End Speech Recognition [18.83748866242237]
We focus on a rigorous and empirical "closed-model adversarial robustness" setting.
We propose an advanced Bayesian neural network (BNN) based adversarial detector.
We improve detection rate by +2.77 to +5.42% (relative +3.03 to +6.26%) and reduce the word error rate by 5.02 to 7.47% on LibriSpeech datasets.
arXiv Detail & Related papers (2022-02-17T09:17:58Z)
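As a loose illustration of uncertainty-based detection in the spirit of the BNN detector above, the sketch below uses Monte Carlo dropout, a common approximation to Bayesian inference. The paper's actual detector, features, and thresholds are not reproduced; the model and threshold here are placeholders.

```python
# Monte Carlo dropout sketch for uncertainty-based detection, a common
# approximation to a Bayesian neural network; NOT the paper's detector.
import torch
import torch.nn as nn

def mc_dropout_std(model: nn.Module, x: torch.Tensor, samples: int = 16):
    model.train()                        # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x).softmax(-1) for _ in range(samples)])
    return preds.std(0).mean().item()    # high spread suggests a suspect input

net = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Dropout(0.3),
                    nn.Linear(128, 10))  # placeholder classifier
score = mc_dropout_std(net, torch.randn(8, 40))
flagged = score > 0.05                   # threshold would be tuned on held-out data
```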
- A Unified Speaker Adaptation Approach for ASR [37.76683818356052]
We propose a unified speaker adaptation approach consisting of feature adaptation and model adaptation.
For feature adaptation, we employ a speaker-aware persistent memory model which generalizes better to unseen test speakers.
For model adaptation, we use a novel gradual pruning method to adapt to target speakers without changing the model architecture.
arXiv Detail & Related papers (2021-10-16T10:48:52Z)
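The gradual pruning idea in the entry above can be loosely illustrated with PyTorch's pruning utilities: prune a little, fine-tune on target-speaker data, and repeat. The schedule below and the way adapted parameters would be selected per speaker are assumptions, not the paper's method.

```python
# Hedged sketch of gradual magnitude pruning with torch.nn.utils.prune;
# the paper's actual schedule and adaptation recipe are not shown here.
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(256, 256)
for step_amount in (0.1, 0.1, 0.1):      # prune gradually over several steps
    prune.l1_unstructured(layer, name="weight", amount=step_amount)
    # ... fine-tune on target-speaker data between pruning steps ...
sparsity = (layer.weight == 0).float().mean().item()
print(f"weight sparsity: {sparsity:.2f}")
```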
- Residual Adapters for Parameter-Efficient ASR Adaptation to Atypical and Accented Speech [5.960279280033886]
We show that by adding a relatively small number of extra parameters to the encoder layers via so-called residual adapters, we can achieve adaptation gains similar to those of full model fine-tuning.
We demonstrate this on two speech adaptation tasks (atypical and accented speech) and for two state-of-the-art ASR architectures.
arXiv Detail & Related papers (2021-09-14T20:04:47Z)
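Below is a minimal sketch of a residual adapter of the kind the entry above describes: a small bottleneck inserted into frozen encoder layers, initialized so it starts as an identity mapping. Dimensions are illustrative, not the paper's configuration.

```python
# Minimal residual adapter sketch (bottleneck down/up projection with a
# residual connection); sizes are illustrative assumptions.
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    def __init__(self, dim=256, bottleneck=32):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)        # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(self.norm(x))))

h = torch.randn(2, 100, 256)                  # frozen encoder activations
print(ResidualAdapter()(h).shape)             # torch.Size([2, 100, 256])
```

Only the adapter parameters are trained per speaker or accent, which is what makes the approach parameter-efficient relative to fine-tuning the whole encoder.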
- Any-to-Many Voice Conversion with Location-Relative Sequence-to-Sequence Modeling [61.351967629600594]
This paper proposes an any-to-many location-relative, sequence-to-sequence (seq2seq), non-parallel voice conversion approach.
In this approach, we combine a bottleneck feature extractor (BNE) with a seq2seq synthesis module.
Objective and subjective evaluations show that the proposed any-to-many approach has superior voice conversion performance in terms of both naturalness and speaker similarity.
arXiv Detail & Related papers (2020-09-06T13:01:06Z)
- Deliberation Model Based Two-Pass End-to-End Speech Recognition [52.45841282906516]
A two-pass model has been proposed to rescore streamed hypotheses using the non-streaming Listen, Attend and Spell (LAS) model.
The model attends to acoustics to rescore hypotheses, as opposed to a class of neural correction models that use only first-pass text hypotheses.
A bidirectional encoder is used to extract context information from first-pass hypotheses.
arXiv Detail & Related papers (2020-03-17T22:01:12Z)
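Schematically, a deliberation-style second pass attends both to the acoustic encoder outputs and to an encoding of the first-pass hypothesis, as the entry above describes. The sketch below only wires up the two attention paths with untrained placeholder modules; all sizes and module choices are assumptions.

```python
# Shape-level sketch of a deliberation rescoring step: the second-pass decoder
# attends to acoustics AND to a bidirectionally encoded first-pass hypothesis.
# Untrained placeholder modules; sizes are illustrative assumptions.
import torch
import torch.nn as nn

dim = 256
acoustic = torch.randn(2, 100, dim)            # encoder outputs (batch, T, dim)
hyp_tokens = torch.randint(0, 1000, (2, 12))   # first-pass hypothesis ids

embed = nn.Embedding(1000, dim)
hyp_encoder = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)
hyp_ctx, _ = hyp_encoder(embed(hyp_tokens))    # context from first-pass text

attn_acoustic = nn.MultiheadAttention(dim, 4, batch_first=True)
attn_hyp = nn.MultiheadAttention(dim, 4, batch_first=True)

query = torch.randn(2, 1, dim)                 # current decoder state
a_ctx, _ = attn_acoustic(query, acoustic, acoustic)
h_ctx, _ = attn_hyp(query, hyp_ctx, hyp_ctx)
fused = torch.cat([a_ctx, h_ctx], dim=-1)      # fed to the output layer
print(fused.shape)                             # torch.Size([2, 1, 512])
```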