Arabic Speech Recognition by End-to-End, Modular Systems and Human
- URL: http://arxiv.org/abs/2101.08454v1
- Date: Thu, 21 Jan 2021 05:55:29 GMT
- Title: Arabic Speech Recognition by End-to-End, Modular Systems and Human
- Authors: Amir Hussein, Shinji Watanabe, Ahmed Ali
- Abstract summary: We perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition.
For ASR, the end-to-end systems achieved 12.5%, 27.5%, and 33.8% WER, new performance milestones for the MGB2, MGB3, and MGB5 challenges respectively.
Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.6% on average.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in automatic speech recognition (ASR) have achieved accuracy
levels comparable to human transcribers, which led researchers to debate whether
machines have reached human performance. Previous work focused on the English
language and modular hidden Markov model-deep neural network (HMM-DNN) systems.
In this paper, we perform a comprehensive benchmarking for end-to-end
transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the
Arabic language and its dialects. For the HSR, we evaluate linguist performance
and lay-native speaker performance on a new dataset collected as a part of this
study. For ASR, the end-to-end systems achieved 12.5%, 27.5%, and 33.8% WER, new
performance milestones for the MGB2, MGB3, and MGB5 challenges respectively. Our
results suggest that human performance in the Arabic language is still
considerably better than the machine with an absolute WER gap of 3.6% on
average.
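The WER figures reported above are word-level edit distances (substitutions, insertions, and deletions) normalized by reference length. A minimal sketch of the metric in plain Python (an illustrative helper, not the official MGB scoring tool):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edits needed to turn the first i ref words into the first j hyp words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # sub / del / ins
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sat on mat"))  # 1 deletion over 6 words
```

An absolute gap of 3.6% WER, as reported for humans vs. machines above, corresponds to roughly 3 to 4 extra word errors per 100 reference words.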
Related papers
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-regularized spectral basis embedding (SBE) features that exploit a special regularization term to enforce homogeneity of speaker features in adaptation.
Feature-based learning hidden unit contributions (f-LHUC) conditioned on VR-LH features, which are shown to be insensitive to speaker-level data quantity in test-time adaptation.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
- Automatic Speech Recognition Advancements for Indigenous Languages of the Americas [0.0]
The Second Americas (Americas Natural Language Processing) Competition Track 1 of NeurIPS (Neural Information Processing Systems) 2022 proposed the task of training automatic speech recognition systems for five Indigenous languages: Quechua, Guarani, Bribri, Kotiria, and Wa'ikhana.
We describe the fine-tuning of a state-of-the-art ASR model for each target language, using approximately 36.65 h of transcribed speech data from diverse sources enriched with data augmentation methods.
We release our best models for each language, marking the first open ASR models for Wa'ikhana and Kotiria.
arXiv Detail & Related papers (2024-04-12T10:12:38Z)
- Employing Hybrid Deep Neural Networks on Dari Speech [0.0]
This article focuses on the recognition of individual words in the Dari language using the Mel-frequency cepstral coefficients (MFCCs) feature extraction method.
We evaluate three different deep neural network models: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Multilayer Perceptron (MLP).
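MFCC features like those used above rest on the mel scale, a perceptually motivated warping of frequency along which the filterbank centers are spaced evenly. A small stdlib-only sketch of the standard conversion (the filterbank placement here is illustrative, not the Dari paper's exact configuration):

```python
import math

def hz_to_mel(f_hz):
    # Standard (HTK-style) mel scale used when placing MFCC filterbanks.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    # Inverse mapping back to Hertz.
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Center frequencies of a small mel filterbank between 0 Hz and 8 kHz:
# evenly spaced in mel, hence increasingly spread out in Hz.
n_filters = 5
lo, hi = hz_to_mel(0.0), hz_to_mel(8000.0)
centers = [mel_to_hz(lo + (hi - lo) * i / (n_filters + 1))
           for i in range(1, n_filters + 1)]
print([round(c) for c in centers])
```

In a full MFCC pipeline, log filterbank energies over these bands are decorrelated with a discrete cosine transform to produce the cepstral coefficients.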
arXiv Detail & Related papers (2023-05-04T23:10:53Z)
- From English to More Languages: Parameter-Efficient Model Reprogramming for Cross-Lingual Speech Recognition [50.93943755401025]
We propose a new parameter-efficient learning framework based on neural model reprogramming for cross-lingual speech recognition.
We design different auxiliary neural architectures focusing on learnable pre-trained feature enhancement.
Our methods outperform existing ASR tuning architectures and their extension with self-supervised losses.
arXiv Detail & Related papers (2023-01-19T02:37:56Z)
- Analyzing And Improving Neural Speaker Embeddings for ASR [54.30093015525726]
We present our efforts on integrating neural speaker embeddings into a Conformer-based hybrid HMM ASR system.
Our best Conformer-based hybrid ASR system with speaker embeddings achieves 9.0% WER on Hub5'00 and Hub5'01 with training on SWB 300h.
arXiv Detail & Related papers (2023-01-11T16:56:03Z)
- Finnish Parliament ASR corpus - Analysis, benchmarks and statistics [11.94655679070282]
The Finnish parliament corpus is the largest publicly available collection of manually transcribed speech data for Finnish, with over 3,000 hours of speech and 449 speakers.
This corpus builds on earlier initial work, and as a result the corpus has a natural split into two training subsets from two periods of time.
We develop a complete Kaldi-based data preparation pipeline, and hidden Markov model (HMM), hybrid deep neural network (HMM-DNN) and attention-based encoder-decoder (AED) ASR recipes.
arXiv Detail & Related papers (2022-03-28T16:29:49Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with a 22.2% character error rate (CER) and a 38.9% word error rate (WER).
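BPE-dropout, the technique behind the dynamic acoustic unit augmentation above, regularizes subword segmentation by randomly skipping merge operations during training, so the same word yields varying token sequences. A minimal sketch with a hypothetical three-rule merge table (not the paper's actual vocabulary):

```python
import random

def bpe_segment(word, merges, dropout=0.0, rng=None):
    """Apply BPE merges in priority order, skipping each candidate merge
    with probability `dropout` (BPE-dropout); dropout=0.0 is standard BPE."""
    rng = rng or random.Random()
    tokens = list(word)
    rank = {pair: i for i, pair in enumerate(merges)}
    while True:
        # Collect adjacent pairs that are in the merge table and survive dropout.
        candidates = [(rank[(a, b)], i)
                      for i, (a, b) in enumerate(zip(tokens, tokens[1:]))
                      if (a, b) in rank and rng.random() >= dropout]
        if not candidates:
            return tokens
        _, i = min(candidates)  # apply the highest-priority surviving merge
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]

merges = [("l", "o"), ("lo", "w"), ("e", "r")]
print(bpe_segment("lower", merges))               # deterministic: ['low', 'er']
print(bpe_segment("lower", merges, dropout=0.5))  # stochastic segmentation
```

Exposing the model to these alternative segmentations at training time is what improves robustness to rare and out-of-vocabulary words in low-resource setups.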
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- Domain Adversarial Neural Networks for Dysarthric Speech Recognition [21.550420336634726]
This work explores domain adversarial neural networks (DANN) for speaker-independent speech recognition.
The classification task on 10 spoken digits is performed using an end-to-end CNN taking raw audio as input.
Experiments conducted in this paper show that DANN achieves an absolute recognition rate of 74.91% and outperforms the baseline by 12.18%.
arXiv Detail & Related papers (2020-10-07T19:51:41Z)
- You Do Not Need More Data: Improving End-To-End Speech Recognition by Text-To-Speech Data Augmentation [59.31769998728787]
We build our TTS system on an ASR training database and then extend the data with synthesized speech to train a recognition model.
Our system establishes a competitive result for end-to-end ASR trained on LibriSpeech train-clean-100 set with WER 4.3% for test-clean and 13.5% for test-other.
arXiv Detail & Related papers (2020-05-14T17:24:57Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.