Employing Hybrid Deep Neural Networks on Dari Speech
- URL: http://arxiv.org/abs/2305.03200v1
- Date: Thu, 4 May 2023 23:10:53 GMT
- Title: Employing Hybrid Deep Neural Networks on Dari Speech
- Authors: Jawid Ahmad Baktash and Mursal Dawodi
- Abstract summary: This article focuses on the recognition of individual words in the Dari language using the Mel-frequency cepstral coefficients (MFCCs) feature extraction method.
We evaluate three different deep neural network models: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Multilayer Perceptron (MLP), as well as two hybrid models combining CNN and RNN.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper is an extension of our previous conference paper. In recent years,
there has been a growing interest among researchers in developing and improving
speech recognition systems to facilitate and enhance human-computer
interaction. Today, Automatic Speech Recognition (ASR) systems have become
ubiquitous, used in everything from games to translation systems, robots, and
more. However, much research is still needed on speech recognition systems for
low-resource languages. This article focuses on the recognition of individual
words in the Dari language using the Mel-frequency cepstral coefficients
(MFCCs) feature extraction method and three different deep neural network
models: Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and
Multilayer Perceptron (MLP), as well as two hybrid models combining CNN and
RNN. We evaluate these models using an isolated Dari word corpus that we have
created, consisting of 1000 utterances for 20 short Dari terms. Our study
achieved an impressive average accuracy of 98.365%.
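The paper does not publish an implementation, so the following is only a minimal sketch of the pipeline the abstract describes: MFCC features extracted from short utterances and fed to a hybrid CNN-RNN classifier over 20 word classes. The library choices (librosa, PyTorch), MFCC settings, padding scheme, and layer sizes are all illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch of the MFCC -> hybrid CNN-RNN pipeline from the abstract.
# All settings below (n_mfcc, frame cap, layer sizes) are assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

N_MFCC = 13        # assumed number of cepstral coefficients
N_CLASSES = 20     # the paper's 20 short Dari terms
MAX_FRAMES = 100   # assumed fixed clip length; shorter clips are zero-padded

def extract_mfcc(path: str) -> torch.Tensor:
    """Load one utterance and return a (MAX_FRAMES, N_MFCC) feature matrix."""
    wave, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=N_MFCC).T  # (frames, n_mfcc)
    mfcc = mfcc[:MAX_FRAMES]
    pad = np.zeros((MAX_FRAMES - mfcc.shape[0], N_MFCC), dtype=mfcc.dtype)
    return torch.from_numpy(np.vstack([mfcc, pad])).float()

class HybridCnnRnn(nn.Module):
    """CNN front end over the MFCC 'image', then a bidirectional GRU over time."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # halve the time axis, keep the cepstral axis
        )
        self.rnn = nn.GRU(input_size=32 * N_MFCC, hidden_size=64,
                          batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * 64, N_CLASSES)

    def forward(self, x):                     # x: (batch, MAX_FRAMES, N_MFCC)
        z = self.conv(x.unsqueeze(1))         # (batch, 32, MAX_FRAMES/2, N_MFCC)
        z = z.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * N_MFCC)
        _, h = self.rnn(z)                    # h: (2, batch, 64), both directions
        return self.head(torch.cat([h[0], h[1]], dim=1))

model = HybridCnnRnn()
logits = model(torch.randn(8, MAX_FRAMES, N_MFCC))  # dummy batch of 8 clips
print(logits.shape)  # torch.Size([8, 20])
```

A real run would train this with cross-entropy on the 1000-utterance corpus under a train/test split; those details are omitted here.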
Related papers
- Training Neural Networks as Recognizers of Formal Languages [87.06906286950438]
Formal language theory pertains specifically to recognizers, yet it is common to instead use proxy tasks that are similar only in an informal sense.
We correct this mismatch by training and evaluating neural networks directly as binary classifiers of strings.
arXiv Detail & Related papers (2024-11-11T16:33:25Z)
- Keyword spotting -- Detecting commands in speech using deep learning [2.709166684084394]
We implement feature engineering by converting raw waveforms to Mel Frequency Cepstral Coefficients (MFCCs).
In our experiments, an RNN with BiLSTM and attention achieves the best performance, with an accuracy of 93.9% (a toy sketch of this architecture appears after this list).
arXiv Detail & Related papers (2023-12-09T19:04:17Z)
- Improved Contextual Recognition In Automatic Speech Recognition Systems By Semantic Lattice Rescoring [4.819085609772069]
We propose a novel approach for enhancing contextual recognition within ASR systems via semantic lattice processing.
Our solution consists of using Hidden Markov Models and Gaussian Mixture Models (HMM-GMM) along with deep neural network (DNN) models for better accuracy.
We demonstrate the effectiveness of our proposed framework on the LibriSpeech dataset with empirical analyses.
arXiv Detail & Related papers (2023-10-14T23:16:05Z)
- Canonical Cortical Graph Neural Networks and its Application for Speech Enhancement in Future Audio-Visual Hearing Aids [0.726437825413781]
This paper proposes a more biologically plausible self-supervised machine learning approach that combines multimodal information using intra-layer modulations together with canonical correlation analysis (CCA).
The approach outperformed recent state-of-the-art results in both clean audio reconstruction and energy efficiency, the latter reflected in a reduced and smoother neuron firing rate distribution.
arXiv Detail & Related papers (2022-06-06T15:20:07Z)
- Toward a realistic model of speech processing in the brain with self-supervised learning [67.7130239674153]
Self-supervised algorithms trained on the raw waveform constitute promising candidates.
We show that Wav2Vec 2.0 learns brain-like representations with as little as 600 hours of unlabelled speech.
arXiv Detail & Related papers (2022-06-03T17:01:46Z)
- Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data-intensive deep neural network (DNN) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z)
- Real-time Speaker counting in a cocktail party scenario using Attention-guided Convolutional Neural Network [60.99112031408449]
We propose a real-time, single-channel attention-guided Convolutional Neural Network (CNN) to estimate the number of active speakers in overlapping speech.
The proposed system extracts higher-level information from the speech spectral content using a CNN model.
Experiments on simulated overlapping speech using the WSJ corpus show that attention improves performance by almost 3% absolute over conventional temporal average pooling (see the pooling sketch after this list).
arXiv Detail & Related papers (2021-10-30T19:24:57Z)
- Is Attention always needed? A Case Study on Language Identification from Speech [1.162918464251504]
The present study introduces a convolutional recurrent neural network (CRNN) based LID system.
The CRNN-based LID model is designed to operate on Mel-frequency cepstral coefficient (MFCC) features of audio samples.
The LID model exhibits high-performance levels ranging from 97% to 100% for languages that are linguistically similar.
arXiv Detail & Related papers (2021-10-05T16:38:57Z)
- On the Effectiveness of Neural Text Generation based Data Augmentation for Recognition of Morphologically Rich Speech [0.0]
We have significantly improved the online performance of a conversational speech transcription system by transferring knowledge from an RNNLM to the single-pass BNLM with text-generation-based data augmentation.
We show that by using the RNN-BNLM in the first pass followed by a neural second pass, offline ASR results can be significantly improved further.
arXiv Detail & Related papers (2020-06-09T09:01:04Z)
- Conformer: Convolution-augmented Transformer for Speech Recognition [60.119604551507805]
Recently, Transformer and convolutional neural network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR).
We propose the convolution-augmented transformer for speech recognition, named Conformer (a sketch of one Conformer block appears after this list).
On the widely used LibriSpeech benchmark, our model achieves WERs of 2.1%/4.3% without using a language model and 1.9%/3.9% with an external language model on test/test-other.
arXiv Detail & Related papers (2020-05-16T20:56:25Z)
- AutoSpeech: Neural Architecture Search for Speaker Recognition [108.69505815793028]
We propose the first neural architecture search approach for speaker recognition tasks, named AutoSpeech.
Our algorithm first identifies the optimal operation combination in a neural cell and then derives a CNN model by stacking the neural cell multiple times.
Results demonstrate that the derived CNN architectures significantly outperform current speaker recognition systems based on VGG-M, ResNet-18, and ResNet-34 backbones, while enjoying lower model complexity.
arXiv Detail & Related papers (2020-05-07T02:53:47Z)
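As referenced in the keyword-spotting entry above, that paper's best model is an RNN with BiLSTM and attention over MFCC inputs. The toy sketch below shows one common way to wire such a model; the hidden size, the per-frame additive attention, and the 12-keyword output are illustrative assumptions, not the paper's exact architecture.

```python
# Toy BiLSTM + attention keyword spotter; all sizes are assumptions.
import torch
import torch.nn as nn

class AttentiveBiLstmKws(nn.Module):
    def __init__(self, n_mfcc=40, hidden=128, n_keywords=12):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden, 1)   # one attention score per frame
        self.head = nn.Linear(2 * hidden, n_keywords)

    def forward(self, mfcc):                    # mfcc: (batch, frames, n_mfcc)
        states, _ = self.lstm(mfcc)             # (batch, frames, 2 * hidden)
        weights = torch.softmax(self.score(states), dim=1)  # over the time axis
        context = (weights * states).sum(dim=1) # attention-weighted summary
        return self.head(context)               # (batch, n_keywords)

logits = AttentiveBiLstmKws()(torch.randn(4, 98, 40))  # 4 dummy MFCC clips
print(logits.shape)  # torch.Size([4, 12])
```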
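The speaker-counting entry attributes its roughly 3% absolute gain to replacing temporal average pooling with attention over per-frame CNN embeddings. The snippet below contrasts the two pooling schemes; the embedding size, scoring layer, and 0-10 speaker range are assumptions.

```python
# Temporal average pooling vs. attention pooling over per-frame embeddings.
import torch
import torch.nn as nn

frames = torch.randn(4, 200, 256)  # (batch, time, CNN embedding dim), dummy data

# Baseline: temporal average pooling weights every frame equally.
avg_pooled = frames.mean(dim=1)                 # (4, 256)

# Attention pooling: a learned score decides which frames matter.
score = nn.Linear(256, 1)
weights = torch.softmax(score(frames), dim=1)   # (4, 200, 1), sums to 1 over time
att_pooled = (weights * frames).sum(dim=1)      # (4, 256)

count_head = nn.Linear(256, 11)  # e.g. classify 0..10 concurrent speakers
print(count_head(att_pooled).shape)  # torch.Size([4, 11])
```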
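Finally, the Conformer entry describes a convolution-augmented Transformer. The sketch below implements one Conformer block in its published macaron-like layout (half-step feed-forward, self-attention, convolution module, half-step feed-forward); the sizes are assumptions, and details such as relative positional encoding and dropout are omitted.

```python
# Minimal single Conformer block; dimensions are illustrative.
import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    def __init__(self, d=256, heads=4, kernel=31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.att_norm = nn.LayerNorm(d)
        self.att = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(d)
        self.conv = nn.Sequential(                       # convolution module
            nn.Conv1d(d, 2 * d, 1), nn.GLU(dim=1),       # pointwise + gate
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise
            nn.BatchNorm1d(d), nn.SiLU(), nn.Conv1d(d, d, 1),        # pointwise
        )
        self.ffn2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d),
                                  nn.SiLU(), nn.Linear(4 * d, d))
        self.out_norm = nn.LayerNorm(d)

    def forward(self, x):                      # x: (batch, time, d)
        x = x + 0.5 * self.ffn1(x)             # half-step feed-forward
        a = self.att_norm(x)
        x = x + self.att(a, a, a, need_weights=False)[0]
        c = self.conv_norm(x).transpose(1, 2)  # Conv1d expects (batch, d, time)
        x = x + self.conv(c).transpose(1, 2)
        x = x + 0.5 * self.ffn2(x)             # second half-step feed-forward
        return self.out_norm(x)

y = ConformerBlock()(torch.randn(2, 50, 256))  # dummy encoder frames
print(y.shape)  # torch.Size([2, 50, 256])
```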
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.