Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment
- URL: http://arxiv.org/abs/2307.03296v2
- Date: Wed, 20 Mar 2024 19:08:06 GMT
- Title: Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment
- Authors: Aref Farhadipour, Hadi Veisi,
- Abstract summary: Dysarthria is a disability that causes a disturbance in the human speech system.
We introduce gammatonegram as an effective method to represent audio files with discriminative details.
We convert each speech file into an image and propose image recognition system to classify speech in different scenarios.
- Score: 1.0359008237358598
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dysarthria is a disability that causes a disturbance in the human speech system and reduces the quality and intelligibility of a person's speech. Because of this effect, the normal speech processing systems can not work properly on impaired speech. This disability is usually associated with physical disabilities. Therefore, designing a system that can perform some tasks by receiving voice commands in the smart home can be a significant achievement. In this work, we introduce gammatonegram as an effective method to represent audio files with discriminative details, which is used as input for the convolutional neural network. On the other word, we convert each speech file into an image and propose image recognition system to classify speech in different scenarios. Proposed CNN is based on the transfer learning method on the pre-trained Alexnet. In this research, the efficiency of the proposed system for speech recognition, speaker identification, and intelligibility assessment is evaluated. According to the results on the UA dataset, the proposed speech recognition system achieved 91.29% accuracy in speaker-dependent mode, the speaker identification system acquired 87.74% accuracy in text-dependent mode, and the intelligibility assessment system achieved 96.47% accuracy in two-class mode. Finally, we propose a multi-network speech recognition system that works fully automatically. This system is located in a cascade arrangement with the two-class intelligibility assessment system, and the output of this system activates each one of the speech recognition networks. This architecture achieves an accuracy of 92.3% WRR. The source code of this paper is available.
Related papers
- L2 proficiency assessment using self-supervised speech representations [35.70742768910494]
This work extends the initial analysis conducted on a self-supervised speech representation based scheme, requiring no speech recognition, to a large scale proficiency test.
The performance of the self-supervised, wav2vec 2.0, system is compared to a high performance hand-crafted assessment system and a BERT-based text system.
Though the wav2vec 2.0 based system is found to be sensitive to the nature of the response, it can be configured to yield comparable performance to systems requiring a speech transcription.
arXiv Detail & Related papers (2022-11-16T11:47:20Z) - Text Independent Speaker Identification System for Access Control [0.0]
Even human intelligence system fails to offer 100% accuracy in identifying speeches from a specific individual.
This paper presents a text-independent speaker identification system that employs Mel Frequency Cepstral Coefficients (MFCC) for feature extraction and k-Nearest Neighbor (kNN) for classification.
arXiv Detail & Related papers (2022-09-26T14:42:18Z) - Audio-visual multi-channel speech separation, dereverberation and
recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z) - Automatic Speech recognition for Speech Assessment of Preschool Children [4.554894288663752]
The acoustic and linguistic features of preschool speech are investigated in this study.
Wav2Vec 2.0 is a paradigm that could be used to build a robust end-to-end speech recognition system.
arXiv Detail & Related papers (2022-03-24T07:15:24Z) - Recent Progress in the CUHK Dysarthric Speech Recognition System [66.69024814159447]
Disordered speech presents a wide spectrum of challenges to current data intensive deep neural networks (DNNs) based automatic speech recognition technologies.
This paper presents recent research efforts at the Chinese University of Hong Kong to improve the performance of disordered speech recognition systems.
arXiv Detail & Related papers (2022-01-15T13:02:40Z) - Analysis and Tuning of a Voice Assistant System for Dysfluent Speech [7.233685721929227]
Speech recognition systems do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks.
We show that by tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders.
arXiv Detail & Related papers (2021-06-18T20:58:34Z) - Streaming Multi-talker Speech Recognition with Joint Speaker
Identification [77.46617674133556]
SURIT employs the recurrent neural network transducer (RNN-T) as the backbone for both speech recognition and speaker identification.
We validate our idea on the Librispeech dataset -- a multi-talker dataset derived from Librispeech, and present encouraging results.
arXiv Detail & Related papers (2021-04-05T18:37:33Z) - Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource
End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER)
arXiv Detail & Related papers (2021-03-12T10:10:13Z) - Speaker De-identification System using Autoencoders and Adversarial
Training [58.720142291102135]
We propose a speaker de-identification system based on adversarial training and autoencoders.
Experimental results show that combining adversarial learning and autoencoders increase the equal error rate of a speaker verification system.
arXiv Detail & Related papers (2020-11-09T19:22:05Z) - A Machine of Few Words -- Interactive Speaker Recognition with
Reinforcement Learning [35.36769027019856]
We present a new paradigm for automatic speaker recognition that we call Interactive Speaker Recognition (ISR)
In this paradigm, the recognition system aims to incrementally build a representation of the speakers by requesting personalized utterances.
We show that our method achieves excellent performance while using little speech signal amounts.
arXiv Detail & Related papers (2020-08-07T12:44:08Z) - Speaker Diarization with Lexical Information [59.983797884955]
This work presents a novel approach for speaker diarization to leverage lexical information provided by automatic speech recognition.
We propose a speaker diarization system that can incorporate word-level speaker turn probabilities with speaker embeddings into a speaker clustering process to improve the overall diarization accuracy.
arXiv Detail & Related papers (2020-04-13T17:16:56Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.