Enhancing dysarthria speech feature representation with empirical mode decomposition and Walsh-Hadamard transform
- URL: http://arxiv.org/abs/2401.00225v1
- Date: Sat, 30 Dec 2023 13:25:26 GMT
- Title: Enhancing dysarthria speech feature representation with empirical mode decomposition and Walsh-Hadamard transform
- Authors: Ting Zhu, Shufei Duan, Camille Dingam, Huizhi Liang, Wei Zhang
- Abstract summary: We propose a feature enhancement for dysarthria speech called WHFEMD.
It combines empirical mode decomposition (EMD) and fast Walsh-Hadamard transform (FWHT) to enhance features.
- Score: 8.032273183441921
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Dysarthric speech carries pathological characteristics of the vocal tract and vocal folds, but these have not yet been included in traditional acoustic feature sets. Moreover, the nonlinearity and non-stationarity of speech have largely been ignored. In this paper, we propose a feature enhancement algorithm for dysarthric speech called WHFEMD. It combines empirical mode decomposition (EMD) and the fast Walsh-Hadamard transform (FWHT) to enhance features. In the proposed algorithm, a fast Fourier transform of the dysarthric speech is performed first, followed by EMD to obtain intrinsic mode functions (IMFs). FWHT is then used to output new coefficients and to extract statistical features based on the IMFs, power spectral density, and enhanced gammatone frequency cepstral coefficients. To evaluate the proposed approach, we conducted experiments on two public pathological speech databases, UA Speech and TORGO. The results show that our algorithm outperformed traditional features in classification, with improvements of 13.8% (UA Speech) and 3.84% (TORGO), respectively. Furthermore, incorporating an imbalanced classification algorithm to address data imbalance yielded a 12.18% increase in recognition accuracy. The algorithm effectively addresses the challenges of imbalanced datasets and the non-linearity of dysarthric speech, while providing a robust representation of the local pathological features of the vocal folds and tract.
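The pipeline described in the abstract (FFT, then EMD into IMFs, then FWHT and statistics per IMF) can be sketched as below. This is an illustrative reconstruction, not the authors' implementation: the EMD step is passed in as a callable (e.g. PyEMD's `EMD`, an assumed external dependency), the FWHT is a standard in-place butterfly, and the statistical descriptors (`mean`, `std`, `energy`) are generic stand-ins for the paper's feature set.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized, butterfly form).
    Input length must be a power of two."""
    a = np.array(x, dtype=float)
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):
            for j in range(i, i + h):
                # Butterfly: sum and difference of paired elements.
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

def imf_statistics(coeffs):
    """Generic statistical descriptors of one transformed IMF."""
    return {
        "mean": float(np.mean(coeffs)),
        "std": float(np.std(coeffs)),
        "energy": float(np.sum(coeffs ** 2)),
    }

def whfemd_features(signal, emd):
    """Sketch of the WHFEMD pipeline: FFT -> EMD -> FWHT -> statistics.
    `emd` is any callable returning a list of IMFs (e.g. PyEMD's EMD);
    it is an assumed dependency, not implemented here."""
    spectrum = np.abs(np.fft.rfft(signal))        # FFT magnitude of the speech
    imfs = emd(spectrum)                          # intrinsic mode functions
    feats = []
    for imf in imfs:
        # Zero-pad each IMF to the next power of two for the FWHT.
        n = 1 << int(np.ceil(np.log2(len(imf))))
        coeffs = fwht(np.pad(imf, (0, n - len(imf))))
        feats.append(imf_statistics(coeffs))
    return feats
```

Applying EMD to the FFT magnitude (rather than the raw waveform) mirrors the order of operations stated in the abstract.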
Related papers
- DiffATR: Diffusion-based Generative Modeling for Audio-Text Retrieval [49.076590578101985]
We present a diffusion-based ATR framework (DiffATR) that generates joint distribution from noise.
Experiments on the AudioCaps and Clotho datasets verify the effectiveness of our approach, with superior performance.
arXiv Detail & Related papers (2024-09-16T06:33:26Z)
- DEFN: Dual-Encoder Fourier Group Harmonics Network for Three-Dimensional Indistinct-Boundary Object Segmentation [6.0920148653974255]
We introduce Defect Injection (SDi) to augment the representational diversity of challenging indistinct-boundary objects within training corpora.
Consequently, we propose the Dual-Encoder Fourier Group Harmonics Network (DEFN) to tailor incorporating noise, amplify detailed feature recognition, and bolster representation across diverse medical imaging scenarios.
arXiv Detail & Related papers (2023-11-01T12:33:04Z)
- Analysis and Detection of Pathological Voice using Glottal Source Features [18.80191660913831]
Glottal source features are extracted using glottal flows estimated with the quasi-closed phase (QCP) glottal inverse filtering method.
We derive mel-frequency cepstral coefficients (MFCCs) from the glottal source waveforms computed by QCP and ZFF.
Analysis of the features revealed that the glottal source contains information that discriminates between normal and pathological voices.
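The MFCC derivation mentioned above is standard (framing, windowing, power spectrum, mel filterbank, log, DCT) and can be sketched with numpy and scipy. The QCP/ZFF glottal inverse filtering itself is not reproduced here; `mfcc` below accepts any waveform, e.g. an already-estimated glottal source, and all constants (frame size, hop, filter counts) are illustrative defaults rather than the cited paper's settings.

```python
import numpy as np
from scipy.fftpack import dct

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sr):
    """Triangular mel filters over the positive-frequency FFT bins."""
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):          # rising slope of the triangle
            fb[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):          # falling slope
            fb[i - 1, k] = (r - k) / max(r - c, 1)
    return fb

def mfcc(wave, sr=16000, n_fft=512, hop=160, n_filters=26, n_ceps=13):
    """MFCCs of a waveform (e.g. a QCP/ZFF glottal source estimate)."""
    frames = np.lib.stride_tricks.sliding_window_view(wave, n_fft)[::hop]
    frames = frames * np.hamming(n_fft)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sr).T
    log_mel = np.log(mel_energy + 1e-10)   # floor avoids log(0)
    return dct(log_mel, type=2, axis=1, norm="ortho")[:, :n_ceps]
```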
arXiv Detail & Related papers (2023-09-25T12:14:25Z)
- Modality-Agnostic Variational Compression of Implicit Neural Representations [96.35492043867104]
We introduce a modality-agnostic neural compression algorithm based on a functional view of data and parameterised as an Implicit Neural Representation (INR).
Bridging the gap between latent coding and sparsity, we obtain compact latent representations non-linearly mapped to a soft gating mechanism.
After obtaining a dataset of such latent representations, we directly optimise the rate/distortion trade-off in a modality-agnostic space using neural compression.
arXiv Detail & Related papers (2023-01-23T15:22:42Z)
- Exploiting Cross-domain And Cross-Lingual Ultrasound Tongue Imaging Features For Elderly And Dysarthric Speech Recognition [55.25565305101314]
Articulatory features are invariant to acoustic signal distortion and have been successfully incorporated into automatic speech recognition systems.
This paper presents a cross-domain and cross-lingual A2A inversion approach that utilizes the parallel audio and ultrasound tongue imaging (UTI) data of the 24-hour TaL corpus in A2A model pre-training.
Experiments conducted on three tasks suggest that incorporating the generated articulatory features consistently outperforms the baseline TDNN and Conformer ASR systems.
arXiv Detail & Related papers (2022-06-15T07:20:28Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Gated Recurrent Fusion with Joint Training Framework for Robust End-to-End Speech Recognition [64.9317368575585]
This paper proposes a gated recurrent fusion (GRF) method with joint training framework for robust end-to-end ASR.
The GRF algorithm is used to dynamically combine the noisy and enhanced features.
The proposed method achieves a relative character error rate (CER) reduction of 10.04% over the conventional joint enhancement and transformer method.
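The core idea of dynamically combining noisy and enhanced feature streams can be illustrated with a simple per-dimension gate. This is a generic gated-fusion sketch, not the cited paper's recurrent GRF cell: `W` and `b` stand in for trained parameters, and the gate here is a single sigmoid layer over the concatenated streams.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gated_fusion(x_noisy, x_enhanced, W, b):
    """Illustrative gated fusion of noisy and enhanced feature vectors.
    A gate g in (0, 1), computed from both streams, decides per dimension
    how much of each stream to keep; W and b stand in for learned weights."""
    g = sigmoid(W @ np.concatenate([x_noisy, x_enhanced]) + b)
    return g * x_noisy + (1.0 - g) * x_enhanced
```

With zero weights the gate is 0.5 everywhere and the fusion reduces to a plain average of the two streams; training moves the gate toward whichever stream is more reliable per dimension.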
arXiv Detail & Related papers (2020-11-09T08:52:05Z)
- A Comparative Re-Assessment of Feature Extractors for Deep Speaker Embeddings [18.684888457998284]
We provide extensive re-assessment of 14 feature extractors on VoxCeleb and SITW datasets.
Our findings reveal that features equipped with techniques such as spectral centroids, group delay function, and integrated noise suppression provide promising alternatives to MFCCs for deep speaker embeddings extraction.
arXiv Detail & Related papers (2020-07-30T07:55:58Z)
- Glottal source estimation robustness: A comparison of sensitivity of voice source estimation techniques [11.97036509133719]
This paper addresses the problem of estimating the voice source directly from speech waveforms.
A novel principle based on Anticausality Dominated Regions (ACDR) is used to estimate the glottal open phase.
arXiv Detail & Related papers (2020-05-24T08:13:47Z)
- Simultaneous Denoising and Dereverberation Using Deep Embedding Features [64.58693911070228]
We propose a joint training method for simultaneous speech denoising and dereverberation using deep embedding features.
At the denoising stage, the DC network is leveraged to extract noise-free deep embedding features.
At the dereverberation stage, instead of using the unsupervised K-means clustering algorithm, another neural network is utilized to estimate the anechoic speech.
arXiv Detail & Related papers (2020-04-06T06:34:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.