Frequency-centroid features for word recognition of non-native English
speakers
- URL: http://arxiv.org/abs/2206.07176v1
- Date: Tue, 14 Jun 2022 21:19:49 GMT
- Title: Frequency-centroid features for word recognition of non-native English
speakers
- Authors: Pierre Berjon, Rajib Sharma, Avishek Nag, and Soumyabrata Dev
- Abstract summary: The aim of this work is to investigate complementary features which can aid the quintessential Mel frequency cepstral coefficients (MFCCs).
The frequency-centroids (FCs) encapsulate the spectral centres of the different bands of the speech spectrum, with the bands defined by the Mel filterbank.
A two-stage Convolutional Neural Network (CNN) is used to model the features of the English words uttered with Arabic, French and Spanish accents.
- Score: 1.9249287163937974
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The objective of this work is to investigate complementary features which can
aid the quintessential Mel frequency cepstral coefficients (MFCCs) in the task
of closed, limited set word recognition for non-native English speakers of
different mother-tongues. Unlike the MFCCs, which are derived from the spectral
energy of the speech signal, the proposed frequency-centroids (FCs) encapsulate
the spectral centres of the different bands of the speech spectrum, with the
bands defined by the Mel filterbank. These features, in combination with the
MFCCs, are observed to provide a relative performance improvement in English word
recognition, particularly under varied noisy conditions. A two-stage
Convolutional Neural Network (CNN) is used to model the features of the English
words uttered with Arabic, French and Spanish accents.
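The FC construction described above lends itself to a short sketch. Below is one hedged reading of the idea in Python: the spectral centroid of each Mel band is computed from an STFT power spectrum and stacked with MFCCs. The parameter choices (FFT size, hop length, number of bands) and the exact normalisation are assumptions, not the paper's recipe.
```python
import numpy as np
import librosa

def frequency_centroids(y, sr, n_fft=512, hop_length=160, n_mels=26):
    """Per-frame spectral centroid (in Hz) of each Mel filterbank band.

    A hedged reading of the paper's FC features; the exact weighting and
    normalisation used by the authors are assumptions here.
    """
    S = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length)) ** 2
    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)  # (n_mels, n_bins)
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)              # bin centres in Hz
    num = (mel_fb * freqs[np.newaxis, :]) @ S   # frequency-weighted band energy
    den = mel_fb @ S + 1e-10                    # plain band energy
    return num / den                            # (n_mels, n_frames) centroids in Hz

# Toy usage: stack FCs with MFCCs, as the paper combines the two feature sets.
y = librosa.tone(440, sr=16000, duration=1.0)
mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13, n_fft=512, hop_length=160)
fc = frequency_centroids(y, sr=16000)
features = np.vstack([mfcc, fc])                # input to a downstream classifier
```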
Related papers
- Advanced Clustering Techniques for Speech Signal Enhancement: A Review and Metanalysis of Fuzzy C-Means, K-Means, and Kernel Fuzzy C-Means Methods [0.6530047924748276]
Speech signal processing is tasked with improving the clarity and comprehensibility of audio data in noisy environments.
The quality of speech recognition directly impacts user experience and accessibility in technology-driven communication.
This review paper explores advanced clustering techniques, particularly focusing on the Kernel Fuzzy C-Means (KFCM) method.
arXiv Detail & Related papers (2024-09-28T20:21:05Z)
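For context on the clustering family the review above covers, here is a minimal classical Fuzzy C-Means sketch in plain NumPy; KFCM, the review's focus, replaces the Euclidean distance with a kernel-induced one, which is not shown here.
```python
import numpy as np

def fuzzy_c_means(X, n_clusters=3, m=2.0, n_iter=100, seed=0):
    """Classical Fuzzy C-Means. X: (n_samples, n_features).
    KFCM swaps the Euclidean distance below for a kernel-induced distance."""
    rng = np.random.default_rng(seed)
    U = rng.random((len(X), n_clusters))
    U /= U.sum(axis=1, keepdims=True)            # random fuzzy memberships
    for _ in range(n_iter):
        W = U ** m                               # fuzzified memberships
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-10
        U = 1.0 / dist ** (2.0 / (m - 1.0))      # inverse-distance membership update
        U /= U.sum(axis=1, keepdims=True)
    return centers, U

centers, memberships = fuzzy_c_means(np.random.default_rng(1).normal(size=(200, 2)))
```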
- Explaining Spectrograms in Machine Learning: A Study on Neural Networks for Speech Classification [2.4472308031704073]
This study investigates discriminative patterns learned by neural networks for accurate speech classification.
By examining the activations and features of neural networks for vowel classification, we gain insights into what the networks "see" in spectrograms.
arXiv Detail & Related papers (2024-07-10T07:37:18Z)
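The kind of inspection that study performs can be reproduced with forward hooks; the toy CNN below is purely illustrative and not the authors' model.
```python
import torch
import torch.nn as nn

# Hypothetical toy classifier over (1, freq, time) log-spectrograms; the
# forward-hook pattern is a standard way to capture the activations that
# such interpretability studies visualise.
net = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 5),
)
activations = {}
net[1].register_forward_hook(
    lambda module, inputs, output: activations.update(relu1=output.detach())
)
dummy_spec = torch.randn(1, 1, 64, 100)   # stand-in log-spectrogram batch
logits = net(dummy_spec)
print(activations["relu1"].shape)         # torch.Size([1, 8, 64, 100])
```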
- End-to-End User-Defined Keyword Spotting using Shifted Delta Coefficients [6.626696929949397]
We propose to use shifted delta coefficients (SDC) which help in capturing pronunciation variability.
The proposed approach demonstrated superior performance when compared to state-of-the-art UDKWS techniques.
arXiv Detail & Related papers (2024-05-23T12:24:01Z)
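SDC features have a standard N-d-P-k construction: k delta vectors, each with spread d, sampled every P frames and stacked. A minimal NumPy version follows; the parameter defaults are common language-ID conventions, not necessarily the paper's.
```python
import numpy as np

def sdc(cepstra, d=1, p=3, k=7):
    """Shifted Delta Coefficients in the usual N-d-P-k scheme.
    cepstra: (n_frames, n_coeffs); returns (n_frames, n_coeffs * k)."""
    n_frames, _ = cepstra.shape
    pad = d + (k - 1) * p                       # frames needed beyond each edge
    c = np.pad(cepstra, ((pad, pad), (0, 0)), mode="edge")
    blocks = []
    for i in range(k):                          # k delta vectors per frame...
        off = i * p                             # ...each shifted by i * p frames
        plus = c[pad + off + d : pad + off + d + n_frames]
        minus = c[pad + off - d : pad + off - d + n_frames]
        blocks.append(plus - minus)             # delta with spread d at this shift
    return np.hstack(blocks)

feats = sdc(np.random.default_rng(0).normal(size=(100, 13)))  # -> (100, 91)
```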
- Spiking-LEAF: A Learnable Auditory front-end for Spiking Neural Networks [53.31894108974566]
Spiking-LEAF is a learnable auditory front-end meticulously designed for SNN-based speech processing.
On keyword spotting and speaker identification tasks, the proposed Spiking-LEAF outperforms SOTA spiking auditory front-ends.
arXiv Detail & Related papers (2023-09-18T04:03:05Z)
- Incorporating Class-based Language Model for Named Entity Recognition in Factorized Neural Transducer [50.572974726351504]
We propose C-FNT, a novel E2E model that incorporates class-based LMs into FNT.
In C-FNT, the LM score of named entities can be associated with the name class instead of its surface form.
The experimental results show that our proposed C-FNT significantly reduces error in named entities without hurting performance in general word recognition.
arXiv Detail & Related papers (2023-09-14T12:14:49Z)
- Audio-visual multi-channel speech separation, dereverberation and recognition [70.34433820322323]
This paper proposes an audio-visual multi-channel speech separation, dereverberation and recognition approach.
The advantage of the additional visual modality over using audio only is demonstrated on two neural dereverberation approaches.
Experiments conducted on the LRS2 dataset suggest that the proposed audio-visual multi-channel speech separation, dereverberation and recognition system outperforms the baseline.
arXiv Detail & Related papers (2022-04-05T04:16:03Z)
- MFA: TDNN with Multi-scale Frequency-channel Attention for Text-independent Speaker Verification with Short Utterances [94.70787497137854]
We propose a multi-scale frequency-channel attention (MFA) to characterize speakers at different scales through a novel dual-path design which consists of a convolutional neural network and TDNN.
We evaluate the proposed MFA on the VoxCeleb database and observe that the proposed framework with MFA can achieve state-of-the-art performance while reducing parameters and complexity.
arXiv Detail & Related papers (2022-02-03T14:57:05Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Disordered speech differs from normal speech at the spectro-temporal level, manifesting systematically in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies. Motivated by this, novel spectro-temporal subspace basis embedding deep features, derived by SVD decomposition of the speech spectrum, are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-Vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER), with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
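The subspace idea above reduces to an SVD of the (log) spectrum; the sketch below keeps the leading singular vectors as spectral and temporal bases. The Mel front end and rank r are illustrative assumptions, not the paper's exact configuration.
```python
import numpy as np
import librosa

# Hedged sketch: SVD of a log Mel spectrogram; the top-r left/right singular
# vectors act as spectral and temporal subspace bases.
y = librosa.tone(220, sr=16000, duration=1.0)
S = librosa.feature.melspectrogram(y=y, sr=16000, n_mels=40)
log_S = np.log(S + 1e-10)                         # (n_mels, n_frames)
U, s, Vt = np.linalg.svd(log_S, full_matrices=False)
r = 4
spectral_basis = U[:, :r]                         # (n_mels, r) spectral subspace
temporal_basis = Vt[:r, :]                        # (r, n_frames) temporal subspace
embedding = np.concatenate([spectral_basis.ravel(), s[:r]])  # one possible feature vector
```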
- Vowel-based Meeteilon dialect identification using a Random Forest classifier [0.0]
A vowel dataset is created using the Meeteilon Speech Corpora available at the Linguistic Data Consortium for Indian Languages (LDC-IL).
Spectral features such as formant frequencies (F1, F2 and F3) and prosodic features such as pitch (F0), energy, intensity and segment duration are extracted from monophthong vowel sounds.
Random forest, a decision-tree-based ensemble algorithm, is used to classify three major dialects of Meeteilon, namely Imphal, Kakching and Sekmai.
arXiv Detail & Related papers (2021-07-26T04:09:00Z)
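A minimal scikit-learn sketch of that pipeline, with random stand-ins for the LDC-IL vowel measurements (the feature columns mirror the ones named above; the data here is synthetic):
```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 7 columns for F0, F1-F3, energy, intensity, duration.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 7))
y = rng.integers(0, 3, size=300)              # 0=Imphal, 1=Kakching, 2=Sekmai
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(f"held-out accuracy: {clf.score(X_te, y_te):.2f}")
```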
- On the Use of Audio Fingerprinting Features for Speech Enhancement with Generative Adversarial Network [24.287237963000745]
Time-frequency domain features, such as the Short-Term Fourier Transform (STFT) and Mel-Frequency Cepstral Coefficients (MFCC) are preferred in many approaches.
While the MFCCs provide a compact representation, they ignore the dynamics and distribution of energy in each mel-scale subband.
In this work, a speech enhancement system based on a Generative Adversarial Network (GAN) is implemented and tested with a combination of audio fingerprinting (AFP) features and the Normalized Spectral Subband Centroids (NSSC).
arXiv Detail & Related papers (2020-07-27T00:44:16Z)
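Subband centroids of this kind can be sketched directly; below is one hedged reading of NSSC, where uniform subbands, a power exponent gamma, and normalisation by the Nyquist frequency are assumptions rather than that paper's exact definition.
```python
import numpy as np

def nssc(power_spec, sr, n_subbands=16, gamma=1.0):
    """Normalised Spectral Subband Centroids over uniform subbands.
    power_spec: (n_bins, n_frames) power spectrogram."""
    n_bins = power_spec.shape[0]
    freqs = np.linspace(0.0, sr / 2.0, n_bins)            # approximate bin centres
    edges = np.linspace(0, n_bins, n_subbands + 1, dtype=int)
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        p = power_spec[lo:hi] ** gamma
        c = (freqs[lo:hi, None] * p).sum(axis=0) / (p.sum(axis=0) + 1e-10)
        centroids.append(c / (sr / 2.0))                  # scale into [0, 1]
    return np.stack(centroids)                            # (n_subbands, n_frames)

spec = np.abs(np.fft.rfft(np.random.default_rng(0).normal(size=512)))[:, None] ** 2
print(nssc(spec, sr=16000).shape)                         # (16, 1)
```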
- The Secret is in the Spectra: Predicting Cross-lingual Task Performance with Spectral Similarity Measures [83.53361353172261]
We present a large-scale study focused on the correlations between monolingual embedding space similarity and task performance.
We introduce several isomorphism measures between two embedding spaces, based on the relevant statistics of their individual spectra.
We empirically show that language similarity scores derived from such spectral isomorphism measures are strongly associated with performance observed in different cross-lingual tasks.
arXiv Detail & Related papers (2020-01-30T00:09:53Z)
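One simple instance of such a measure can be written in a few lines: compare the normalised top-k singular value spectra of two embedding matrices. The specific statistic below is an illustrative assumption, not that paper's exact measure.
```python
import numpy as np

def spectral_similarity(E1, E2, k=20):
    """Compare the top-k singular value spectra of two (vocab, dim)
    embedding matrices; higher return value means more similar spectra.
    This statistic is an assumption in the spirit of the paper."""
    s1 = np.linalg.svd(E1 - E1.mean(0), compute_uv=False)[:k]
    s2 = np.linalg.svd(E2 - E2.mean(0), compute_uv=False)[:k]
    s1, s2 = s1 / s1.sum(), s2 / s2.sum()      # normalise the spectra
    return -np.linalg.norm(s1 - s2)            # negative distance as similarity

rng = np.random.default_rng(0)
print(spectral_similarity(rng.normal(size=(1000, 300)), rng.normal(size=(1000, 300))))
```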
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.