Quartered Chirp Spectral Envelope for Whispered vs Normal Speech Classification
- URL: http://arxiv.org/abs/2408.14777v1
- Date: Tue, 27 Aug 2024 04:56:22 GMT
- Title: Quartered Chirp Spectral Envelope for Whispered vs Normal Speech Classification
- Authors: S. Johanan Joysingh, P. Vijayalakshmi, T. Nagarajan
- Abstract summary: We propose a new feature named the quartered chirp spectral envelope to classify whispered and normal speech.
This feature is used to train a one-dimensional convolutional neural network that captures the trends in the spectral envelope.
The proposed system performs better than the state of the art in the presence of white noise.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Whispered speech as an acceptable form of human-computer interaction is gaining traction. Systems that address multiple modes of speech require a robust front-end speech classifier. The performance of whispered vs normal speech classification drops in the presence of additive white Gaussian noise, since normal speech takes on some of the characteristics of whispered speech. In this work, we propose a new feature named the quartered chirp spectral envelope, a combination of the chirp spectrum and the quartered spectral envelope, to classify whispered and normal speech. The chirp spectrum can be fine-tuned to obtain customized features for a given task, and the quartered spectral envelope has been proven to work especially well for the current task. This feature is used to train a one-dimensional convolutional neural network that captures the trends in the spectral envelope. The proposed system performs better than the state of the art in the presence of white noise.
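The abstract combines a tunable chirp spectrum with a quartered spectral envelope but does not give the exact formulation. Below is a minimal Python sketch, assuming framewise analysis via scipy's chirp z-transform (`czt`), where the contour radius stands in for the tunable chirp parameter and "quartering" is taken as keeping the lowest quarter of the magnitude spectrum; the frame length, hop, and quartering scheme are illustrative assumptions, not the authors' implementation.

```python
import numpy as np
from scipy.signal import czt, get_window

def quartered_chirp_spectral_envelope(x, n_fft=1024, win_len=400, hop=160, radius=1.0):
    """Framewise chirp spectrum; keep the lowest quarter of the magnitude bins."""
    window = get_window("hamming", win_len)
    feats = []
    for start in range(0, len(x) - win_len + 1, hop):
        frame = x[start:start + win_len] * window
        # Chirp z-transform along a spiral contour of the given radius;
        # radius=1.0 reduces to the ordinary DFT, and tuning it off the unit
        # circle is one way to obtain a task-specific "chirp" spectrum.
        spectrum = czt(frame, m=n_fft, w=radius * np.exp(-2j * np.pi / n_fft))
        # "Quartered": retain only the lowest quarter of the spectrum, where
        # pitch harmonics (present in normal, absent in whispered speech) lie.
        feats.append(np.abs(spectrum[: n_fft // 4]))
    return np.stack(feats)  # shape: (num_frames, n_fft // 4)
```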
Related papers
- Quartered Spectral Envelope and 1D-CNN-based Classification of Normally Phonated and Whispered Speech
The presence of pitch and pitch harmonics in normal speech, and their absence in whispered speech, is evident in the spectral envelope of the Fourier transform.
We propose the use of one-dimensional convolutional neural networks (1D-CNN) to capture these features; a minimal sketch follows this entry.
The system yields an accuracy of 99.31% when trained and tested on the wTIMIT dataset, and 100% on the CHAINS dataset.
arXiv Detail & Related papers (2024-08-25T07:17:11Z)
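Both whispered-speech papers above train a 1D-CNN on an envelope feature. Here is a minimal PyTorch sketch for two-class (whispered vs normal) classification of per-frame envelopes; the layer counts, kernel widths, and input size are illustrative assumptions, not the published topology.

```python
import torch
import torch.nn as nn

class EnvelopeCNN(nn.Module):
    def __init__(self, n_bins=128, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=5, padding=2),  # convolve along frequency bins
            nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(16, 32, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        self.classifier = nn.Linear(32 * (n_bins // 4), n_classes)

    def forward(self, x):
        x = x.unsqueeze(1)          # (batch, n_bins) -> (batch, 1, n_bins)
        x = self.features(x)        # two conv/pool stages halve n_bins twice
        return self.classifier(x.flatten(1))

# Usage: logits = EnvelopeCNN(n_bins=128)(torch.randn(8, 128))
```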
- Learning Separable Hidden Unit Contributions for Speaker-Adaptive Lip-Reading
A speaker's own characteristics can always be portrayed well by a few of his or her facial images, or even a single image, using shallow networks, while fine-grained dynamic features associated with the speech content expressed by the talking face require deep sequential networks.
Our approach consistently outperforms existing methods.
arXiv Detail & Related papers (2023-10-08T07:48:25Z)
- Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering
We propose speaker-invariant clustering (Spin) as a novel self-supervised learning method.
Spin disentangles speaker information and preserves content representations with just 45 minutes of fine-tuning on a single GPU.
arXiv Detail & Related papers (2023-05-18T15:59:36Z)
- Speaker Adaptation Using Spectro-Temporal Deep Features for Dysarthric and Elderly Speech Recognition
Speaker adaptation techniques play a key role in personalization of ASR systems for such users.
Motivated by the spectro-temporal level differences between dysarthric, elderly, and normal speech, novel spectro-temporal subspace basis deep embedding features are derived using SVD of the speech spectrum.
arXiv Detail & Related papers (2022-02-21T15:11:36Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition
Motivated by the spectro-temporal level differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest that the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER), with or without data augmentation; a sketch of such SVD-based subspace features follows this entry.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
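The two entries above derive subspace basis features by SVD of the speech spectrum and feed them into deep embedding networks. The sketch below shows only the basic decomposition/projection step, assuming a magnitude spectrogram input and an illustrative rank k; the full deep-embedding pipeline of the papers is not reproduced.

```python
import numpy as np

def svd_subspace_features(spectrogram, k=8):
    """spectrogram: (n_freq, n_frames) magnitude spectrogram."""
    # Economy SVD: columns of U are spectral bases, rows of Vh are temporal bases.
    U, s, Vh = np.linalg.svd(spectrogram, full_matrices=False)
    spectral_basis = U[:, :k]        # (n_freq, k) dominant spectral shapes
    temporal_basis = Vh[:k, :]       # (k, n_frames) their activations over time
    # Project each frame onto the top-k spectral bases -> compact embedding.
    embedding = spectral_basis.T @ spectrogram   # (k, n_frames)
    return spectral_basis, temporal_basis, embedding
```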
- Learning spectro-temporal representations of complex sounds with parameterized neural networks
We propose a parameterized neural network layer that computes specific spectro-temporal modulations based on Gabor kernels (Learnable STRFs); a minimal sketch follows this entry.
We evaluated the predictive capabilities of this layer on Speech Activity Detection, Speaker Verification, Urban Sound Classification, and Zebra Finch Call Type Classification.
As this layer is fully interpretable, we used quantitative measures to describe the distribution of the learned spectro-temporal modulations.
arXiv Detail & Related papers (2021-03-12T07:53:47Z)
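In the spirit of the Learnable STRFs above, here is a minimal sketch of a Gabor-parameterized spectro-temporal layer: each kernel is a 2D Gabor over (time, frequency) whose modulation frequencies and envelope widths are the learnable parameters. The exact parameterization in the paper may differ; kernel count and size are assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaborSTRF(nn.Module):
    def __init__(self, n_kernels=16, kernel_size=(11, 11)):
        super().__init__()
        self.kernel_size = kernel_size
        # Learnable modulation frequencies (temporal, spectral) and Gaussian widths.
        self.freqs = nn.Parameter(torch.rand(n_kernels, 2) * math.pi / 2)
        self.sigmas = nn.Parameter(torch.full((n_kernels, 2), 3.0))

    def _kernels(self):
        kt, kf = self.kernel_size
        t = torch.arange(kt, dtype=torch.float32) - kt // 2
        f = torch.arange(kf, dtype=torch.float32) - kf // 2
        tt, ff = torch.meshgrid(t, f, indexing="ij")   # (kt, kf) grids
        tt, ff = tt[None], ff[None]                    # broadcast over kernels
        # Gaussian envelope times a cosine carrier = 2D Gabor kernel.
        env = torch.exp(-0.5 * ((tt / self.sigmas[:, 0, None, None]) ** 2
                                + (ff / self.sigmas[:, 1, None, None]) ** 2))
        carrier = torch.cos(self.freqs[:, 0, None, None] * tt
                            + self.freqs[:, 1, None, None] * ff)
        return (env * carrier).unsqueeze(1)            # (n_kernels, 1, kt, kf)

    def forward(self, spec):
        # spec: (batch, 1, time, freq), e.g. a log-mel spectrogram.
        kt, kf = self.kernel_size
        return F.conv2d(spec, self._kernels(), padding=(kt // 2, kf // 2))
```

Because the kernels are generated from a handful of named parameters, the learned modulation frequencies remain directly inspectable, which is what makes the layer interpretable.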
- Few Shot Adaptive Normalization Driven Multi-Speaker Speech Synthesis
We present a novel few-shot multi-speaker speech synthesis approach (FSM-SS).
Given an input text and a reference speech sample of an unseen person, FSM-SS can generate speech in that person's style in a few-shot manner.
We demonstrate how the affine parameters of normalization help capture prosodic features such as energy and fundamental frequency; a conditioning sketch follows this entry.
arXiv Detail & Related papers (2020-12-14T04:37:07Z)
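A minimal sketch of style conditioning via predicted affine normalization parameters (AdaIN-style), illustrating how a learned scale and shift can carry prosodic style from a reference embedding. The actual FSM-SS architecture is more involved; the module name and reference-embedding interface here are placeholder assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveNorm(nn.Module):
    """Normalize content features, then apply affine params from a reference embedding."""
    def __init__(self, channels, ref_dim):
        super().__init__()
        self.norm = nn.InstanceNorm1d(channels, affine=False)
        self.to_gamma_beta = nn.Linear(ref_dim, 2 * channels)

    def forward(self, content, ref_embedding):
        # content: (batch, channels, time); ref_embedding: (batch, ref_dim)
        gamma, beta = self.to_gamma_beta(ref_embedding).chunk(2, dim=-1)
        normed = self.norm(content)
        # The predicted scale (gamma) and shift (beta) modulate the normalized
        # features, carrying style such as energy and pitch-like contours.
        return normed * (1 + gamma.unsqueeze(-1)) + beta.unsqueeze(-1)
```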
- End-to-end Whispered Speech Recognition with Frequency-weighted Approaches and Pseudo Whisper Pre-training
We present several approaches for end-to-end (E2E) recognition of whispered speech.
We achieve an overall relative reduction of 19.8% in PER and 44.4% in CER on a relatively small whispered TIMIT corpus.
As long as we have a good E2E model pre-trained on normal or pseudo-whispered speech, a relatively small set of whispered speech may suffice to obtain a reasonably good E2E whispered speech recognizer.
arXiv Detail & Related papers (2020-05-05T07:08:53Z)
- Improving speaker discrimination of target speech extraction with time-domain SpeakerBeam
SpeakerBeam exploits an adaptation utterance of the target speaker to extract his/her voice characteristics.
SpeakerBeam sometimes fails when speakers have similar voice characteristics, such as in same-gender mixtures.
We show experimentally that these strategies greatly improve speech extraction performance, especially for same-gender mixtures.
arXiv Detail & Related papers (2020-01-23T05:36:06Z)