Phase-based Information for Voice Pathology Detection
- URL: http://arxiv.org/abs/2001.00372v1
- Date: Thu, 2 Jan 2020 09:51:51 GMT
- Title: Phase-based Information for Voice Pathology Detection
- Authors: Thomas Drugman, Thomas Dubuisson, Thierry Dutoit
- Abstract summary: This paper investigates the potential of using phase-based features for automatically detecting voice disorders.
It is shown that group delay functions are appropriate for characterizing irregularities in phonation.
- Score: 11.481208551940998
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In most current approaches to speech processing, information is
extracted from the magnitude spectrum. However, recent perceptual studies have
underlined the importance of the phase component. The goal of this paper is to
investigate the potential of using phase-based features for automatically
detecting voice disorders. It is shown that group delay functions are
appropriate for characterizing irregularities in phonation. In addition, the
degree to which speech respects the mixed-phase model is discussed. The
proposed phase-based features are evaluated and compared to other parameters
derived from the magnitude spectrum. The two streams are shown to be
complementary. Furthermore, phase-based features turn out to convey a great
amount of relevant information, leading to high discrimination performance.
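The paper itself does not include code. As a minimal sketch of the central quantity, the following Python computes the group delay function of one windowed frame using the standard DFT identity; the function name and defaults are illustrative, not taken from the paper.

```python
import numpy as np

def group_delay(frame, n_fft=512):
    """Group delay (in samples) of one windowed speech frame.

    Uses the identity tau(w) = Re{X*(w) Y(w)} / |X(w)|^2, where X is the
    DFT of x[n] and Y is the DFT of n * x[n].
    """
    n = np.arange(len(frame))
    X = np.fft.rfft(frame, n_fft)
    Y = np.fft.rfft(n * frame, n_fft)
    eps = 1e-10  # guard against division by ~0 at spectral nulls
    return (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + eps)
```

Applied frame by frame, this yields a time-frequency group delay representation whose irregular patterns the paper associates with pathological phonation.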
Related papers
- DiffPhase: Generative Diffusion-based STFT Phase Retrieval [15.16865739526702]
Diffusion probabilistic models have been recently used in a variety of tasks, including speech enhancement and synthesis.
In this work we build upon previous work in the speech domain, adapting a speech enhancement diffusion model specifically for phase retrieval.
Evaluation using speech quality and intelligibility metrics shows the diffusion approach is well-suited to the phase retrieval task, with performance surpassing both classical and modern methods.
arXiv Detail & Related papers (2022-11-08T15:50:35Z)
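The entry above targets STFT phase retrieval. Purely as a classical point of reference (not the diffusion model the paper proposes), phase can be recovered from a magnitude spectrogram with Griffin-Lim in librosa:

```python
import numpy as np
import librosa

# Load a bundled example clip and discard the phase of its STFT.
y, sr = librosa.load(librosa.ex("trumpet"))
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=256))

# Classical iterative phase retrieval from magnitude only.
y_hat = librosa.griffinlim(S, n_iter=64, hop_length=256, n_fft=1024)
```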
- Audio Deepfake Detection Based on a Combination of F0 Information and Real Plus Imaginary Spectrogram Features [51.924340387119415]
Experimental results on the ASVspoof 2019 LA dataset show that the proposed system is highly effective for audio deepfake detection, achieving an equal error rate (EER) of 0.43%, which surpasses almost all competing systems.
arXiv Detail & Related papers (2022-08-02T02:46:16Z)
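A minimal sketch of the two input streams named in the entry above, assuming "F0 information" means a frame-level pitch track; the file name and analysis parameters are hypothetical:

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # hypothetical input file

# Stream 1: F0 track via pYIN (unvoiced frames come back as NaN).
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr
)

# Stream 2: real plus imaginary spectrogram, stacked as two channels.
stft = librosa.stft(y, n_fft=512, hop_length=128)
real_imag = np.stack([stft.real, stft.imag], axis=0)  # (2, freq, time)
```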
- Deep Neural Convolutive Matrix Factorization for Articulatory Representation Decomposition [48.56414496900755]
This work uses a neural implementation of convolutive sparse matrix factorization to decompose the articulatory data into interpretable gestures and gestural scores.
Phoneme recognition experiments additionally show that gestural scores successfully encode phonological information.
arXiv Detail & Related papers (2022-04-01T14:25:19Z)
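The work above uses a neural implementation of *convolutive* sparse matrix factorization. As a much simpler, non-convolutive stand-in that only illustrates factorizing articulatory data into components ("gestures") and activations ("gestural scores"), plain sparse NMF from scikit-learn:

```python
import numpy as np
from sklearn.decomposition import NMF

# Hypothetical articulatory data: rows = articulator channels, cols = frames.
# NMF requires a non-negative input matrix.
X = np.abs(np.random.randn(12, 500))

model = NMF(n_components=8, init="nndsvd", max_iter=500,
            alpha_H=0.1, l1_ratio=1.0, random_state=0)  # L1 penalty -> sparse scores
W = model.fit_transform(X)  # (12, 8): components, loosely "gestures"
H = model.components_       # (8, 500): activations, loosely "gestural scores"
```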
- Acoustic To Articulatory Speech Inversion Using Multi-Resolution Spectro-Temporal Representations Of Speech Signals [5.743287315640403]
We train a feed-forward deep neural network to estimate articulatory trajectories of six tract variables.
Experiments achieve a correlation of 0.675 with ground-truth tract variables.
arXiv Detail & Related papers (2022-03-11T07:27:42Z)
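A toy sketch of the mapping in the entry above, a feed-forward regressor from per-frame acoustic features to six tract variables; shapes and hyperparameters are assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical training data: 1000 frames of 40-dim acoustic features,
# each paired with 6 tract-variable targets.
X = np.random.randn(1000, 40)
Y = np.random.randn(1000, 6)

net = MLPRegressor(hidden_layer_sizes=(256, 256), max_iter=300, random_state=0)
net.fit(X, Y)            # multi-output regression over the six variables
Y_hat = net.predict(X)   # (1000, 6) estimated articulatory trajectories
```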
- Deep Learning for Hate Speech Detection: A Comparative Study [54.42226495344908]
We present here a large-scale empirical comparison of deep and shallow hate-speech detection methods.
Our goal is to illuminate progress in the area, and identify strengths and weaknesses in the current state-of-the-art.
In doing so we aim to provide guidance as to the use of hate-speech detection in practice, quantify the state-of-the-art, and identify future research directions.
arXiv Detail & Related papers (2022-02-19T03:48:20Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Motivated by the spectro-temporal differences between disordered and normal speech, which systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of the speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER), with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
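A minimal sketch of the core operation in the entry above, an SVD of a speech spectrogram whose leading singular vectors serve as a compact subspace basis; dimensions and the final feature layout are illustrative:

```python
import numpy as np

# Hypothetical log-magnitude spectrogram: (freq bins, time frames).
S = np.random.randn(128, 300)

# Leading left/right singular vectors span spectral and temporal subspaces.
U, s, Vt = np.linalg.svd(S, full_matrices=False)

k = 8
spectral_basis = U[:, :k]   # (128, k) spectral subspace basis
temporal_basis = Vt[:k, :]  # (k, 300) temporal subspace basis
embedding = np.concatenate([spectral_basis.ravel(), s[:k]])  # one possible feature vector
```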
- Deep Learning For Prominence Detection In Children's Read Speech [13.041607703862724]
We present a system that operates on segmented speech waveforms to learn features relevant to prominent word detection for children's oral fluency assessment.
The chosen CRNN (convolutional recurrent neural network) framework, incorporating both word-level features and sequence information, is found to benefit from the perceptually motivated SincNet filters.
arXiv Detail & Related papers (2021-10-27T08:51:42Z)
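SincNet layers, mentioned in the entry above, are band-pass sinc filters whose cutoff frequencies are learned. As a fixed, non-learnable illustration of the filter shape (cutoffs chosen arbitrarily):

```python
import numpy as np

def sinc_bandpass(f1, f2, sr, length=251):
    """Windowed ideal band-pass FIR kernel between f1 and f2 Hz (f2 < sr/2)."""
    n = np.arange(length) - (length - 1) / 2
    lo, hi = f1 / sr, f2 / sr  # cutoffs in cycles per sample
    h = 2 * hi * np.sinc(2 * hi * n) - 2 * lo * np.sinc(2 * lo * n)
    return h * np.hamming(length)  # taper to reduce spectral ripple

# In SincNet the (f1, f2) pairs are learnable; here we just fix one filter.
kernel = sinc_bandpass(300.0, 3400.0, sr=16000)
```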
- An Overview of Deep-Learning-Based Audio-Visual Speech Enhancement and Separation [57.68765353264689]
Speech enhancement and speech separation are two related tasks.
Traditionally, these tasks have been tackled using signal processing and machine learning techniques.
More recently, deep learning has been exploited to achieve strong performance on both.
arXiv Detail & Related papers (2020-08-21T17:24:09Z)
- Identification of primary and collateral tracks in stuttered speech [22.921077940732]
We introduce a new evaluation framework for disfluency detection inspired by the clinical and NLP perspective.
We present a novel forced-aligned disfluency dataset from a corpus of semi-directed interviews.
We show experimentally that word-based span features outperform the baselines for speech-based predictions.
arXiv Detail & Related papers (2020-03-02T16:50:33Z)
- Temporal-Spatial Neural Filter: Direction Informed End-to-End Multi-channel Target Speech Separation [66.46123655365113]
Target speech separation refers to extracting the target speaker's speech from mixed signals.
Two main challenges are the complex acoustic environment and the real-time processing requirement.
We propose a temporal-spatial neural filter, which directly estimates the target speech waveform from the multi-speaker mixture.
arXiv Detail & Related papers (2020-01-02T11:12:50Z)
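The neural filter above is direction-informed. Purely as a classical point of comparison (not the proposed model), a delay-and-sum beamformer steered toward a known direction of arrival, assuming a uniform linear array:

```python
import numpy as np

def delay_and_sum(mics, doa_deg, sr, spacing=0.05, c=343.0):
    """Steer a uniform linear array toward doa_deg and average the channels.

    mics: (n_channels, n_samples) array of time-domain microphone signals.
    """
    n_ch, n_samp = mics.shape
    # Per-channel delay (seconds) of a plane wave arriving from doa_deg.
    delays = np.arange(n_ch) * spacing * np.cos(np.deg2rad(doa_deg)) / c
    freqs = np.fft.rfftfreq(n_samp, d=1.0 / sr)
    out = np.zeros(n_samp)
    for ch in range(n_ch):
        # Compensate each fractional delay with a linear phase shift.
        spec = np.fft.rfft(mics[ch]) * np.exp(2j * np.pi * freqs * delays[ch])
        out += np.fft.irfft(spec, n=n_samp)
    return out / n_ch
```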
- On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection [11.481208551940998]
This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal.
Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody.
arXiv Detail & Related papers (2020-01-02T10:04:37Z)
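A minimal sketch of the kind of measurement in the entry above, estimating how much each feature tells you about the pathology label with scikit-learn's mutual information estimator; the data here are random placeholders:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

# Hypothetical features: rows = utterances; columns could be speech-,
# glottal- or prosody-derived parameters.
X = np.random.randn(200, 10)
y = np.random.randint(0, 2, size=200)  # 0 = normophonic, 1 = pathological

mi = mutual_info_classif(X, y, random_state=0)  # MI of each feature with the label
print(np.round(mi, 3))
```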