I-vector Based Within Speaker Voice Quality Identification on connected speech
- URL: http://arxiv.org/abs/2102.07307v1
- Date: Mon, 15 Feb 2021 02:26:32 GMT
- Title: I-vector Based Within Speaker Voice Quality Identification on connected speech
- Authors: Chuyao Feng, Eva van Leer, Mackenzie Lee Curtis, David V. Anderson
- Abstract summary: Voice disorders affect a large portion of the population, especially heavy voice users such as teachers or call-center workers.
Most voice disorders can be treated with behavioral voice therapy, which teaches patients to replace problematic, habituated voice production mechanics with optimal voice production techniques.
We built two systems that automatically differentiate various voice qualities produced by the same individual.
- Score: 3.2116198597240846
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice disorders affect a large portion of the population, especially heavy
voice users such as teachers or call-center workers. Most voice disorders can
be treated effectively with behavioral voice therapy, which teaches patients to
replace problematic, habituated voice production mechanics with optimal voice
production technique(s), yielding improved voice quality. However, treatment
often fails because patients have difficulty differentiating their habitual
voice from the target technique independently, when clinician feedback is
unavailable between therapy sessions. Therefore, with the long term aim to
extend clinician feedback to extra-clinical settings, we built two systems that
automatically differentiate various voice qualities produced by the same
individual. We hypothesized that 1) a system based on i-vectors could classify
these qualities as if they represent different speakers and 2) such a system
would outperform one based on traditional voice signal processing algorithms.
Training recordings were provided by thirteen amateur actors, each producing five
perceptually different voice qualities in connected speech: normal, breathy,
fry, twang, and hyponasal. As hypothesized, the i-vector system outperformed
the acoustic measure system in classification accuracy (97.5% vs. 77.2%,
respectively). These findings are expected because the i-vector system maps
features to an integrated space which better represents each voice quality than
the 22-feature space of the baseline system. Therefore, an i-vector based
system has potential for clinical application in voice therapy and voice
training.
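To make the first hypothesis concrete, here is a minimal sketch that treats each voice quality of one speaker as a separate "speaker" class. It is not the authors' pipeline: a GMM-UBM supervector followed by a PCA projection stands in for a full i-vector (total variability) extractor, and the file names, ten-takes-per-quality layout, and hyperparameters (20 MFCCs, 64 Gaussians, 40 dimensions) are illustrative assumptions.

```python
# Sketch only: GMM-UBM supervectors + PCA as a stand-in for i-vectors.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.svm import SVC

QUALITIES = ["normal", "breathy", "fry", "twang", "hyponasal"]

def mfcc_frames(path, sr=16000):
    """MFCC frames (one row per frame) for a single recording."""
    y, _ = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T

def supervector(ubm, frames, relevance=16.0):
    """MAP-adapt the UBM means to one utterance and stack them."""
    post = ubm.predict_proba(frames)          # (T, K) responsibilities
    n_k = post.sum(axis=0)                    # soft frame counts per component
    f_k = post.T @ frames                     # first-order stats, (K, D)
    alpha = (n_k / (n_k + relevance))[:, None]
    means = alpha * (f_k / np.maximum(n_k, 1e-8)[:, None]) \
        + (1.0 - alpha) * ubm.means_
    return means.ravel()

# Hypothetical data layout: one speaker, ten takes per voice quality.
train = [(f"spk1_{q}_{i}.wav", q) for q in QUALITIES for i in range(10)]

ubm = GaussianMixture(n_components=64, covariance_type="diag", random_state=0)
ubm.fit(np.vstack([mfcc_frames(p) for p, _ in train]))

X = np.stack([supervector(ubm, mfcc_frames(p)) for p, _ in train])
labels = [q for _, q in train]

# Low-rank projection: a crude stand-in for the total variability matrix.
Z = PCA(n_components=40).fit_transform(X)
Z /= np.linalg.norm(Z, axis=1, keepdims=True)  # length-norm: linear ~ cosine

clf = SVC(kernel="linear").fit(Z, labels)      # per-quality "speaker" classifier
```

Length-normalizing the projected vectors makes the linear kernel behave like cosine scoring, the usual comparison rule in i-vector systems.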
Related papers
- Disentangling segmental and prosodic factors to non-native speech comprehensibility [11.098498920630782]
Current accent conversion systems do not disentangle the two main sources of non-native accent: segmental and prosodic characteristics.
We present an AC system that not only decouples voice quality from accent, but also disentangles the latter into its segmental and prosodic characteristics.
We conduct perceptual listening tests to quantify the individual contributions of segmental features and prosody on the perceived comprehensibility of non-native speech.
arXiv Detail & Related papers (2024-08-20T16:43:55Z)
- Homogeneous Speaker Features for On-the-Fly Dysarthric and Elderly Speaker Adaptation [71.31331402404662]
This paper proposes two novel data-efficient methods to learn dysarthric and elderly speaker-level features.
Speaker-level variance-regularized spectral basis embedding (VR-SBE) features exploit a special regularization term to enforce homogeneity of speaker features during adaptation.
Feature-based learning hidden unit contributions (f-LHUC) are conditioned on VR-SBE features and are shown to be insensitive to speaker-level data quantity in test-time adaptation. A sketch of the homogeneity idea follows this entry.
arXiv Detail & Related papers (2024-07-08T18:20:24Z)
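The "special regularization term to enforce homogeneity of speaker features" above suggests penalizing within-speaker variance of the learned embeddings. The sketch below is one plausible reading of that idea, not the paper's VR-SBE formulation; the function name and the loss weighting are hypothetical.

```python
import torch

def within_speaker_variance(embeddings, speaker_ids):
    """Mean within-speaker variance of embedding dimensions.

    Added to a training loss, this pushes all embeddings of one speaker
    toward a common point, i.e. enforces speaker-level homogeneity.
    """
    penalty = embeddings.new_zeros(())
    speakers = speaker_ids.unique()
    for spk in speakers:
        e = embeddings[speaker_ids == spk]
        penalty = penalty + e.var(dim=0, unbiased=False).mean()
    return penalty / len(speakers)

# hypothetical usage inside a training step:
# loss = task_loss + 0.1 * within_speaker_variance(emb, spk_ids)
```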
- A Novel Labeled Human Voice Signal Dataset for Misbehavior Detection [0.7223352886780369]
This research highlights the significance of voice tone and delivery in automated machine-learning systems for voice analysis and recognition.
It contributes to the broader field of voice signal analysis by elucidating the impact of human behaviour on the perception and categorization of voice signals.
arXiv Detail & Related papers (2024-06-28T18:55:07Z)
- Evaluating and Personalizing User-Perceived Quality of Text-to-Speech Voices for Delivering Mindfulness Meditation with Different Physical Embodiments [5.413055126487447]
We evaluated the user-perceived quality of state-of-the-art text-to-speech voices for administering mindfulness meditation.
We found that the best-rated human voice was perceived as better than all TTS voices.
By allowing users to fine-tune TTS voice features, the user-personalized TTS voices could perform almost as well as human voices.
arXiv Detail & Related papers (2024-01-07T21:14:32Z)
- Lightly Weighted Automatic Audio Parameter Extraction for the Quality Assessment of Consensus Auditory-Perceptual Evaluation of Voice [18.8222742272435]
The proposed method utilizes age, sex, and five audio parameters: jitter, absolute jitter, shimmer, harmonic-to-noise ratio (HNR), and zero crossing.
The result reveals that our approach performs similarly to state-of-the-art (SOTA) methods and outperforms the latent representations obtained from popular pre-trained audio models. A sketch of extracting such parameters follows this entry.
arXiv Detail & Related papers (2023-11-27T07:19:22Z)
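For reference, parameters like those above are commonly extracted from Praat via the parselmouth package. The sketch below uses widely quoted default Praat settings; both the settings and the file name are assumptions, not the paper's exact configuration.

```python
import numpy as np
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("voice_sample.wav")        # placeholder file name
floor, ceil = 75, 600                              # pitch search range in Hz

points = call(snd, "To PointProcess (periodic, cc)", floor, ceil)
jitter_local = call(points, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
jitter_abs = call(points, "Get jitter (local, absolute)",
                  0, 0, 0.0001, 0.02, 1.3)
shimmer = call([snd, points], "Get shimmer (local)",
               0, 0, 0.0001, 0.02, 1.3, 1.6)

harmonicity = call(snd, "To Harmonicity (cc)", 0.01, floor, 0.1, 1.0)
hnr = call(harmonicity, "Get mean", 0, 0)          # mean HNR in dB

x = snd.values[0]                                  # raw samples, channel 0
zcr = np.mean(np.abs(np.diff(np.sign(x))) > 0)     # zero-crossing rate
```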
- Show from Tell: Audio-Visual Modelling in Clinical Settings [58.88175583465277]
We consider audio-visual modelling in a clinical setting, providing a solution to learn medical representations without human expert annotation.
A simple yet effective multi-modal self-supervised learning framework is proposed for this purpose.
The proposed approach is able to localise anatomical regions of interest during ultrasound imaging, with only speech audio as a reference.
arXiv Detail & Related papers (2023-10-25T08:55:48Z)
- Make-A-Voice: Unified Voice Synthesis With Discrete Representation [77.3998611565557]
Make-A-Voice is a unified framework for synthesizing and manipulating voice signals from discrete representations.
We show that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models.
arXiv Detail & Related papers (2023-05-30T17:59:26Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to 2.92% absolute word error rate (WER) reduction. A sketch of speed perturbation follows this entry.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
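Speed perturbation, the best-performing augmentation above, is conventionally implemented by resampling the waveform while keeping the nominal sample rate, which changes tempo and pitch together. A minimal sketch follows; the 0.9/1.0/1.1 three-way factors are the common recipe and the file names are placeholders, so this is not necessarily the paper's exact setup.

```python
import librosa
import soundfile as sf

SR = 16000

def speed_perturb(path, factor, sr=SR):
    """Kaldi/sox-style speed perturbation: audio plays at `factor` x speed."""
    y, _ = librosa.load(path, sr=sr)
    # Pretend the signal was sampled at sr*factor, then resample back to sr.
    return librosa.resample(y, orig_sr=int(sr * factor), target_sr=sr)

for factor in (0.9, 1.0, 1.1):                 # standard 3-way perturbation
    y = speed_perturb("utt1.wav", factor)
    sf.write(f"utt1_sp{factor}.wav", y, SR)
```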
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Motivated by the spectro-temporal differences between disordered and normal speech that systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of the speech spectrum are proposed (see the sketch after this entry).
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
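The SVD-derived basis embedding above can be illustrated in a few lines: decompose a magnitude spectrogram and keep the leading singular vectors as spectral and temporal bases. This shows only the general mechanics under assumed STFT settings; the paper learns deep features on top of such subspace bases.

```python
import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)            # placeholder file
spec = np.abs(librosa.stft(y, n_fft=512, hop_length=160))  # (freq, time)

# SVD: columns of U are spectral bases, rows of Vt are temporal bases.
U, s, Vt = np.linalg.svd(spec, full_matrices=False)

k = 4                                  # keep the top-k subspace
spectral_basis = U[:, :k]              # gross spectral envelope structure
temporal_basis = Vt[:k, :]             # gross temporal modulation structure
embedding = spectral_basis.ravel()     # fixed-length utterance descriptor
```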
- Analysis and Tuning of a Voice Assistant System for Dysfluent Speech [7.233685721929227]
Speech recognition systems do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks.
We show that by tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders.
arXiv Detail & Related papers (2021-06-18T20:58:34Z)
- VoiceCoach: Interactive Evidence-based Training for Voice Modulation Skills in Public Speaking [55.366941476863644]
The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
arXiv Detail & Related papers (2020-01-22T04:52:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.