The Use of Voice Source Features for Sung Speech Recognition
- URL: http://arxiv.org/abs/2102.10376v2
- Date: Tue, 23 Feb 2021 16:18:28 GMT
- Title: The Use of Voice Source Features for Sung Speech Recognition
- Authors: Gerardo Roa Dabike, Jon Barker
- Abstract summary: We first use a parallel singing/speaking corpus to illustrate differences in sung vs spoken voicing characteristics.
We then use this analysis to inform speech recognition experiments on the sung speech DSing corpus.
Experiments are run with three standard (increasingly large) training sets: DSing1 (15.1 hours), DSing3 (44.7 hours) and DSing30 (149.1 hours).
- Score: 24.129307615741695
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we ask whether vocal source features (pitch, shimmer, jitter,
etc.) can improve the performance of automatic sung speech recognition, arguing
that conclusions previously drawn from spoken speech studies may not be valid
in the sung speech domain. We first use a parallel singing/speaking corpus
(NUS-48E) to illustrate differences in sung vs spoken voicing characteristics
including pitch range, syllable duration, vibrato, jitter and shimmer. We then
use this analysis to inform speech recognition experiments on the sung speech
DSing corpus, using a state-of-the-art acoustic model and augmenting
conventional features with various voice source parameters. Experiments are run
with three standard (increasingly large) training sets, DSing1 (15.1 hours),
DSing3 (44.7 hours) and DSing30 (149.1 hours). Pitch combined with degree of
voicing produces a significant decrease in WER, from 38.1% to 36.7%, when
training with DSing1; however, the smaller WER decreases observed when training
with the larger, more varied DSing3 and DSing30 sets were not found to be
statistically significant. Voicing quality characteristics did not improve
recognition performance although analysis suggests that they do contribute to
an improved discrimination between voiced/unvoiced phoneme pairs.
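As a rough illustration of the feature augmentation described above, the sketch below extracts pitch and a per-frame voicing probability (a simple proxy for degree of voicing) with librosa's pYIN implementation and stacks them with conventional MFCCs. The file name, sample rate and frame settings are placeholders; the paper's actual front end may differ.

```python
import numpy as np
import librosa

# Placeholder input; any monophonic vocal recording would do.
y, sr = librosa.load("sung_utterance.wav", sr=16000)

hop = 160  # 10 ms frame shift at 16 kHz, a common ASR setting

# Conventional spectral features: 13 MFCCs per frame.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)

# Voice source features: pitch (f0) and voicing probability from pYIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"),
    sr=sr, hop_length=hop)
log_f0 = np.log(np.nan_to_num(f0, nan=1.0))  # log-pitch, 0 for unvoiced frames

# Align lengths (the two front ends can differ by a frame) and stack:
# each column is now [13 MFCCs, log-f0, degree of voicing].
n = min(mfcc.shape[1], log_f0.shape[0])
features = np.vstack([mfcc[:, :n], log_f0[:n], voiced_prob[:n]])
print(features.shape)
```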
Related papers
- Singer Identity Representation Learning using Self-Supervised Techniques [0.0]
We propose a framework for training singer identity encoders to extract representations suitable for various singing-related tasks.
We explore different self-supervised learning techniques on a large collection of isolated vocal tracks.
We evaluate the quality of the resulting representations on singer similarity and identification tasks.
arXiv Detail & Related papers (2024-01-10T10:41:38Z) - Enhancing the vocal range of single-speaker singing voice synthesis with
melody-unsupervised pre-training [82.94349771571642]
This work proposes a melody-unsupervised multi-speaker pre-training method to enhance the vocal range of a single-speaker SVS system.
It is the first to introduce a differentiable duration regulator to improve the rhythm naturalness of the synthesized voice.
Experimental results verify that the proposed SVS system outperforms the baseline on both sound quality and naturalness.
arXiv Detail & Related papers (2023-09-01T06:40:41Z) - PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech
Enhancement [41.872384434583466]
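The differentiable duration regulator mentioned in the entry above is not specified in this summary; one common way to make duration regulation differentiable is Gaussian upsampling, sketched below with PyTorch and placeholder dimensions. This is an illustrative stand-in, not the paper's exact design.

```python
import torch

def gaussian_upsample(phoneme_enc, durations, sigma=1.0):
    """Differentiable duration regulation by Gaussian upsampling.

    phoneme_enc: (N, D) phoneme-level encodings
    durations:   (N,) predicted durations in frames (can carry gradients)
    Returns frame-level features of shape (T, D), T = round(sum(durations)).
    """
    # Centre of each phoneme on the frame axis.
    ends = torch.cumsum(durations, dim=0)
    centres = ends - 0.5 * durations                         # (N,)

    T = int(torch.round(ends[-1]).item())
    t = torch.arange(T, dtype=durations.dtype).unsqueeze(1)  # (T, 1)

    # Soft, differentiable alignment between frames and phonemes.
    logits = -((t - centres.unsqueeze(0)) ** 2) / (2 * sigma ** 2)
    weights = torch.softmax(logits, dim=1)                   # (T, N)
    return weights @ phoneme_enc                             # (T, D)

# Toy usage with hypothetical sizes.
enc = torch.randn(6, 16)                  # 6 phonemes, 16-dim encodings
dur = torch.tensor([3., 5., 2., 4., 6., 3.], requires_grad=True)
frames = gaussian_upsample(enc, dur)
frames.sum().backward()                   # gradients flow back into the durations
print(frames.shape, dur.grad.shape)
```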
- PAAPLoss: A Phonetic-Aligned Acoustic Parameter Loss for Speech Enhancement [41.872384434583466]
We propose a learning objective that formalizes differences in perceptual quality.
We identify temporal acoustic parameters that are non-differentiable.
We develop a neural network estimator that can accurately predict their time-series values.
arXiv Detail & Related papers (2023-02-16T05:17:06Z) - Self-Supervised Speech Representations Preserve Speech Characteristics
while Anonymizing Voices [15.136348385992047]
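A minimal sketch of the pattern described in the PAAPLoss entry above: a hypothetical (untrained) neural estimator stands in for the non-differentiable temporal acoustic parameters, and the loss matches the parameter trajectories predicted for the enhanced and clean signals. Module names and sizes are placeholders, not the paper's actual model.

```python
import torch
import torch.nn as nn

class AcousticParamEstimator(nn.Module):
    """Placeholder network mapping a spectrogram (B, T, F) to per-frame
    acoustic parameters (B, T, P), e.g. jitter-, shimmer- or formant-like values."""
    def __init__(self, n_freq=257, n_params=8):
        super().__init__()
        self.rnn = nn.GRU(n_freq, 128, batch_first=True)
        self.head = nn.Linear(128, n_params)

    def forward(self, spec):
        h, _ = self.rnn(spec)
        return self.head(h)

def acoustic_param_loss(estimator, enhanced_spec, clean_spec):
    """Differentiable surrogate: match the predicted acoustic-parameter
    trajectories of the enhanced output to those of the clean reference."""
    with torch.no_grad():                      # estimator treated as pre-trained and frozen
        target = estimator(clean_spec)
    pred = estimator(enhanced_spec)            # gradients flow back into the enhancer
    return nn.functional.l1_loss(pred, target)

# Toy usage with random tensors standing in for real spectrograms.
est = AcousticParamEstimator().eval()
clean = torch.rand(2, 100, 257)
enhanced = torch.rand(2, 100, 257, requires_grad=True)
loss = acoustic_param_loss(est, enhanced, clean)
loss.backward()
```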
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z) - Spectro-Temporal Deep Features for Disordered Speech Assessment and
Recognition [65.25325641528701]
Motivated by the spectro-temporal differences between disordered and normal speech, which systematically manifest in articulatory imprecision, decreased volume and clarity, slower speaking rates and increased dysfluencies, novel spectro-temporal subspace basis embedding deep features derived by SVD decomposition of the speech spectrum are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER) with or without data augmentation.
arXiv Detail & Related papers (2022-01-14T16:56:43Z) - Supervised Contrastive Learning for Accented Speech Recognition [7.5253263976291676]
- Supervised Contrastive Learning for Accented Speech Recognition [7.5253263976291676]
We study the supervised contrastive learning framework for accented speech recognition.
We show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average.
arXiv Detail & Related papers (2021-07-02T09:23:33Z) - Analysis and Tuning of a Voice Assistant System for Dysfluent Speech [7.233685721929227]
- Analysis and Tuning of a Voice Assistant System for Dysfluent Speech [7.233685721929227]
Speech recognition systems do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks.
We show that by tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders.
arXiv Detail & Related papers (2021-06-18T20:58:34Z) - High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
- High Fidelity Speech Regeneration with Application to Speech Enhancement [96.34618212590301]
We propose a wav-to-wav generative model for speech that can generate 24 kHz speech in real time.
Inspired by voice conversion methods, we train to augment the speech characteristics while preserving the identity of the source.
arXiv Detail & Related papers (2021-01-31T10:54:27Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z) - Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
- Speech Enhancement using Self-Adaptation and Multi-Head Self-Attention [70.82604384963679]
This paper investigates a self-adaptation method for speech enhancement using auxiliary speaker-aware features.
We extract a speaker representation used for adaptation directly from the test utterance.
arXiv Detail & Related papers (2020-02-14T05:05:36Z) - VoiceCoach: Interactive Evidence-based Training for Voice Modulation
Skills in Public Speaking [55.366941476863644]
The modulation of voice properties, such as pitch, volume, and speed, is crucial for delivering a successful public speech.
We present VoiceCoach, an interactive evidence-based approach to facilitate the effective training of voice modulation skills.
arXiv Detail & Related papers (2020-01-22T04:52:06Z)
This list is automatically generated from the titles and abstracts of the papers on this site.