Analysis and Tuning of a Voice Assistant System for Dysfluent Speech
- URL: http://arxiv.org/abs/2106.11759v1
- Date: Fri, 18 Jun 2021 20:58:34 GMT
- Title: Analysis and Tuning of a Voice Assistant System for Dysfluent Speech
- Authors: Vikramjit Mitra, Zifang Huang, Colin Lea, Lauren Tooley, Sarah Wu,
Darren Botten, Ashwini Palekar, Shrinath Thelapurath, Panayiotis Georgiou,
Sachin Kajarekar, Jeffrey Bigham
- Abstract summary: Speech recognition systems do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks.
We show that by tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24% (relative) for individuals with fluency disorders.
- Score: 7.233685721929227
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Dysfluencies and variations in speech pronunciation can severely degrade
speech recognition performance, and for many individuals with
moderate-to-severe speech disorders, voice operated systems do not work.
Current speech recognition systems are trained primarily with data from fluent
speakers and as a consequence do not generalize well to speech with
dysfluencies such as sound or word repetitions, sound prolongations, or audible
blocks. The focus of this work is on quantitative analysis of a consumer speech
recognition system on individuals who stutter and production-oriented
approaches for improving performance for common voice assistant tasks (i.e.,
"what is the weather?"). At baseline, this system introduces a significant
number of insertion and substitution errors resulting in intended speech Word
Error Rates (isWER) that are 13.64% worse (absolute) for individuals with
fluency disorders. We show that by simply tuning the decoding parameters in an
existing hybrid speech recognition system one can improve isWER by 24%
(relative) for individuals with fluency disorders. Tuning these parameters
translates to 3.6% better domain recognition and 1.7% better intent
recognition relative to the default setup for the 18 study participants across
all stuttering severities.
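The baseline-vs-tuned comparison above can be sketched with a generic word error rate computation. This is a hedged sketch: `wer` and `relative_improvement` are illustrative helpers, not the paper's code, and the paper's isWER (which scores against the speaker's *intended* words) is approximated here by an ordinary reference transcript.

```python
# WER = (substitutions + deletions + insertions) / reference length.
# For dysfluent speech, sound/word repetitions typically surface as
# insertion errors against the intended words, as the abstract notes.

def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word error rate via Levenshtein distance over words."""
    n, m = len(reference), len(hypothesis)
    # dp[i][j] = minimum edits turning reference[:i] into hypothesis[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                          # deletions
    for j in range(m + 1):
        dp[0][j] = j                          # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[n][m] / max(n, 1)

def relative_improvement(baseline: float, tuned: float) -> float:
    """Relative error-rate improvement, e.g. 0.24 for a 24% gain."""
    return (baseline - tuned) / baseline

# A dysfluent rendering of a voice-assistant query: the word
# repetitions count as 2 insertions against 4 intended words.
intended = "what is the weather".split()
decoded = "what what is is the weather".split()
print(f"isWER = {wer(intended, decoded):.2%}")
# Illustrative baseline/tuned error rates, not the paper's figures:
print(f"relative gain = {relative_improvement(0.30, 0.228):.0%}")
```

The distinction matters when reading the abstract: the 13.64% gap is absolute (percentage points of isWER), while the 24% tuning gain is relative to the baseline error rate.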
Related papers
- Gammatonegram Representation for End-to-End Dysarthric Speech Processing Tasks: Speech Recognition, Speaker Identification, and Intelligibility Assessment [1.0359008237358598]
Dysarthria is a disability that causes a disturbance in the human speech system.
We introduce gammatonegram as an effective method to represent audio files with discriminative details.
We convert each speech file into an image and propose an image recognition system to classify speech in different scenarios.
arXiv Detail & Related papers (2023-07-06T21:10:50Z)
- Latent Phrase Matching for Dysarthric Speech [23.23672790496787]
Many consumer speech recognition systems are not tuned for people with speech disabilities.
We propose a query-by-example-based personalized phrase recognition system that is trained using small amounts of speech.
Performance degrades as the number of phrases increases, but the system consistently outperforms ASR systems when trained with 50 unique phrases.
arXiv Detail & Related papers (2023-06-08T17:28:28Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate within 1% of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Spectro-Temporal Deep Features for Disordered Speech Assessment and Recognition [65.25325641528701]
Disordered speech differs from normal speech at the spectro-temporal level, systematically manifesting in articulatory imprecision, decreased volume and clarity, slower speaking rates, and increased dysfluencies. Motivated by this, novel spectro-temporal subspace basis embedding deep features, derived by SVD decomposition of the speech spectrum, are proposed.
Experiments conducted on the UASpeech corpus suggest the proposed spectro-temporal deep feature adapted systems consistently outperformed baseline i-vector adaptation by up to 2.63% absolute (8.6% relative) reduction in word error rate (WER), with or without data augmentation.
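The core idea behind such subspace basis features can be sketched as follows: SVD of a (frequency x time) spectrogram yields spectral bases (left singular vectors) and temporal bases (right singular vectors), and the top components give a compact embedding. The shapes, the top-k choice, and the random input here are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def subspace_bases(spectrogram: np.ndarray, k: int = 4):
    """Return top-k spectral bases, singular values, temporal bases."""
    # u: spectral bases (freq x freq), vt: temporal bases (freq x time),
    # s: singular values in descending order.
    u, s, vt = np.linalg.svd(spectrogram, full_matrices=False)
    return u[:, :k], s[:k], vt[:k, :]

# Stand-in for a magnitude spectrogram: 80 mel bins x 200 frames.
rng = np.random.default_rng(0)
spec = np.abs(rng.standard_normal((80, 200)))
u_k, s_k, vt_k = subspace_bases(spec, k=4)
print(u_k.shape, vt_k.shape)  # (80, 4) (4, 200)
```

The low-rank bases summarize how spectral shape and temporal dynamics co-vary, which is where the abstract locates the differences between disordered and typical speech.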
arXiv Detail & Related papers (2022-01-14T16:56:43Z)
- Dynamic Acoustic Unit Augmentation With BPE-Dropout for Low-Resource End-to-End Speech Recognition [62.94773371761236]
We consider building an effective end-to-end ASR system in low-resource setups with a high OOV rate.
We propose a method of dynamic acoustic unit augmentation based on the BPE-dropout technique.
Our monolingual Turkish Conformer established a competitive result with 22.2% character error rate (CER) and 38.9% word error rate (WER).
arXiv Detail & Related papers (2021-03-12T10:10:13Z)
- I-vector Based Within Speaker Voice Quality Identification on connected speech [3.2116198597240846]
Voice disorders affect a large portion of the population, especially heavy voice users such as teachers or call-center workers.
Most voice disorders can be treated with behavioral voice therapy, which teaches patients to replace problematic, habituated voice production mechanics.
We built two systems that automatically differentiate various voice qualities produced by the same individual.
arXiv Detail & Related papers (2021-02-15T02:26:32Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.