Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered
Speech
- URL: http://arxiv.org/abs/2211.09731v1
- Date: Fri, 4 Nov 2022 23:45:31 GMT
- Title: Stutter-TTS: Controlled Synthesis and Improved Recognition of Stuttered
Speech
- Authors: Xin Zhang, Iván Vallés-Pérez, Andreas Stolcke, Chengzhu Yu,
Jasha Droppo, Olabanji Shonibare, Roberto Barra-Chicote, Venkatesh
Ravichandran
- Abstract summary: Stuttering is a speech disorder where the natural flow of speech is interrupted by blocks, repetitions or prolongations of syllables, words and phrases.
We describe Stutter-TTS, an end-to-end neural text-to-speech model capable of synthesizing diverse types of stuttering utterances.
- Score: 20.2646788350211
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Stuttering is a speech disorder where the natural flow of speech is
interrupted by blocks, repetitions or prolongations of syllables, words and
phrases. The majority of existing automatic speech recognition (ASR) interfaces
perform poorly on utterances with stutter, mainly due to lack of matched
training data. Synthesis of speech with stutter thus presents an opportunity to
improve ASR for this type of speech. We describe Stutter-TTS, an end-to-end
neural text-to-speech model capable of synthesizing diverse types of stuttering
utterances. We develop a simple, yet effective prosody-control strategy whereby
additional tokens are introduced into source text during training to represent
specific stuttering characteristics. By choosing the position of the stutter
tokens, Stutter-TTS allows word-level control of where stuttering occurs in the
synthesized utterance. We are able to synthesize stutter events with high
accuracy (F1-scores between 0.63 and 0.84, depending on stutter type). By
fine-tuning an ASR model on synthetic stuttered speech we are able to reduce
word error by 5.7% relative on stuttered utterances, with only minor (<0.2%
relative) degradation for fluent utterances.
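The prosody-control strategy lends itself to a short illustration. Below is a minimal sketch of how stutter tokens might be spliced into source text; the token names ([REP], [PRO], [BLK]) and the helper function are assumptions for illustration, since the exact token inventory is not given here.

```python
# Sketch of the paper's prosody-control idea: control tokens are inserted
# into the source text so the model learns to associate them with stuttered
# realizations at that position. Token names are illustrative assumptions.

STUTTER_TOKENS = {
    "repetition": "[REP]",
    "prolongation": "[PRO]",
    "block": "[BLK]",
}

def insert_stutter_token(text: str, word_index: int, stutter_type: str) -> str:
    """Insert a stutter token before the word at `word_index`, marking
    where the synthesized utterance should stutter."""
    words = text.split()
    if not 0 <= word_index < len(words):
        raise IndexError("word_index out of range")
    words.insert(word_index, STUTTER_TOKENS[stutter_type])
    return " ".join(words)

# e.g. "please [REP] call Stella" requests a repetition on "call"
print(insert_stutter_token("please call Stella", 1, "repetition"))
```

During training, such tagged text would be paired with audio exhibiting the corresponding stutter event at that position; at synthesis time the same tags give word-level control over where stuttering occurs.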
Related papers
- Self-supervised Speech Models for Word-Level Stuttered Speech Detection [66.46810024006712]
We introduce a word-level stuttering speech detection model leveraging self-supervised speech models.
Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection.
arXiv Detail & Related papers (2024-09-16T20:18:20Z)
- MMSD-Net: Towards Multi-modal Stuttering Detection [9.257985820122999]
MMSD-Net is the first multi-modal neural framework for stuttering detection.
Our model yields an improvement of 2-17% in the F1-score over existing state-of-the-art uni-modal approaches.
arXiv Detail & Related papers (2024-07-16T08:26:59Z)
- Inclusive ASR for Disfluent Speech: Cascaded Large-Scale Self-Supervised Learning with Targeted Fine-Tuning and Data Augmentation [0.0]
A critical barrier to progress is the scarcity of large, annotated disfluent speech datasets.
We present an inclusive ASR design approach, leveraging self-supervised learning on standard speech followed by targeted fine-tuning and data augmentation.
Results show that fine-tuning wav2vec 2.0 with even a relatively small, labeled dataset, alongside data augmentation, can significantly reduce word error rates for disfluent speech.
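A rough sketch of the fine-tuning recipe this entry describes, using the Hugging Face transformers API; the checkpoint name, optimizer, and learning rate below are illustrative assumptions, not values from the paper.

```python
# Hedged sketch: CTC fine-tuning of a pretrained wav2vec 2.0 checkpoint on
# labeled (16 kHz audio, transcript) pairs from a disfluent-speech set.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.freeze_feature_encoder()  # common practice: keep the CNN frontend fixed

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def training_step(waveform, transcript):
    """One CTC fine-tuning step on a single (audio, text) pair."""
    inputs = processor(waveform, sampling_rate=16_000, return_tensors="pt")
    labels = processor(text=transcript, return_tensors="pt").input_ids
    loss = model(inputs.input_values, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```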
arXiv Detail & Related papers (2024-06-14T16:56:40Z)
- EXPRESSO: A Benchmark and Analysis of Discrete Expressive Speech Resynthesis [49.04496602282718]
We introduce Expresso, a high-quality expressive speech dataset for textless speech synthesis.
This dataset includes both read speech and improvised dialogues rendered in 26 spontaneous expressive styles.
We evaluate resynthesis quality with automatic metrics for different self-supervised discrete encoders.
arXiv Detail & Related papers (2023-08-10T17:41:19Z)
- DisfluencyFixer: A tool to enhance Language Learning through Speech To Speech Disfluency Correction [50.51901599433536]
DisfluencyFixer is a tool that performs speech-to-speech disfluency correction in English and Hindi.
Our proposed system removes disfluencies from input speech and returns fluent speech as output.
arXiv Detail & Related papers (2023-05-26T14:13:38Z)
- NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers [90.83782600932567]
We develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors.
We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers.
NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting.
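For readers unfamiliar with the codec component, here is a toy sketch of residual vector quantization (RVQ): each stage quantizes the residual left over by the previous stage, and the resulting code indices become the discrete latents. Codebook sizes and dimensions are illustrative assumptions.

```python
# Toy RVQ encoder: a stack of codebooks, each refining the residual error
# of the stage before it.
import torch

def rvq_encode(x, codebooks):
    """Quantize frames x (n, dim) with a stack of codebooks; return the
    per-stage code indices and the accumulated reconstruction."""
    residual, codes, recon = x, [], torch.zeros_like(x)
    for cb in codebooks:                      # cb: (codebook_size, dim)
        dists = torch.cdist(residual.unsqueeze(0), cb.unsqueeze(0))[0]
        idx = dists.argmin(dim=-1)            # nearest codeword per frame
        q = cb[idx]
        codes.append(idx)
        recon = recon + q
        residual = residual - q               # next stage sees the residual
    return codes, recon

codebooks = [torch.randn(256, 64) for _ in range(4)]  # 4 stages, dim 64
codes, recon = rvq_encode(torch.randn(10, 64), codebooks)
```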
arXiv Detail & Related papers (2023-04-18T16:31:59Z)
- TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech shows a significant improvement in inference latency, enabling a speedup of up to 21.4x over the autoregressive technique.
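The mask-and-predict loop behind such non-autoregressive decoding can be sketched as follows; `model` is a stand-in for the unit predictor, and the linear re-masking schedule is an assumption, not the authors' exact implementation.

```python
# Sketch of iterative mask-predict decoding: predict all units at once,
# then re-mask the least confident positions and refine.
import torch

def mask_predict(model, length, mask_id, iterations=4):
    tokens = torch.full((length,), mask_id)      # start fully masked
    for t in range(iterations):
        logits = model(tokens)                   # stand-in: (length, vocab)
        conf, pred = logits.softmax(-1).max(-1)
        tokens = pred                            # accept all predictions
        n_mask = int(length * (1 - (t + 1) / iterations))
        if n_mask == 0:
            break
        worst = conf.topk(n_mask, largest=False).indices
        tokens[worst] = mask_id                  # re-mask low confidence
    return tokens
```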
arXiv Detail & Related papers (2022-05-25T06:34:14Z)
- Detecting Dysfluencies in Stuttering Therapy Using wav2vec 2.0 [0.22940141855172028]
Fine-tuning wav2vec 2.0 for the classification of stuttering on a sizeable English corpus boosts the effectiveness of the general-purpose features.
We evaluate our method on FluencyBank and the German therapy-centric Kassel State of Fluency dataset.
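A minimal sketch of such a classification setup, assuming the Hugging Face transformers API and an illustrative five-way label set (not the paper's exact configuration):

```python
# Hedged sketch: utterance-level stutter-type classification on top of a
# wav2vec 2.0 encoder. Checkpoint and label count are assumptions.
import torch
from transformers import AutoFeatureExtractor, Wav2Vec2ForSequenceClassification

extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=5)  # e.g. block/rep/pro/interj/fluent

def classify(waveform):
    """Return the predicted class index for one 16 kHz waveform."""
    inputs = extractor(waveform, sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return logits.argmax(-1).item()
```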
arXiv Detail & Related papers (2022-04-07T13:02:12Z)
- Enhancing ASR for Stuttered Speech with Limited Data Using Detect and Pass [0.0]
It is estimated that around 70 million people worldwide are affected by a speech disorder called stuttering.
We propose a simple but effective method called 'Detect and Pass' to make modern ASR systems accessible for People Who Stutter.
arXiv Detail & Related papers (2022-02-08T19:55:23Z)
- Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, we switch the quantized representations of the original and noisy speech as additional prediction targets.
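The target-swapping idea can be sketched as follows; `encode` and `contrastive_loss` are stand-ins for the model internals, not the released implementation.

```python
# Sketch of wav2vec-Switch's swapped targets: contextual features of the
# noisy view are trained against the quantized targets of the clean view,
# and vice versa, alongside the usual wav2vec 2.0 contrastive terms.
def switch_loss(model, clean_wav, noisy_wav):
    c_clean, q_clean = model.encode(clean_wav)   # contextual + quantized reps
    c_noisy, q_noisy = model.encode(noisy_wav)
    # standard terms: each view predicts its own quantized targets
    loss = model.contrastive_loss(c_clean, q_clean)
    loss = loss + model.contrastive_loss(c_noisy, q_noisy)
    # switched terms: swap quantized targets across the pair
    loss = loss + model.contrastive_loss(c_clean, q_noisy)
    loss = loss + model.contrastive_loss(c_noisy, q_clean)
    return loss
```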
arXiv Detail & Related papers (2021-10-11T00:08:48Z)
- Towards Automated Assessment of Stuttering and Stuttering Therapy [0.22940141855172028]
Stuttering is a complex speech disorder that can be identified by repetitions, prolongations of sounds, syllables or words, and blocks while speaking.
Common methods for the assessment of stuttering severity include percent stuttered syllables (%SS), the average of the three longest stuttering symptoms during a speech task, or the recently introduced Speech Efficiency Score (SES).
This paper introduces the Speech Control Index (SCI), a new method to evaluate the severity of stuttering.
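For reference, %SS is straightforward to compute; a worked example with illustrative counts:

```python
# Percent stuttered syllables (%SS), the baseline severity measure this
# entry contrasts with SES and the proposed SCI.
def percent_stuttered_syllables(stuttered: int, total: int) -> float:
    """%SS = stuttered syllables / total syllables * 100."""
    return 100.0 * stuttered / total

print(percent_stuttered_syllables(12, 300))  # -> 4.0 %SS
```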
arXiv Detail & Related papers (2020-06-16T14:50:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.