Towards Identity Preserving Normal to Dysarthric Voice Conversion
- URL: http://arxiv.org/abs/2110.08213v1
- Date: Fri, 15 Oct 2021 17:18:02 GMT
- Title: Towards Identity Preserving Normal to Dysarthric Voice Conversion
- Authors: Wen-Chin Huang, Bence Mark Halpern, Lester Phillip Violeta, Odette
Scharenborg, Tomoki Toda
- Abstract summary: We present a framework that converts normal speech into dysarthric speech while preserving the speaker identity.
This is essential for (1) clinical decision making processes and alleviation of patient stress, and (2) data augmentation for dysarthric speech recognition.
- Score: 37.648612382457756
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present a voice conversion framework that converts normal speech into
dysarthric speech while preserving the speaker identity. Such a framework is
essential for (1) clinical decision making processes and alleviation of patient
stress, and (2) data augmentation for dysarthric speech recognition. This is an
especially challenging task since the converted samples should capture the
severity of dysarthric speech while being highly natural and possessing the
speaker identity of the normal speaker. To this end, we adopted a two-stage
framework, which consists of a sequence-to-sequence model and a nonparallel
frame-wise model. Objective and subjective evaluations were conducted on the
UASpeech dataset, and results showed that the method was able to yield
reasonable naturalness and capture severity aspects of the pathological speech.
On the other hand, the similarity to the normal source speaker's voice was
limited and requires further improvement.
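The abstract only names the two components of the framework (a sequence-to-sequence model followed by a nonparallel frame-wise model). As a rough illustration of how such a two-stage pipeline could be wired together, here is a minimal, hypothetical PyTorch sketch; the module names, layer choices, and feature dimensions are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of a two-stage normal-to-dysarthric conversion pipeline.
# Assumptions (not from the paper): mel-spectrogram features, PyTorch, and the
# concrete layer choices below; only the two-stage structure follows the abstract.
import torch
import torch.nn as nn


class Seq2SeqDysarthricConverter(nn.Module):
    """Stage 1 (hypothetical): a sequence-to-sequence model mapping normal-speech
    mel frames to dysarthric-style mel frames, so severity-related prosody and
    duration changes can be modelled."""

    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.encoder = nn.GRU(n_mels, d_model, batch_first=True, bidirectional=True)
        self.decoder = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.proj = nn.Linear(d_model, n_mels)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # (B, T, n_mels)
        enc, _ = self.encoder(mel)
        dec, _ = self.decoder(enc)
        return self.proj(dec)


class FrameWiseIdentityConverter(nn.Module):
    """Stage 2 (hypothetical): a nonparallel frame-wise model conditioned on a
    speaker embedding, intended to push the converted frames back toward the
    normal source speaker's identity."""

    def __init__(self, n_mels: int = 80, spk_dim: int = 128, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_mels + spk_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_mels)
        )

    def forward(self, mel: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        spk = spk_emb.unsqueeze(1).expand(-1, mel.size(1), -1)  # broadcast over time
        return self.net(torch.cat([mel, spk], dim=-1))


def normal_to_dysarthric(mel: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
    """Run both stages; a neural vocoder (not shown) would synthesise the waveform."""
    stage1 = Seq2SeqDysarthricConverter()
    stage2 = FrameWiseIdentityConverter()
    return stage2(stage1(mel), spk_emb)


if __name__ == "__main__":
    mel = torch.randn(1, 200, 80)   # dummy normal-speech mel-spectrogram
    spk = torch.randn(1, 128)       # dummy source-speaker embedding
    print(normal_to_dysarthric(mel, spk).shape)  # torch.Size([1, 200, 80])
```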
Related papers
- Perceiver-Prompt: Flexible Speaker Adaptation in Whisper for Chinese Disordered Speech Recognition [40.44769351506048]
Perceiver-Prompt is a method for speaker adaptation that utilizes P-Tuning on the Whisper large-scale model.
We first fine-tune Whisper using LoRA and then integrate a trainable Perceiver to generate fixed-length speaker prompts from variable-length inputs.
arXiv Detail & Related papers (2024-06-14T09:36:46Z)
- Use of Speech Impairment Severity for Dysarthric Speech Recognition [37.93801885333925]
This paper proposes a novel set of techniques to use both severity and speaker identity in dysarthric speech recognition.
Experiments conducted on UASpeech incorporate speech impairment severity into state-of-the-art hybrid DNN, E2E Conformer, and pre-trained Wav2vec 2.0 ASR systems.
arXiv Detail & Related papers (2023-05-18T02:42:59Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
A speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- Investigation of Data Augmentation Techniques for Disordered Speech Recognition [69.50670302435174]
This paper investigates a set of data augmentation techniques for disordered speech recognition.
Both normal and disordered speech were exploited in the augmentation process.
The final speaker-adapted system, constructed using the UASpeech corpus and the best augmentation approach based on speed perturbation, produced up to a 2.92% absolute word error rate (WER) reduction.
arXiv Detail & Related papers (2022-01-14T17:09:22Z)
- Pathological voice adaptation with autoencoder-based voice conversion [15.687800631199616]
Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics.
This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech.
arXiv Detail & Related papers (2021-06-15T20:38:10Z)
- A Preliminary Study of a Two-Stage Paradigm for Preserving Speaker Identity in Dysarthric Voice Conversion [50.040466658605524]
We propose a new paradigm for maintaining speaker identity in dysarthric voice conversion (DVC).
The poor quality of dysarthric speech can be greatly improved by statistical VC.
However, since the normal speech utterances of a dysarthric patient are nearly impossible to collect, previous work failed to recover the individuality of the patient.
arXiv Detail & Related papers (2021-06-02T18:41:03Z)
- Learning Explicit Prosody Models and Deep Speaker Embeddings for Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z)
- Disentangled Speech Embeddings using Cross-modal Self-supervision [119.94362407747437]
We develop a self-supervised learning objective that exploits the natural cross-modal synchrony between faces and audio in video.
We construct a two-stream architecture which: (1) shares low-level features common to both representations; and (2) provides a natural mechanism for explicitly disentangling these factors.
arXiv Detail & Related papers (2020-02-20T14:13:12Z)
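The last entry above describes a two-stream architecture trained with cross-modal self-supervision between faces and audio. As a rough illustration (not the authors' model), the sketch below shows one way a shared low-level trunk with separate content and identity heads could look; the face branch and the synchrony/identity losses are omitted, and all names, shapes, and layer choices are assumptions.

```python
# Hypothetical two-stream audio encoder: a shared trunk feeding a content head
# (time-resolved, so it could be aligned with video frames) and an identity head
# (pooled over time into a single speaker embedding). Illustrative only.
import torch
import torch.nn as nn


class TwoStreamSpeechEncoder(nn.Module):
    def __init__(self, n_mels: int = 40, trunk_dim: int = 128, emb_dim: int = 64):
        super().__init__()
        # Shared low-level features over the mel-spectrogram (1-D convs over time).
        self.trunk = nn.Sequential(
            nn.Conv1d(n_mels, trunk_dim, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(trunk_dim, trunk_dim, kernel_size=5, padding=2), nn.ReLU(),
        )
        # Content stream: keeps the time axis.
        self.content_head = nn.Conv1d(trunk_dim, emb_dim, kernel_size=1)
        # Identity stream: pools over time into one embedding per utterance.
        self.identity_head = nn.Sequential(
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(trunk_dim, emb_dim)
        )

    def forward(self, mel: torch.Tensor):  # (B, n_mels, T)
        shared = self.trunk(mel)
        return self.content_head(shared), self.identity_head(shared)


if __name__ == "__main__":
    enc = TwoStreamSpeechEncoder()
    content, identity = enc(torch.randn(2, 40, 100))
    print(content.shape, identity.shape)  # torch.Size([2, 64, 100]) torch.Size([2, 64])
```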