Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion
- URL: http://arxiv.org/abs/2011.01678v2
- Date: Thu, 17 Jun 2021 12:50:49 GMT
- Title: Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion
- Authors: Disong Wang, Songxiang Liu, Lifa Sun, Xixin Wu, Xunying Liu and Helen
Meng
- Abstract summary: We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
- Score: 60.808838088376675
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Though significant progress has been made for the voice conversion (VC) of
typical speech, VC for atypical speech, e.g., dysarthric and second-language
(L2) speech, remains a challenge, since it involves correcting for atypical
prosody while maintaining speaker identity. To address this issue, we propose a
VC system with explicit prosodic modelling and deep speaker embedding (DSE)
learning. First, a speech-encoder strives to extract robust phoneme embeddings
from atypical speech. Second, a prosody corrector takes in phoneme embeddings
to infer typical phoneme duration and pitch values. Third, a conversion model
takes phoneme embeddings and typical prosody features as inputs to generate the
converted speech, conditioned on the target DSE that is learned via a speaker
encoder or speaker adaptation. Extensive experiments demonstrate that speaker
adaptation achieves higher speaker similarity, while the speaker-encoder-based
conversion model greatly reduces dysarthric and non-native pronunciation
patterns and improves speech intelligibility. A comparison of speech
recognition results between the original dysarthric speech and the converted
speech shows that absolute reductions of 47.6% in character error rate (CER)
and 29.3% in word error rate (WER) can be achieved.
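To make the described pipeline concrete, the following PyTorch-style sketch composes a speech encoder, a prosody corrector, and a conversion model conditioned on a deep speaker embedding (DSE). All module structures, layer choices, and dimensions are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the described pipeline: speech encoder -> prosody corrector
# -> conversion model conditioned on a deep speaker embedding (DSE).
# All shapes, layer choices, and names are illustrative assumptions.
import torch
import torch.nn as nn

class SpeechEncoder(nn.Module):
    """Maps acoustic frames to frame-level phoneme embeddings."""
    def __init__(self, n_mels=80, d_model=256):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, d_model, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, mels):                      # (B, T, n_mels)
        h, _ = self.rnn(mels)
        return self.proj(h)                       # (B, T, d_model)

class ProsodyCorrector(nn.Module):
    """Predicts typical duration and pitch values from phoneme embeddings."""
    def __init__(self, d_model=256):
        super().__init__()
        self.duration_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))
        self.pitch_head = nn.Sequential(nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, phone_emb):                 # (B, T, d_model)
        dur = self.duration_head(phone_emb).squeeze(-1)   # (B, T) duration targets
        f0 = self.pitch_head(phone_emb).squeeze(-1)       # (B, T) pitch targets
        return dur, f0

class ConversionModel(nn.Module):
    """Generates mel frames from phoneme embeddings, prosody features and a DSE."""
    def __init__(self, d_model=256, d_spk=256, n_mels=80):
        super().__init__()
        self.decoder = nn.GRU(d_model + 2 + d_spk, d_model, batch_first=True)
        self.out = nn.Linear(d_model, n_mels)

    def forward(self, phone_emb, dur, f0, spk_emb):
        spk = spk_emb.unsqueeze(1).expand(-1, phone_emb.size(1), -1)
        x = torch.cat([phone_emb, dur.unsqueeze(-1), f0.unsqueeze(-1), spk], dim=-1)
        h, _ = self.decoder(x)
        return self.out(h)                        # (B, T, n_mels)

# Toy forward pass with random tensors (batch of 2, 100 frames).
enc, corr, conv = SpeechEncoder(), ProsodyCorrector(), ConversionModel()
mels = torch.randn(2, 100, 80)
dse = torch.randn(2, 256)                         # target deep speaker embedding
phones = enc(mels)
dur, f0 = corr(phones)
converted = conv(phones, dur, f0, dse)
print(converted.shape)                            # torch.Size([2, 100, 80])
```

In practice the DSE would come from a speaker-verification-style encoder or be refined by speaker adaptation, and the decoder output would be passed to a vocoder; the sketch only shows how the three stages connect.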
Related papers
- SKQVC: One-Shot Voice Conversion by K-Means Quantization with Self-Supervised Speech Representations [12.423959479216895]
One-shot voice conversion (VC) is a method that enables the transformation between any two speakers using only a single target speaker utterance.
Recent works utilizing K-means quantization (KQ) with self-supervised learning (SSL) features have proven capable of capturing content information from speech.
We propose a simple yet effective one-shot VC model that utilizes the characteristics of SSL features and speech attributes.
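As background on the K-means quantization (KQ) step mentioned above, here is a minimal sketch (not the paper's code) of snapping SSL frame features to their nearest cluster centroids, which discards much of the speaker-specific detail; the feature dimensionality, cluster count, and random placeholder features are assumptions.

```python
# Minimal sketch: K-means quantization (KQ) of self-supervised (SSL) frame features.
# Replacing each frame feature with its nearest centroid retains mostly content
# information; the feature source and number of clusters are assumptions.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
ssl_features = rng.normal(size=(5000, 768))   # stand-in for HuBERT-like frame features

kmeans = KMeans(n_clusters=100, n_init=10, random_state=0).fit(ssl_features)

utterance = rng.normal(size=(200, 768))       # frames of one utterance
ids = kmeans.predict(utterance)               # discrete content tokens, shape (200,)
quantized = kmeans.cluster_centers_[ids]      # continuous quantized features, (200, 768)
```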
arXiv Detail & Related papers (2024-11-25T07:14:26Z)
- CoLM-DSR: Leveraging Neural Codec Language Modeling for Multi-Modal Dysarthric Speech Reconstruction [61.067153685104394]
Dysarthric speech reconstruction (DSR) aims to transform dysarthric speech into normal speech.
It still suffers from low speaker similarity and poor prosody naturalness.
We propose a multi-modal DSR model by leveraging neural language modeling to improve the reconstruction results.
arXiv Detail & Related papers (2024-06-12T15:42:21Z)
- Self-Supervised Speech Representations Preserve Speech Characteristics while Anonymizing Voices [15.136348385992047]
We train several voice conversion models using self-supervised speech representations.
Converted voices retain a low word error rate, within 1% of that of the original voice.
Experiments on dysarthric speech data show that speech features relevant to articulation, prosody, phonation and phonology can be extracted from anonymized voices.
arXiv Detail & Related papers (2022-04-04T17:48:01Z)
- Speaker Identity Preservation in Dysarthric Speech Reconstruction by Adversarial Speaker Adaptation [59.41186714127256]
Dysarthric speech reconstruction (DSR) aims to improve the quality of dysarthric speech.
Speaker encoder (SE) optimized for speaker verification has been explored to control the speaker identity.
We propose a novel multi-task learning strategy, i.e., adversarial speaker adaptation (ASA).
arXiv Detail & Related papers (2022-02-18T08:59:36Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
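A minimal sketch of what target-speaker-dependent prosody prediction from linguistic representations could look like, assuming a generic feed-forward predictor and hypothetical dimensions (not the TTP implementation):

```python
# Minimal sketch in the spirit of target text prediction (TTP): predict duration
# and F0 from linguistic representations plus a target-speaker embedding.
# Dimensions and layer choices are illustrative assumptions.
import torch
import torch.nn as nn

class TargetProsodyPredictor(nn.Module):
    def __init__(self, d_ling=256, d_spk=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_ling + d_spk, 256), nn.ReLU(), nn.Linear(256, 2))

    def forward(self, ling, spk_emb):            # ling: (B, T, d_ling), spk_emb: (B, d_spk)
        spk = spk_emb.unsqueeze(1).expand(-1, ling.size(1), -1)
        dur_f0 = self.net(torch.cat([ling, spk], dim=-1))
        return dur_f0[..., 0], dur_f0[..., 1]    # per-token duration and F0

pred = TargetProsodyPredictor()
dur, f0 = pred(torch.randn(2, 50, 256), torch.randn(2, 128))
```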
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
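For context, here is a minimal sketch of a vector-quantization bottleneck for content encoding, where each content vector is snapped to its nearest codebook entry with a straight-through gradient; the codebook size and dimensions are assumptions, and the mutual-information term used by VQMIVC is omitted.

```python
# Minimal sketch of a vector-quantization (VQ) bottleneck for content encoding.
# Codebook size/dimension are assumptions; the MI-based decorrelation term is omitted.
import torch
import torch.nn as nn

class VQBottleneck(nn.Module):
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                  # (B, T, dim)
        flat = z.reshape(-1, z.size(-1))                   # (B*T, dim)
        d = torch.cdist(flat, self.codebook.weight)        # distances to all codes
        ids = d.argmin(dim=-1)                             # nearest code per frame
        q = self.codebook(ids).view_as(z)                  # quantized vectors
        q = z + (q - z).detach()                           # straight-through estimator
        return q, ids.view(z.shape[:-1])

vq = VQBottleneck()
q, ids = vq(torch.randn(2, 100, 64))
```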
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Adversarially learning disentangled speech representations for robust multi-factor voice conversion [39.91395314356084]
We propose a disentangled speech representation learning framework based on adversarial learning.
Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled.
Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors.
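A gradient reversal layer is one common way to implement adversarial disentanglement: a classifier is trained to recover a factor (e.g., speaker identity) from a representation, while reversed gradients push the encoder to remove that factor. The sketch below is a generic illustration under assumed dimensions, not the cited paper's exact framework.

```python
# Minimal sketch of adversarial disentanglement via a gradient reversal layer (GRL).
# A speaker classifier is trained on the content representation while reversed
# gradients drive the encoder to discard speaker information. This is a common
# recipe, not necessarily the setup of the cited paper.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lamb):
        ctx.lamb = lamb
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lamb * grad_out, None

content_encoder = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 256))
speaker_classifier = nn.Linear(256, 10)            # 10 hypothetical training speakers

feats = torch.randn(32, 80)                        # e.g. pooled acoustic features
spk_ids = torch.randint(0, 10, (32,))
content = content_encoder(feats)
logits = speaker_classifier(GradReverse.apply(content, 1.0))
adv_loss = nn.functional.cross_entropy(logits, spk_ids)
adv_loss.backward()                                # encoder receives reversed gradients
```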
arXiv Detail & Related papers (2021-01-30T08:29:55Z)
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)