Adversarially learning disentangled speech representations for robust
multi-factor voice conversion
- URL: http://arxiv.org/abs/2102.00184v2
- Date: Fri, 20 Aug 2021 07:20:00 GMT
- Title: Adversarially learning disentangled speech representations for robust
multi-factor voice conversion
- Authors: Jie Wang, Jingbei Li, Xintao Zhao, Zhiyong Wu, Shiyin Kang, Helen Meng
- Abstract summary: We propose a disentangled speech representation learning framework based on adversarial learning.
Four speech representations characterizing content, timbre, rhythm and pitch are extracted, and further disentangled.
Experimental results show that the proposed framework significantly improves the robustness of VC on multiple factors.
- Score: 39.91395314356084
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Factorizing speech into disentangled representations is vital for
achieving highly controllable style transfer in voice conversion (VC).
Conventional speech representation learning methods for VC factorize speech
only into speaker and content, offering no control over other prosody-related
factors. State-of-the-art methods that cover more speech factors rely on basic
disentanglement techniques such as random resampling and ad-hoc bottleneck
layer size adjustment, which makes it hard to ensure robust speech
representation disentanglement. To increase the robustness of highly
controllable multi-factor style transfer in VC, we propose a disentangled
speech representation learning framework based on adversarial learning. Four
speech representations characterizing content, timbre, rhythm and pitch are
extracted and further disentangled by an adversarial Mask-And-Predict (MAP)
network inspired by BERT. The adversarial network minimizes the correlations
between the speech representations by randomly masking one of the
representations and predicting it from the others. Experimental results show
that the proposed framework significantly improves the robustness of VC on
multiple factors, increasing the speech quality MOS from 2.79 to 3.30 and
decreasing the MCD from 3.89 to 3.58.
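As an illustration of the mask-and-predict idea described above, here is a minimal, hypothetical PyTorch sketch of an adversarial MAP objective. The module layout, the pooled per-utterance factor vectors, the shared dimensionality and the MSE-based losses are assumptions made for brevity; they are not the authors' implementation, which operates on its own learned representations and architecture.

```python
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class MAPPredictor(nn.Module):
    """Predicts a masked factor from the concatenation of the other three."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 * dim, 256),
            nn.ReLU(),
            nn.Linear(256, dim),
        )

    def forward(self, kept):
        # kept: list of three (batch, dim) tensors for the unmasked factors
        return self.net(torch.cat(kept, dim=-1))


def map_losses(factors, predictor):
    """factors: dict mapping 'content', 'timbre', 'rhythm', 'pitch' to
    (batch, dim) tensors produced by the corresponding encoders."""
    names = list(factors)
    masked = random.choice(names)                      # randomly mask one factor
    target = factors[masked]
    kept = [factors[n] for n in names if n != masked]
    pred = predictor(kept)

    # The MAP predictor is trained to reconstruct the masked factor ...
    predictor_loss = F.mse_loss(pred, target.detach())
    # ... while the encoders receive the opposite signal, so that no factor
    # can be inferred from the remaining ones, i.e. the correlations between
    # the representations are pushed down.
    encoder_adv_loss = -F.mse_loss(pred.detach(), target)
    return predictor_loss, encoder_adv_loss
```

In such a setup the predictor and the speech encoders would be updated in alternation, GAN-style: the predictor learns to reconstruct the masked factor, while the encoders learn to make that reconstruction fail.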
Related papers
- DM-Codec: Distilling Multimodal Representations for Speech Tokenization [11.433520275513803]
DM-Codec is a language model-guided distillation method that incorporates contextual information.
It significantly outperforms state-of-the-art speech tokenization models, reducing WER by up to 13.46%, WIL by 9.82%, and improving speech quality by 5.84% and intelligibility by 1.85% on the LibriSpeech benchmark dataset.
arXiv Detail & Related papers (2024-10-19T07:14:14Z) - Robust Multi-Modal Speech In-Painting: A Sequence-to-Sequence Approach [3.89476785897726]
We introduce and study a sequence-to-sequence (seq2seq) speech in-painting model that incorporates AV features.
Our approach extends AV speech in-painting techniques to scenarios where both audio and visual data may be jointly corrupted.
arXiv Detail & Related papers (2024-06-02T23:51:43Z) - TranSpeech: Speech-to-Speech Translation With Bilateral Perturbation [61.564874831498145]
TranSpeech is a speech-to-speech translation model with bilateral perturbation.
We establish a non-autoregressive S2ST technique, which repeatedly masks and predicts unit choices.
TranSpeech significantly improves inference latency, achieving a speedup of up to 21.4x over the autoregressive technique.
arXiv Detail & Related papers (2022-05-25T06:34:14Z) - Wav2vec-Switch: Contrastive Learning from Original-noisy Speech Pairs
for Robust Speech Recognition [52.71604809100364]
We propose wav2vec-Switch, a method to encode noise robustness into contextualized representations of speech.
Specifically, we feed original-noisy speech pairs simultaneously into the wav2vec 2.0 network.
In addition to the existing contrastive learning task, the quantized representations of the original and noisy speech are switched and used as additional prediction targets for each other.
arXiv Detail & Related papers (2021-10-11T00:08:48Z) - Comparing Supervised Models And Learned Speech Representations For
Classifying Intelligibility Of Disordered Speech On Selected Phrases [11.3463024120429]
We develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases.
We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases.
arXiv Detail & Related papers (2021-07-08T17:24:25Z) - VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised
Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a minimal VQ sketch is given after this list.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z) - Learning Explicit Prosody Models and Deep Speaker Embeddings for
Atypical Voice Conversion [60.808838088376675]
We propose a VC system with explicit prosodic modelling and deep speaker embedding learning.
A prosody corrector takes in phoneme embeddings to infer typical phoneme duration and pitch values.
A conversion model takes phoneme embeddings and typical prosody features as inputs to generate the converted speech.
arXiv Detail & Related papers (2020-11-03T13:08:53Z) - "Notic My Speech" -- Blending Speech Patterns With Multimedia [65.91370924641862]
We propose a view-temporal attention mechanism to model both the view dependence and the visemic importance in speech recognition and understanding.
Our proposed method outperformed the existing work by 4.99% in terms of the viseme error rate.
We show that there is a strong correlation between our model's understanding of multi-view speech and human perception.
arXiv Detail & Related papers (2020-06-12T06:51:55Z)
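For the VQMIVC entry above, the following is a minimal, hypothetical PyTorch sketch of vector-quantized content encoding with a straight-through estimator (the standard VQ-VAE formulation). The class name, codebook size, commitment weight and tensor shapes are illustrative assumptions rather than the paper's actual code, and the MI-based correlation objective is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Quantizes continuous content features to their nearest codebook entries."""
    def __init__(self, codebook_size: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment loss weight

    def forward(self, z_e):
        # z_e: (batch, frames, dim) continuous encoder output
        flat = z_e.reshape(-1, z_e.size(-1))
        # Squared Euclidean distance to every codebook entry.
        dist = (flat.pow(2).sum(1, keepdim=True)
                - 2 * flat @ self.codebook.weight.t()
                + self.codebook.weight.pow(2).sum(1))
        idx = dist.argmin(dim=1)
        z_q = self.codebook(idx).view_as(z_e)
        # Codebook loss + commitment loss.
        vq_loss = F.mse_loss(z_q, z_e.detach()) + self.beta * F.mse_loss(z_e, z_q.detach())
        # Straight-through estimator: copy gradients from z_q to z_e.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, vq_loss, idx.view(z_e.shape[:-1])
```

A content encoder would produce the continuous features z_e; the returned quantized features z_q (with the VQ loss added to the training objective) then serve as the discretized content representation.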