Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant
Environments
- URL: http://arxiv.org/abs/2106.08873v1
- Date: Wed, 16 Jun 2021 15:47:06 GMT
- Title: Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant
Environments
- Authors: Alejandro Mottini, Jaime Lorenzo-Trueba, Sri Vishnu Kumar Karlapati,
Thomas Drugman
- Abstract summary: Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder.
- Score: 76.98764900754111
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Voice Conversion (VC) is a technique that aims to transform the
non-linguistic information of a source utterance to change the perceived
identity of the speaker. While there is a rich literature on VC, most proposed
methods are trained and evaluated on clean speech recordings. However, many
acoustic environments are noisy and reverberant, severely restricting the
applicability of popular VC methods to such scenarios. To address this
limitation, we propose Voicy, a new VC framework particularly tailored for
noisy speech. Our method, which is inspired by the de-noising auto-encoders
framework, is comprised of four encoders (speaker, content, phonetic and
acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing
non-parallel zero-shot VC, an important requirement for any VC system that
needs to work on speakers not seen during training. We have validated our
approach using a noisy reverberant version of the LibriSpeech dataset.
Experimental results show that Voicy outperforms other tested VC techniques in
terms of naturalness and target speaker similarity in noisy reverberant
environments.
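
The abstract describes the architecture only at a high level; below is a minimal PyTorch-style sketch of that layout, a de-noising auto-encoder with four encoders (speaker, content, phonetic, acoustic-ASR) feeding one decoder that reconstructs a clean mel-spectrogram from a noisy-reverberant input. All layer types, sizes, and names here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class Voicy(nn.Module):
    """Sketch of the four-encoder / one-decoder layout (all sizes are assumptions)."""

    def __init__(self, n_mels=80, emb_dim=128):
        super().__init__()
        # Utterance-level speaker encoder (last GRU state used as speaker embedding).
        self.speaker_enc = nn.GRU(n_mels, emb_dim, batch_first=True)
        # Frame-level content, phonetic and acoustic-ASR encoders.
        self.content_enc = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.phonetic_enc = nn.GRU(n_mels, emb_dim, batch_first=True)
        self.acoustic_asr_enc = nn.GRU(n_mels, emb_dim, batch_first=True)
        # Decoder maps the concatenated embeddings back to a clean mel-spectrogram.
        self.decoder = nn.Sequential(
            nn.Linear(4 * emb_dim, 256), nn.ReLU(), nn.Linear(256, n_mels)
        )

    def forward(self, noisy_mel, speaker_ref_mel):
        # Speaker embedding from a reference utterance of the target speaker.
        _, spk = self.speaker_enc(speaker_ref_mel)             # (1, B, emb_dim)
        spk = spk[-1].unsqueeze(1).expand(-1, noisy_mel.size(1), -1)
        content, _ = self.content_enc(noisy_mel)
        phonetic, _ = self.phonetic_enc(noisy_mel)
        acoustic, _ = self.acoustic_asr_enc(noisy_mel)
        z = torch.cat([spk, content, phonetic, acoustic], dim=-1)
        return self.decoder(z)                                  # predicted clean mel


# De-noising auto-encoder style training step on dummy tensors.
model = Voicy()
noisy = torch.randn(2, 100, 80)   # noisy-reverberant input mel (batch, frames, bins)
clean = torch.randn(2, 100, 80)   # clean target mel of the same utterance
ref = torch.randn(2, 50, 80)      # reference utterance for the speaker encoder
loss = nn.functional.l1_loss(model(noisy, ref), clean)
loss.backward()
```

In this sketch the speaker encoder is fed a reference utterance of the (possibly unseen) target speaker, which is what makes the conversion zero-shot and non-parallel: the decoder only learns to reconstruct its own input utterance, so no paired recordings of the same sentence by different speakers are required.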
Related papers
- Discrete Unit based Masking for Improving Disentanglement in Voice Conversion [8.337649176647645]
We introduce a novel masking mechanism in the input before speaker encoding, masking discrete speech units that correlate strongly with phoneme classes.
Our approach improves disentanglement and conversion performance across multiple VC methods, with 44% relative improvement in objective intelligibility.
arXiv Detail & Related papers (2024-09-17T21:17:59Z)
- Learning Speech Representation From Contrastive Token-Acoustic Pretraining [57.08426714676043]
We propose "Contrastive Token-Acoustic Pretraining (CTAP)", which uses two encoders to bring phoneme and speech into a joint multimodal space.
The proposed CTAP model is trained on 210k speech and phoneme pairs, achieving minimally-supervised TTS, VC, and ASR.
arXiv Detail & Related papers (2023-09-01T12:35:43Z)
- LA-VocE: Low-SNR Audio-visual Speech Enhancement using Neural Vocoders [53.30016986953206]
We propose LA-VocE, a new two-stage approach that predicts mel-spectrograms from noisy audio-visual speech via a transformer-based architecture.
We train and evaluate our framework on thousands of speakers and 11+ different languages, and study our model's ability to adapt to different levels of background noise and speech interference.
arXiv Detail & Related papers (2022-11-20T15:27:55Z)
- Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for the pitch and content encoders and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components.
Experiments on the VCTK dataset show that the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z)
- Robust Disentangled Variational Speech Representation Learning for Zero-shot Voice Conversion [34.139871476234205]
We investigate zero-shot voice conversion from a novel perspective of self-supervised disentangled speech representation learning.
A zero-shot voice conversion is performed by feeding an arbitrary speaker embedding and content embeddings to a sequential variational autoencoder (VAE) decoder.
On the TIMIT and VCTK datasets, we achieve state-of-the-art performance on both objective evaluation, i.e., speaker verification (SV) on the speaker and content embeddings, and subjective evaluation, i.e., voice naturalness and similarity, and the method remains robust even with noisy source/target utterances.
arXiv Detail & Related papers (2022-03-30T23:03:19Z)
- VCVTS: Multi-speaker Video-to-Speech synthesis via cross-modal knowledge transfer from voice conversion [77.50171525265056]
This paper proposes a novel multi-speaker Video-to-Speech (VTS) system based on cross-modal knowledge transfer from voice conversion (VC).
The Lip2Ind network can substitute for the VC content encoder, forming a multi-speaker VTS system that converts silent video into acoustic units for reconstructing accurate spoken content.
arXiv Detail & Related papers (2022-02-18T08:58:45Z)
- Zero-shot Voice Conversion via Self-supervised Prosody Representation Learning [1.9659095632676094]
Voice Conversion (VC) for unseen speakers, also known as zero-shot VC, is an attractive topic due to its usefulness in real use-case scenarios.
We propose a novel self-supervised approach to effectively learn the prosody characteristics.
We show improved performance compared to the state-of-the-art zero-shot VC models.
arXiv Detail & Related papers (2021-10-27T13:26:52Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a toy sketch of such a VQ bottleneck appears after this list.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- NoiseVC: Towards High Quality Zero-Shot Voice Conversion [2.3224617218247126]
NoiseVC is an approach that disentangles content based on VQ and Contrastive Predictive Coding (CPC).
We conduct several experiments and demonstrate that NoiseVC has a strong disentanglement ability with a small sacrifice of quality.
arXiv Detail & Related papers (2021-04-13T10:12:38Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with reconstruction loss only without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
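
Several of the entries above (VQMIVC, NoiseVC) rely on a vector-quantization bottleneck to extract speaker-independent content units. The toy sketch below shows only that VQ bottleneck with a straight-through estimator; the codebook size, dimensions, and loss weights are assumptions, and the mutual-information penalty that VQMIVC adds between content and speaker representations is omitted for brevity.

```python
import torch
import torch.nn as nn


class VQBottleneck(nn.Module):
    """Toy VQ content bottleneck with a straight-through estimator."""

    def __init__(self, num_codes=256, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta  # commitment loss weight (assumption)

    def forward(self, z):
        # z: (batch, frames, dim) content-encoder output.
        # Squared distance of every frame to every codebook entry.
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)
        idx = dist.argmin(dim=-1)          # discrete content units, (batch, frames)
        q = self.codebook(idx)             # quantized features
        # VQ-VAE style codebook + commitment losses.
        vq_loss = ((q - z.detach()) ** 2).mean() + self.beta * ((q.detach() - z) ** 2).mean()
        # Straight-through: copy gradients from q back to the encoder output z.
        q = z + (q - z).detach()
        return q, idx, vq_loss


vq = VQBottleneck()
content = torch.randn(2, 100, 64, requires_grad=True)   # dummy encoder output
quantized, units, loss = vq(content)
loss.backward()
print(quantized.shape, units.shape)   # torch.Size([2, 100, 64]) torch.Size([2, 100])
```

The argmin over codebook entries turns each frame into a discrete unit; the commitment term keeps the encoder output close to its chosen code, and the straight-through trick lets gradients reach the encoder despite the non-differentiable lookup.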