An Adaptive Learning based Generative Adversarial Network for One-To-One
Voice Conversion
- URL: http://arxiv.org/abs/2104.12159v1
- Date: Sun, 25 Apr 2021 13:44:32 GMT
- Title: An Adaptive Learning based Generative Adversarial Network for One-To-One
Voice Conversion
- Authors: Sandipan Dhar, Nanda Dulal Jana, Swagatam Das
- Abstract summary: We propose an adaptive learning-based GAN model called ALGAN-VC for an efficient one-to-one VC of speakers.
The model is tested on Voice Conversion Challenge (VCC) 2016, 2018, and 2020 datasets as well as on our self-prepared speech dataset.
A subjective and objective evaluation of the generated speech samples indicated that the proposed model performed the voice conversion task effectively.
- Score: 9.703390665821463
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Voice Conversion (VC) has emerged as a significant research domain in
speech synthesis in recent years, owing to its growing applications in
voice-assistant technology, automated movie dubbing, and speech-to-singing
conversion, to name a few. VC converts the vocal style of one speaker to that
of another while keeping the linguistic content unchanged. The VC task is
performed through a three-stage pipeline consisting of speech analysis, speech
feature mapping, and speech reconstruction. Generative Adversarial Network
(GAN) models are now widely used for speech feature mapping from source to
target speaker. In this paper, we propose an adaptive learning-based GAN model
called ALGAN-VC for efficient one-to-one VC between speakers. Our ALGAN-VC
framework incorporates several techniques to improve speech quality and voice
similarity between the source and target speakers. The model integrates a
Dense Residual Network (DRN)-like architecture into the generator network for
efficient learning of speech features during source-to-target feature
conversion. We also integrate an adaptive learning mechanism for computing the
loss function of the proposed model, and we use a boosted learning rate
approach to enhance its learning capability. The model is trained with both
forward and inverse mappings simultaneously for one-to-one VC. The proposed
model is tested on the Voice Conversion Challenge (VCC) 2016, 2018, and 2020
datasets, as well as on our self-prepared speech dataset recorded in Indian
regional languages and in English. Subjective and objective evaluations of the
generated speech samples indicate that the proposed model performs the voice
conversion task effectively, achieving high speaker similarity and adequate
speech quality.
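The abstract states that the model is trained with forward and inverse mappings simultaneously, but gives no loss formulas. The sketch below shows one common way such bidirectional training is scored, a CycleGAN-style L1 cycle-consistency loss; the function and generator names (`cycle_consistency_loss`, `g_xy`, `f_yx`) are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def cycle_consistency_loss(x, y, g_xy, f_yx):
    """L1 cycle loss for forward (x -> y -> x) and inverse (y -> x -> y) mappings."""
    x_cycled = f_yx(g_xy(x))   # map source features forward, then back
    y_cycled = g_xy(f_yx(y))   # map target features inversely, then back
    return np.abs(x - x_cycled).mean() + np.abs(y - y_cycled).mean()

# Toy linear "generators"; in the real model these would be DRN-like networks.
g_xy = lambda x: 2.0 * x       # source -> target feature map
f_yx = lambda y: 0.5 * y       # target -> source feature map (exact inverse here)

x = np.ones((4, 8))            # stand-in for source speech features
y = np.full((4, 8), 3.0)       # stand-in for target speech features
loss = cycle_consistency_loss(x, y, g_xy, f_yx)
print(loss)  # 0.0 here, because the toy mappings invert each other exactly
```

In practice this term is added to the adversarial losses of both generators, so that converted speech both fools the discriminators and can be mapped back to its original features.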
Related papers
- Non-autoregressive real-time Accent Conversion model with voice cloning [0.0]
We have developed a non-autoregressive model for real-time accent conversion with voice cloning.
The model generates native-sounding L1 speech with minimal latency based on input L2 speech.
The model can save, clone, and change the timbre, gender, and accent of a speaker's voice in real time.
arXiv Detail & Related papers (2024-05-21T19:07:26Z)
- Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer [59.57249127943914]
We present a multilingual Audio-Visual Speech Recognition model incorporating several enhancements to improve performance and audio noise robustness.
We increase the amount of audio-visual training data for six distinct languages, generating automatic transcriptions of unlabelled multilingual datasets.
Our proposed model achieves new state-of-the-art performance on the LRS3 dataset, reaching WER of 0.8%.
arXiv Detail & Related papers (2024-03-14T01:16:32Z)
- Cross-lingual Knowledge Distillation via Flow-based Voice Conversion for Robust Polyglot Text-To-Speech [6.243356997302935]
We introduce a framework for cross-lingual speech synthesis, which involves an upstream Voice Conversion (VC) model and a downstream Text-To-Speech (TTS) model.
In the first two stages, we use a VC model to convert utterances in the target locale to the voice of the target speaker.
In the third stage, the converted data is combined with the linguistic features and durations from recordings in the target language, which are then used to train a single-speaker acoustic model.
arXiv Detail & Related papers (2023-09-15T09:03:14Z)
- AudioPaLM: A Large Language Model That Can Speak and Listen [79.44757696533709]
We introduce AudioPaLM, a large language model for speech understanding and generation.
AudioPaLM fuses text-based and speech-based language models.
It can process and generate text and speech with applications including speech recognition and speech-to-speech translation.
arXiv Detail & Related papers (2023-06-22T14:37:54Z)
- Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation [11.336431583289382]
This paper presents a method for end-to-end cross-lingual text-to-speech.
It aims to preserve the target language's pronunciation regardless of the original speaker's language.
arXiv Detail & Related papers (2022-10-31T12:44:53Z)
- ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion [49.617722668505834]
We show that our approach permits the application of speech synthesis and voice conversion to improve ASR systems using only one target-language speaker during model training.
It is possible to obtain promising ASR training results with our data augmentation method using only a single real speaker in a target language.
arXiv Detail & Related papers (2022-03-29T11:55:30Z)
- On Prosody Modeling for ASR+TTS based Voice Conversion [82.65378387724641]
In voice conversion, an approach showing promising results in the latest voice conversion challenge (VCC) 2020 is to first use an automatic speech recognition (ASR) model to transcribe the source speech into the underlying linguistic contents.
Such a paradigm, referred to as ASR+TTS, overlooks the modeling of prosody, which plays an important role in speech naturalness and conversion similarity.
We propose to directly predict prosody from the linguistic representation in a target-speaker-dependent manner, referred to as target text prediction (TTP).
arXiv Detail & Related papers (2021-07-20T13:30:23Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
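The summary above mentions vector quantization (VQ) for content encoding. As a rough illustration of that idea only (the function name, codebook, and data below are hypothetical, not VQMIVC's actual code), VQ snaps each continuous content vector to its nearest entry in a learned codebook, yielding discrete content codes:

```python
import numpy as np

def vector_quantize(z, codebook):
    """Map each content vector in z to its nearest codebook entry (Euclidean)."""
    # Pairwise squared distances, shape (num_vectors, num_codes).
    d = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = d.argmin(axis=1)       # index of the nearest code per vector
    return codebook[idx], idx    # quantized vectors and their discrete codes

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])   # toy 2-entry codebook
z = np.array([[0.1, -0.1], [0.9, 1.2]])         # toy continuous content vectors
q, idx = vector_quantize(z, codebook)
print(idx)  # [0 1]: each frame snaps to its nearest discrete content code
```

Discretizing the content stream this way helps strip speaker-dependent detail, which the MI term then further decorrelates from the other representations.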
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Many-to-Many Voice Transformer Network [55.17770019619078]
This paper proposes a voice conversion (VC) method based on a sequence-to-sequence (S2S) learning framework.
It enables simultaneous conversion of the voice characteristics, pitch contour, and duration of input speech.
arXiv Detail & Related papers (2020-05-18T04:02:08Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the accuracy of this information and is not responsible for any consequences of its use.