StarGAN-VC+ASR: StarGAN-based Non-Parallel Voice Conversion Regularized by Automatic Speech Recognition
- URL: http://arxiv.org/abs/2108.04395v1
- Date: Tue, 10 Aug 2021 01:18:31 GMT
- Authors: Shoki Sakamoto, Akira Taniguchi, Tadahiro Taniguchi, Hirokazu Kameoka
- Abstract summary: We propose the use of automatic speech recognition to assist model training.
We show that using our proposed method, StarGAN-VC can retain more linguistic information than vanilla StarGAN-VC.
- Score: 23.75478998795749
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Preserving the linguistic content of input speech is essential during voice
conversion (VC). The star generative adversarial network-based VC method
(StarGAN-VC) is a recently developed method that allows non-parallel
many-to-many VC. Although this method is powerful, it can fail to preserve the
linguistic content of input speech when the number of available training
samples is extremely small. To overcome this problem, we propose using
automatic speech recognition to assist model training and thereby improve
StarGAN-VC, especially in low-resource scenarios.
Experimental results show that using our proposed method, StarGAN-VC can
retain more linguistic information than vanilla StarGAN-VC.
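The abstract does not spell out the exact objective, so the following is only a minimal sketch of how an ASR-based regularizer can be attached to a GAN voice-conversion generator: a CTC loss ties the recognized content of the converted speech back to the source transcript. All names (generator, discriminator, asr_model) are hypothetical placeholders, and the CTC choice is an assumption, not the paper's confirmed formulation.

```python
import torch
import torch.nn.functional as F

def generator_loss(generator, discriminator, asr_model,
                   src_mel, tgt_domain, transcripts, lambda_asr=1.0):
    """Adversarial objective plus an ASR term that penalizes converted
    speech whose recognized content drifts from the source transcript."""
    converted = generator(src_mel, tgt_domain)                # convert to target speaker
    adv_loss = -discriminator(converted, tgt_domain).mean()   # fool the discriminator

    # ASR regularizer: CTC loss between the ASR output on converted speech
    # and the known source transcripts (lists of token ids).
    log_probs = asr_model(converted).log_softmax(dim=-1)      # assumed shape (T, B, vocab)
    input_lengths = torch.full((src_mel.size(0),), log_probs.size(0),
                               dtype=torch.long)
    target_lengths = torch.tensor([len(t) for t in transcripts])
    targets = torch.cat([torch.as_tensor(t) for t in transcripts])
    asr_loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths)

    return adv_loss + lambda_asr * asr_loss
```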
Related papers
- Language-agnostic Code-Switching in Sequence-To-Sequence Speech Recognition [62.997667081978825]
Code-Switching (CS) refers to the phenomenon of alternately using words and phrases from different languages.
We propose a simple yet effective data augmentation in which audio and corresponding labels of different source languages are concatenated.
We show that this augmentation can even improve the model's performance on inter-sentential language switches not seen during training by 5.03% WER.
arXiv Detail & Related papers (2022-10-17T12:15:57Z)
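A minimal sketch of the concatenation-style augmentation described in the entry above: utterances and transcripts from two languages are joined into one synthetic code-switched example. Function and field names are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def make_code_switched(ex_a, ex_b, silence_sec=0.1, sample_rate=16000):
    """Concatenate two (audio, transcript) pairs from different languages
    into one synthetic code-switched training example."""
    gap = np.zeros(int(silence_sec * sample_rate), dtype=np.float32)
    audio = np.concatenate([ex_a["audio"], gap, ex_b["audio"]])
    text = ex_a["text"] + " " + ex_b["text"]
    return {"audio": audio, "text": text}

# Usage: pair an English and a German utterance (dummy audio here).
en = {"audio": np.zeros(16000, dtype=np.float32), "text": "hello world"}
de = {"audio": np.zeros(16000, dtype=np.float32), "text": "guten tag"}
sample = make_code_switched(en, de)
```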
- HiFi-VC: High Quality ASR-Based Voice Conversion [0.0]
We propose a new any-to-any voice conversion pipeline.
Our approach uses automated speech recognition features, pitch tracking, and a state-of-the-art waveform prediction model.
arXiv Detail & Related papers (2022-03-31T10:45:32Z)
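The entry above describes a pipeline built from ASR features, pitch tracking, and a waveform predictor; the skeleton below shows one plausible way to wire such components together. Every module here is a placeholder, not the authors' implementation.

```python
import torch.nn as nn

class AnyToAnyVC(nn.Module):
    """ASR features + F0 + target-speaker embedding -> waveform decoder."""
    def __init__(self, asr_encoder, pitch_tracker, speaker_encoder, vocoder):
        super().__init__()
        self.asr_encoder = asr_encoder          # speaker-independent content features
        self.pitch_tracker = pitch_tracker      # frame-level F0 contour
        self.speaker_encoder = speaker_encoder  # embedding of the target speaker
        self.vocoder = vocoder                  # waveform prediction model

    def forward(self, source_wav, target_reference_wav):
        content = self.asr_encoder(source_wav)
        f0 = self.pitch_tracker(source_wav)
        speaker = self.speaker_encoder(target_reference_wav)
        return self.vocoder(content, f0, speaker)
```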
- Voice Conversion Can Improve ASR in Very Low-Resource Settings [32.170748231414365]
We study whether a VC system can be used cross-lingually to improve low-resource speech recognition.
We combine several recent techniques to design and train a practical VC system in English.
We find that when using a sensible amount of augmented data, speech recognition performance is improved in all four low-resource languages considered.
arXiv Detail & Related papers (2021-11-04T07:57:00Z)
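The recipe in the entry above, reduced to a few lines: convert each low-resource utterance into additional voices while keeping its transcript, so the ASR training set grows without new labels. `vc_convert` is a hypothetical stand-in for the trained VC system.

```python
def augment_with_vc(dataset, vc_convert, target_speakers):
    """Return the original (audio, text) pairs plus VC-converted copies;
    transcripts are unchanged, so the ASR targets stay valid."""
    augmented = list(dataset)
    for audio, text in dataset:
        for speaker in target_speakers:
            augmented.append((vc_convert(audio, speaker), text))
    return augmented
```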
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results demonstrate the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
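The content-encoding step named in the entry above can be illustrated with a standard vector-quantization bottleneck using the straight-through gradient trick; the paper's mutual-information penalty between the disentangled representations is not shown here, and the codebook sizes are arbitrary.

```python
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through gradient."""
    def __init__(self, num_codes=512, dim=64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z):                                 # z: (batch, frames, dim)
        distances = torch.cdist(z, self.codebook.weight)  # (batch, frames, codes)
        codes = self.codebook(distances.argmin(dim=-1))   # nearest code vectors
        # Quantized values on the forward pass, identity gradient backward.
        return z + (codes - z).detach()

vq = VectorQuantizer()
quantized = vq(torch.randn(8, 100, 64))                   # same shape as input
```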
- Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments [76.98764900754111]
Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker.
We propose Voicy, a new VC framework particularly tailored for noisy speech.
Our method, which is inspired by the denoising auto-encoder framework, comprises four encoders (speaker, content, phonetic, and acoustic-ASR) and one decoder.
arXiv Detail & Related papers (2021-06-16T15:47:06Z)
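A skeleton of the four-encoder/one-decoder layout the entry above describes, framed as a denoising auto-encoder: noisy-reverberant input in, clean converted speech out. Encoder and decoder internals are placeholders, not the authors' architecture.

```python
import torch.nn as nn

class Voicy(nn.Module):
    """Denoising-auto-encoder-style VC: four encoders feed one decoder."""
    def __init__(self, speaker_enc, content_enc, phonetic_enc, asr_enc, decoder):
        super().__init__()
        self.speaker_enc = speaker_enc    # target-speaker identity
        self.content_enc = content_enc    # what is being said
        self.phonetic_enc = phonetic_enc  # phonetic structure
        self.asr_enc = asr_enc            # acoustic-ASR features
        self.decoder = decoder            # reconstructs clean converted speech

    def forward(self, noisy_source, target_reference):
        return self.decoder(self.speaker_enc(target_reference),
                            self.content_enc(noisy_source),
                            self.phonetic_enc(noisy_source),
                            self.asr_enc(noisy_source))
```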
- StarGAN-ZSVC: Towards Zero-Shot Voice Conversion in Low-Resource Contexts [32.170748231414365]
To be useful in a wider range of contexts, voice conversion systems need to be trainable without access to parallel data.
This paper extends recent voice conversion models based on generative adversarial networks (GANs) to the zero-shot setting.
We show that real-time zero-shot voice conversion is possible even for a model trained on very little data.
arXiv Detail & Related papers (2021-05-31T18:21:28Z)
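One common way to make a GAN-based converter zero-shot, consistent with the entry above, is to condition the generator on a continuous speaker embedding instead of a fixed one-hot domain code; the sketch below assumes such a speaker encoder and is not the paper's exact architecture.

```python
import torch.nn as nn

class ZeroShotGenerator(nn.Module):
    """GAN generator conditioned on a continuous speaker embedding rather
    than a fixed one-hot domain code, so unseen speakers can be targeted."""
    def __init__(self, generator_core, speaker_encoder):
        super().__init__()
        self.generator_core = generator_core    # StarGAN-style generator body
        self.speaker_encoder = speaker_encoder  # reference audio -> embedding

    def forward(self, source_mel, target_reference_audio):
        speaker_embedding = self.speaker_encoder(target_reference_audio)
        return self.generator_core(source_mel, speaker_embedding)
```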
- DiffSVC: A Diffusion Probabilistic Model for Singing Voice Conversion [51.83469048737548]
We propose DiffSVC, an SVC system based on a denoising diffusion probabilistic model.
A denoising module is trained in DiffSVC, which takes the noise-corrupted mel spectrogram and the corresponding diffusion step as input to predict the added Gaussian noise.
Experiments show that DiffSVC achieves superior conversion performance, in terms of naturalness and voice similarity, compared with current state-of-the-art SVC approaches.
arXiv Detail & Related papers (2021-05-28T14:26:40Z)
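The denoising objective the entry above describes is the standard diffusion training step, sketched below: corrupt the mel spectrogram with Gaussian noise at a randomly sampled step, then regress the added noise. The noise schedule and conditioning details are simplified placeholders.

```python
import torch
import torch.nn.functional as F

def diffusion_training_step(denoiser, mel, condition, alphas_cumprod):
    """Corrupt a mel spectrogram at a random diffusion step and train the
    denoiser to predict the Gaussian noise that was added."""
    t = torch.randint(0, len(alphas_cumprod), (mel.size(0),))
    a_bar = alphas_cumprod[t].view(-1, 1, 1)       # cumulative noise level at step t
    noise = torch.randn_like(mel)
    noisy_mel = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    predicted_noise = denoiser(noisy_mel, t, condition)
    return F.mse_loss(predicted_noise, noise)
```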
- An Adaptive Learning based Generative Adversarial Network for One-To-One Voice Conversion [9.703390665821463]
We propose an adaptive learning-based GAN model called ALGAN-VC for an efficient one-to-one VC of speakers.
The model is tested on Voice Conversion Challenge (VCC) 2016, 2018, and 2020 datasets as well as on our self-prepared speech dataset.
A subjective and objective evaluation of the generated speech samples indicated that the proposed model performs the voice conversion task effectively.
arXiv Detail & Related papers (2021-04-25T13:44:32Z)
- Nonparallel Voice Conversion with Augmented Classifier Star Generative Adversarial Networks [41.87886753817764]
We previously proposed a method that allows for nonparallel voice conversion (VC) by using a variant of generative adversarial networks (GANs) called StarGAN.
The main features of our method, called StarGAN-VC, are as follows: First, it requires no parallel utterances, transcriptions, or time alignment procedures for speech generator training.
We describe three formulations of StarGAN, including a newly introduced variant called the "Augmented classifier StarGAN" (A-StarGAN), and compare them in a nonparallel VC task.
arXiv Detail & Related papers (2020-08-27T10:30:05Z)
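The "augmented classifier" named in the entry above can be read as a domain classifier with one extra class for generated speech; the sketch below illustrates that reading only and is not the paper's exact objective.

```python
import torch
import torch.nn.functional as F

def augmented_classifier_loss(classifier, real_mel, real_speaker_ids,
                              fake_mel, num_speakers):
    """Classify real speech into its speaker class and generated speech
    into one extra 'fake' class (num_speakers + 1 outputs in total)."""
    fake_class = torch.full((fake_mel.size(0),), num_speakers, dtype=torch.long)
    real_loss = F.cross_entropy(classifier(real_mel), real_speaker_ids)
    fake_loss = F.cross_entropy(classifier(fake_mel), fake_class)
    return real_loss + fake_loss
```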
- Building Low-Resource NER Models Using Non-Speaker Annotation [58.78968578460793]
Cross-lingual methods have had notable success in addressing low-resource NER.
We propose a complementary approach to building low-resource Named Entity Recognition (NER) models using "non-speaker" (NS) annotations.
We show that use of NS annotators produces results that are consistently on par or better than cross-lingual methods built on modern contextual representations.
arXiv Detail & Related papers (2020-06-17T03:24:38Z)
- VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net architecture [71.45920122349628]
Auto-encoder-based VC methods disentangle the speaker and the content in input speech without being given the speaker's identity.
We use the U-Net architecture within an auto-encoder-based VC system to improve audio quality.
arXiv Detail & Related papers (2020-06-07T14:01:16Z)
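A skeleton consistent with the entry above: the quantized bottleneck carries content, the quantization residual (pooled over time) serves as a speaker code, and a U-Net-style decoder reconstructs the converted spectrogram. Skip connections are omitted, and all shapes and modules are illustrative only.

```python
import torch.nn as nn

class VQAutoencoderVC(nn.Module):
    """Quantized codes carry content; the quantization residual, pooled
    over time, acts as a speaker representation."""
    def __init__(self, encoder, quantizer, decoder):
        super().__init__()
        self.encoder = encoder
        self.quantizer = quantizer
        self.decoder = decoder    # U-Net-style decoder (skips omitted here)

    def forward(self, source_mel, target_mel):
        # Assumed encoder output shape: (batch, frames, dim).
        content = self.quantizer(self.encoder(source_mel))
        target_enc = self.encoder(target_mel)
        speaker = (target_enc - self.quantizer(target_enc)).mean(dim=1, keepdim=True)
        return self.decoder(content + speaker)   # broadcast speaker over frames
```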