Towards General-Purpose Text-Instruction-Guided Voice Conversion
- URL: http://arxiv.org/abs/2309.14324v2
- Date: Tue, 16 Jan 2024 13:53:56 GMT
- Title: Towards General-Purpose Text-Instruction-Guided Voice Conversion
- Authors: Chun-Yi Kuan, Chen An Li, Tsu-Yuan Hsu, Tse-Yang Lin, Ho-Lam Chung,
Kai-Wei Chang, Shuo-yiin Chang, Hung-yi Lee
- Abstract summary: This paper introduces a novel voice conversion model, guided by text instructions such as "articulate slowly with a deep tone" or "speak in a cheerful boyish voice".
The proposed VC model is a neural codec language model which processes a sequence of discrete codes, resulting in the code sequence of converted speech.
- Score: 84.78206348045428
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper introduces a novel voice conversion (VC) model, guided by text
instructions such as "articulate slowly with a deep tone" or "speak in a
cheerful boyish voice". Unlike traditional methods that rely on reference
utterances to determine the attributes of the converted speech, our model adds
versatility and specificity to voice conversion. The proposed VC model is a
neural codec language model which processes a sequence of discrete codes,
resulting in the code sequence of converted speech. It utilizes text
instructions as style prompts to modify the prosody and emotional information
of the given speech. In contrast to previous approaches, which often rely on
employing separate encoders like prosody and content encoders to handle
different aspects of the source speech, our model handles various information
of speech in an end-to-end manner. Experiments have demonstrated the impressive
capabilities of our model in comprehending instructions and delivering
reasonable results.
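The abstract describes the model only at a high level, so the following is a minimal sketch of the codec-language-model formulation under stated assumptions: a decoder-only transformer whose prefix is the instruction text plus the source-speech codec tokens, autoregressively emitting converted-speech codes. All module names and hyperparameters here are hypothetical, not the authors' code.

```python
# Hypothetical sketch of an instruction-guided codec language model.
# Prefix = instruction tokens + source-speech codec tokens; the model
# predicts the converted-speech codec tokens with causal self-attention.
import torch
import torch.nn as nn

class InstructionVC(nn.Module):
    def __init__(self, n_text_tokens=32000, n_codec_tokens=1024, d_model=512):
        super().__init__()
        self.text_emb = nn.Embedding(n_text_tokens, d_model)
        self.code_emb = nn.Embedding(n_codec_tokens, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, n_codec_tokens)

    def forward(self, instr_ids, src_codes, tgt_codes):
        # Concatenate instruction, source codes, and (shifted) target codes.
        x = torch.cat([self.text_emb(instr_ids),
                       self.code_emb(src_codes),
                       self.code_emb(tgt_codes)], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(x.size(1)).to(x.device)
        h = self.backbone(x, mask=mask)               # causal self-attention
        return self.head(h[:, -tgt_codes.size(1):])   # logits over target span
```

At inference one would drop the ground-truth target codes, sample them token by token from the head's logits, and decode the sampled code sequence back to a waveform with the neural codec's decoder, consistent with the end-to-end handling the abstract describes.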
Related papers
- SelfVC: Voice Conversion With Iterative Refinement using Self Transformations [42.97689861071184]
SelfVC is a training strategy to improve a voice conversion model with self-synthesized examples.
We develop techniques to derive prosodic information from the audio signal and SSL representations to train predictive submodules in the synthesis model.
Our framework is trained without any text and achieves state-of-the-art results in zero-shot voice conversion on metrics evaluating naturalness, speaker similarity, and intelligibility of synthesized audio.
arXiv Detail & Related papers (2023-10-14T19:51:17Z)
- Generative Adversarial Training for Text-to-Speech Synthesis Based on Raw Phonetic Input and Explicit Prosody Modelling [0.36868085124383626]
We describe an end-to-end speech synthesis system that uses generative adversarial training.
We train our vocoder for raw phoneme-to-audio conversion, using explicit phonetic, pitch and duration modeling; a generic sketch of such an adversarial objective follows this entry.
arXiv Detail & Related papers (2023-10-14T18:15:51Z)
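The entry above centers on adversarial vocoder training. The sketch below shows least-squares GAN losses of the kind commonly used for such systems; the exact objective of this paper is not specified here, so treat this as a generic assumption, with `discriminator` and the audio tensors as hypothetical inputs.

```python
# Generic least-squares GAN losses for adversarial vocoder training
# (an assumed objective, not this paper's code).
import torch
import torch.nn.functional as F

def d_loss(discriminator, real_audio, fake_audio):
    # Push discriminator scores toward 1 on real audio and 0 on fakes.
    real = discriminator(real_audio)
    fake = discriminator(fake_audio.detach())   # no gradient into the generator
    return F.mse_loss(real, torch.ones_like(real)) + \
           F.mse_loss(fake, torch.zeros_like(fake))

def g_loss(discriminator, fake_audio):
    # Push fake scores toward 1 so the generator fools the discriminator.
    fake = discriminator(fake_audio)
    return F.mse_loss(fake, torch.ones_like(fake))
```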
- SpeechUT: Bridging Speech and Text with Hidden-Unit for Encoder-Decoder Based Speech-Text Pre-training [106.34112664893622]
We propose a unified-modal speech-unit-text pre-training model, SpeechUT, to connect the representations of a speech encoder and a text decoder with a shared unit encoder.
Our proposed SpeechUT is fine-tuned and evaluated on automatic speech recognition (ASR) and speech translation (ST) tasks.
arXiv Detail & Related papers (2022-10-07T17:57:45Z)
- Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic.
We employ random resampling for the pitch and content encoders and use the variational contrastive log-ratio upper bound (vCLUB) of mutual information to disentangle speech components; a generic sketch of this bound follows the entry.
Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z)
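The vCLUB bound mentioned above is the CLUB mutual-information estimator of Cheng et al.; below is a minimal, generic implementation with a Gaussian variational approximation q(y|x). It is my own sketch with hypothetical dimensions, not the paper's released code.

```python
# Sketch of the CLUB upper bound on mutual information I(x; y),
# used to disentangle speech components (generic, not the paper's code).
import torch
import torch.nn as nn

class CLUB(nn.Module):
    def __init__(self, x_dim, y_dim, hidden=256):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(),
                                    nn.Linear(hidden, y_dim))

    def log_q(self, x, y):
        # Gaussian log-likelihood of y under q(y|x), up to constants.
        mu, logvar = self.mu(x), self.logvar(x)
        return (-((y - mu) ** 2) / logvar.exp() - logvar).sum(-1)

    def mi_upper_bound(self, x, y):
        # Joint (positive) pairs minus shuffled (marginal) pairs;
        # the encoders are trained to minimize this quantity.
        pos = self.log_q(x, y)
        neg = self.log_q(x, y[torch.randperm(y.size(0))])
        return (pos - neg).mean()

    def learning_loss(self, x, y):
        # q(y|x) itself is fit by maximum likelihood on positive pairs.
        return -self.log_q(x, y).mean()
```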
- TGAVC: Improving Autoencoder Voice Conversion with Text-Guided and Adversarial Training [32.35100329067037]
A novel voice conversion framework named $\boldsymbol{T}$ext $\boldsymbol{G}$uided $\boldsymbol{A}$utoVC (TGAVC) is proposed.
Adversarial training is applied to eliminate the speaker identity information in the content embedding estimated from speech (one standard construction for this is sketched after the entry).
Experiments on the AIShell-3 dataset show that the proposed model outperforms AutoVC in terms of naturalness and similarity of converted speech.
arXiv Detail & Related papers (2022-08-08T10:33:36Z)
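One standard way to realize the adversarial training TGAVC describes is a gradient reversal layer in front of a speaker classifier; whether TGAVC uses exactly this construction is an assumption, and the names below are hypothetical.

```python
# Gradient reversal for adversarially removing speaker identity from a
# content embedding (standard pattern; not TGAVC's released code).
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad):
        return -grad   # flip the gradient flowing into the content encoder

def adversarial_speaker_loss(content_emb, speaker_ids, speaker_classifier):
    # The classifier learns to predict the speaker, while the reversed
    # gradient pushes the content encoder to discard speaker information.
    logits = speaker_classifier(GradReverse.apply(content_emb))
    return F.cross_entropy(logits, speaker_ids)
```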
- SpeechT5: Unified-Modal Encoder-Decoder Pre-training for Spoken Language Processing [77.4527868307914]
We propose a unified-modal SpeechT5 framework that explores the encoder-decoder pre-training for self-supervised speech/text representation learning.
The SpeechT5 framework consists of a shared encoder-decoder network and six modal-specific (speech/text) pre/post-nets.
To align the textual and speech information into a unified semantic space, we propose a cross-modal vector quantization method with random mixing-up to bridge speech and text.
arXiv Detail & Related papers (2021-10-14T07:59:27Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training; a generic VQ content-encoding sketch follows this entry.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
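VQMIVC's content encoder quantizes frame features against a learned codebook. Below is a minimal sketch of vector quantization with a straight-through gradient estimator, a generic construction rather than the authors' code; codebook size and dimensions are hypothetical.

```python
# Vector-quantized content encoding with a straight-through estimator
# (generic sketch, not VQMIVC's code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    def __init__(self, n_codes=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(n_codes, dim)
        self.beta = beta

    def forward(self, z):                          # z: (batch, time, dim)
        d = torch.cdist(z, self.codebook.weight)   # distances to all codes
        idx = d.argmin(-1)                         # nearest-code indices
        q = self.codebook(idx)
        # Codebook loss pulls codes toward encoder outputs; the commitment
        # term (scaled by beta) pulls encoder outputs toward chosen codes.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        q = z + (q - z).detach()                   # straight-through gradients
        return q, idx, loss
```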
- Bridging the Modality Gap for Speech-to-Text Translation [57.47099674461832]
End-to-end speech translation aims to translate speech in one language into text in another language in an end-to-end manner.
Most existing methods employ an encoder-decoder structure with a single encoder to learn acoustic representation and semantic information simultaneously.
We propose a Speech-to-Text Adaptation for Speech Translation model which aims to improve the end-to-end model performance by bridging the modality gap between speech and text.
arXiv Detail & Related papers (2020-10-28T12:33:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.