V2C: Visual Voice Cloning
- URL: http://arxiv.org/abs/2111.12890v1
- Date: Thu, 25 Nov 2021 03:35:18 GMT
- Title: V2C: Visual Voice Cloning
- Authors: Qi Chen, Yuanqing Li, Yuankai Qi, Jiaqiu Zhou, Mingkui Tan, Qi Wu
- Abstract summary: We propose a new task named Visual Voice Cloning (V2C).
V2C seeks to convert a paragraph of text to speech with both the desired voice, specified by a reference audio, and the desired emotion, specified by a reference video.
Our dataset contains 10,217 animated movie clips covering a large variety of genres.
- Score: 55.55301826567474
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Existing Voice Cloning (VC) tasks aim to convert a paragraph of text to speech with a desired voice specified by a reference audio. This has significantly boosted the development of artificial speech applications. However, there are also many scenarios that these VC tasks cannot capture well, such as movie dubbing, which requires the speech to carry emotions consistent with the movie plot. To fill this gap, in this work we propose a new task named Visual Voice Cloning (V2C), which seeks to convert a paragraph of text to speech with both a desired voice specified by a reference audio and a desired emotion specified by a reference video. To facilitate research in this field, we construct a dataset, V2C-Animation, and propose a strong baseline based on existing state-of-the-art (SoTA) VC techniques. Our dataset contains 10,217 animated movie clips covering a large variety of genres (e.g., Comedy, Fantasy) and emotions (e.g., happy, sad). We further design an evaluation metric, MCD-DTW-SL, which helps evaluate the similarity between ground-truth speech and the synthesised speech. Extensive experimental results show that even SoTA VC methods cannot generate satisfactory speech for our V2C task. We hope the proposed task, together with the constructed dataset and evaluation metric, will facilitate research in voice cloning and in the broader vision-and-language community.
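The abstract names MCD-DTW-SL as the evaluation metric but does not spell out its computation here. Below is a minimal Python sketch of an MCD-style metric with DTW alignment and a speech-length weighting, using librosa MFCCs as a stand-in for mel-cepstral coefficients; the exact features and length penalty used in the paper may differ.

```python
# Illustrative sketch only: MCD with DTW alignment plus a speech-length penalty.
# The precise weighting used for MCD-DTW-SL in the paper may differ.
import numpy as np
import librosa

def mel_cepstra(wav, sr=22050, n_mfcc=13):
    """Mel-cepstral features (frames, n_mfcc); drop the 0th (energy) coefficient."""
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=n_mfcc + 1).T
    return mfcc[:, 1:]

def mcd_dtw_sl(ref_wav, syn_wav, sr=22050, alpha=1.0):
    ref = mel_cepstra(ref_wav, sr)
    syn = mel_cepstra(syn_wav, sr)

    # Frame-pair MCD cost: (10 / ln 10) * sqrt(2 * ||c_ref - c_syn||^2)
    const = 10.0 / np.log(10.0) * np.sqrt(2.0)
    cost = const * np.sqrt(((ref[:, None, :] - syn[None, :, :]) ** 2).sum(-1))

    # Dynamic time warping over the cost matrix (O(T*S); fine for short clips)
    T, S = cost.shape
    acc = np.full((T + 1, S + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, T + 1):
        for j in range(1, S + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]
            )
    mcd_dtw = acc[T, S] / max(T, S)  # normalise by the longer sequence

    # Speech-length (SL) weighting: penalise duration mismatch (assumed form)
    sl_weight = 1.0 + alpha * abs(T - S) / max(T, S)
    return mcd_dtw * sl_weight
```

Usage is simply `mcd_dtw_sl(librosa.load(ref_path, sr=22050)[0], librosa.load(syn_path, sr=22050)[0])`; a lower score indicates the synthesised speech is closer to the ground truth in both spectral content and duration.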
Related papers
- Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion [5.483488375189695]
Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style.
Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that are well aligned with the speaker's voice identity, and (2) inadequate decoupling of content and speaker identity information from the audio input.
We present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations.
arXiv Detail & Related papers (2024-09-01T11:51:18Z)
- UnifySpeech: A Unified Framework for Zero-shot Text-to-Speech and Voice Conversion [63.346825713704625]
Text-to-speech (TTS) and voice conversion (VC) are two different tasks that aim to generate high-quality speech from different input modalities.
This paper proposes UnifySpeech, which brings TTS and VC into a unified framework for the first time.
arXiv Detail & Related papers (2023-01-10T06:06:57Z)
- Learning to Dub Movies via Hierarchical Prosody Models [167.6465354313349]
Given a piece of text, a video clip, and a reference audio, the movie dubbing task (also known as Visual Voice Cloning, V2C) aims to generate speech that matches the speaker's emotion presented in the video, using the desired speaker's voice as reference.
We propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modelling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene.
arXiv Detail & Related papers (2022-12-08T03:29:04Z)
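The entry above describes conditioning prosody on three visual levels (lip, face, scene). The following is a rough, illustrative PyTorch sketch of such a three-level fusion; the module names, feature dimensions, and the additive fusion scheme are assumptions for illustration, not the paper's actual architecture.

```python
# Illustrative only: hierarchical prosody fusion in the spirit of the description
# above (lip -> duration cues, face -> pitch/energy cues, scene -> global style).
# All module names and dimensions are assumptions, not the paper's design.
import torch
import torch.nn as nn

class HierarchicalProsodyFusion(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.lip_proj = nn.Linear(128, d_model)    # lip-motion features
        self.face_proj = nn.Linear(256, d_model)   # facial-emotion features
        self.scene_proj = nn.Linear(512, d_model)  # scene-level features
        self.duration_head = nn.Linear(d_model, 1)
        self.pitch_head = nn.Linear(d_model, 1)
        self.energy_head = nn.Linear(d_model, 1)

    def forward(self, text_enc, lip_feat, face_feat, scene_feat):
        # text_enc: (B, T, d_model); visual features: (B, T, dim) after alignment
        h = text_enc + self.lip_proj(lip_feat)
        duration = self.duration_head(h)           # timing driven by lip motion
        h = h + self.face_proj(face_feat)
        pitch = self.pitch_head(h)                 # emotion-driven pitch contour
        energy = self.energy_head(h)
        h = h + self.scene_proj(scene_feat).mean(dim=1, keepdim=True)  # global style
        return h, duration, pitch, energy
```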
- Decoupling Speaker-Independent Emotions for Voice Conversion Via Source-Filter Networks [14.55242023708204]
We propose a novel Source-Filter-based Emotional VC model (SFEVC) to achieve proper filtering of speaker-independent emotion features.
Our SFEVC model consists of multi-channel encoders, emotion separate encoders, and one decoder.
arXiv Detail & Related papers (2021-10-04T03:14:48Z)
- Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration [62.75234183218897]
We propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the speaker.
We generate the mel-spectrogram of the edited speech with a transformer-based decoder.
It outperforms a recent zero-shot TTS engine by a large margin.
arXiv Detail & Related papers (2021-09-12T04:17:53Z)
- Limited Data Emotional Voice Conversion Leveraging Text-to-Speech: Two-stage Sequence-to-Sequence Training [91.95855310211176]
Emotional voice conversion aims to change the emotional state of an utterance while preserving the linguistic content and speaker identity.
We propose a novel 2-stage training strategy for sequence-to-sequence emotional voice conversion with a limited amount of emotional speech data.
The proposed framework can perform both spectrum and prosody conversion and achieves significant improvement over the state-of-the-art baselines in both objective and subjective evaluation.
arXiv Detail & Related papers (2021-03-31T04:56:14Z)
- FragmentVC: Any-to-Any Voice Conversion by End-to-End Extracting and Fusing Fine-Grained Voice Fragments With Attention [66.77490220410249]
We propose FragmentVC, in which the latent phonetic structure of the utterance from the source speaker is obtained from Wav2Vec 2.0.
FragmentVC is able to extract fine-grained voice fragments from the target speaker utterance(s) and fuse them into the desired utterance.
This approach is trained with a reconstruction loss only, without any disentanglement considerations between content and speaker information.
arXiv Detail & Related papers (2020-10-27T09:21:03Z)
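As a rough illustration of the fragment-fusion idea described in the FragmentVC entry above, the sketch below cross-attends source-side phonetic features (e.g., Wav2Vec 2.0 outputs) over the target speaker's acoustic frames; dimensions and module names are assumptions, not the authors' implementation.

```python
# Illustrative only: cross-attention fusion of source phonetic content with
# target-speaker frames, in the spirit of FragmentVC. Assumed dimensions/names.
import torch
import torch.nn as nn

class FragmentFusion(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.to_mel = nn.Linear(d_model, 80)  # predict 80-bin mel-spectrogram frames

    def forward(self, src_phonetic, tgt_frames):
        # src_phonetic: (B, T_src, d_model)  content features of the source utterance
        # tgt_frames:   (B, T_tgt, d_model)  acoustic frames of the target speaker
        fused, _ = self.attn(query=src_phonetic, key=tgt_frames, value=tgt_frames)
        return self.to_mel(fused)

# Per the description above, such a model can be trained with a reconstruction
# loss against target mel-spectrograms, without explicit content/speaker
# disentanglement terms.
```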