Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
- URL: http://arxiv.org/abs/2309.09470v1
- Date: Mon, 18 Sep 2023 04:08:02 GMT
- Title: Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment
- Authors: Zheng-Yan Sheng, Yang Ai, Yan-Nian Chen, Zhen-Hua Ling
- Abstract summary: This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC).
To address this task, we propose a face-voice memory-based zero-shot FaceVC method.
We demonstrate the superiority of our proposed method on the zero-shot FaceVC task.
- Score: 33.55724004790504
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents a novel task, zero-shot voice conversion based on face
images (zero-shot FaceVC), which aims at converting the voice characteristics
of an utterance from any source speaker to a previously unseen target speaker,
solely relying on a single face image of the target speaker. To address this
task, we propose a face-voice memory-based zero-shot FaceVC method. This method
leverages a memory-based face-voice alignment module, in which slots act as the
bridge to align these two modalities, allowing for the capture of voice
characteristics from face images. A mixed supervision strategy is also
introduced to mitigate the long-standing issue of the inconsistency between
training and inference phases for voice conversion tasks. To obtain
speaker-independent content-related representations, we transfer the knowledge
from a pretrained zero-shot voice conversion model to our zero-shot FaceVC
model. Considering the differences between FaceVC and traditional voice
conversion tasks, systematic subjective and objective metrics are designed to
thoroughly evaluate the homogeneity, diversity and consistency of voice
characteristics controlled by face images. Through extensive experiments, we
demonstrate the superiority of our proposed method on the zero-shot FaceVC
task. Samples are presented on our demo website.
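The abstract describes the face-voice memory as slots that bridge the two modalities. As a pointer to what that could look like, here is a minimal sketch, assuming the memory is a learnable bank of paired face-key/voice-value slots: a face embedding attends over the face-side keys, and the resulting weights read a voice-characteristic vector out of the voice-side values. All class names, dimensions, and the dot-product attention are hypothetical stand-ins, not the authors' implementation; the mixed supervision strategy and the knowledge transfer from a pretrained voice conversion model are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FaceVoiceMemory(nn.Module):
    """Illustrative slot memory bridging face and voice embeddings.

    A face embedding queries a learnable bank of face-side keys; the
    attention weights over the slots are reused to read out a
    voice-characteristic vector from paired voice-side values.
    Hypothetical sketch only, not the authors' code.
    """

    def __init__(self, num_slots: int = 64, face_dim: int = 512, voice_dim: int = 256):
        super().__init__()
        self.face_keys = nn.Parameter(torch.randn(num_slots, face_dim))
        self.voice_values = nn.Parameter(torch.randn(num_slots, voice_dim))

    def forward(self, face_emb: torch.Tensor) -> torch.Tensor:
        # face_emb: (batch, face_dim); scaled dot-product attention over slots.
        attn = F.softmax(face_emb @ self.face_keys.t() / self.face_keys.size(1) ** 0.5, dim=-1)
        # Convex combination of voice-side slots -> (batch, voice_dim).
        return attn @ self.voice_values

if __name__ == "__main__":
    memory = FaceVoiceMemory()
    face = torch.randn(4, 512)      # a batch of face embeddings
    voice_style = memory(face)      # (4, 256) predicted voice characteristics
    print(voice_style.shape)
```

In a full system the predicted vector would condition a voice conversion decoder, with the voice-side slots supervised against real speaker embeddings during training.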
Related papers
- Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion [5.483488375189695]
Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style.
Previous work has two shortcomings: (1) difficulty obtaining facial embeddings that align well with the speaker's voice identity, and (2) inadequate decoupling of content and speaker identity information from the audio input.
We present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations.
arXiv Detail & Related papers (2024-09-01T11:51:18Z)
- Hear Your Face: Face-based voice conversion with F0 estimation [18.66502308601214]
We present a novel face-based voice conversion framework, derived solely from an individual's facial images.
Our framework demonstrates superior speech generation quality and the ability to align facial features with voice characteristics.
arXiv Detail & Related papers (2024-08-19T08:47:03Z)
- Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE).
We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- Zero-shot personalized lip-to-speech synthesis with face image based voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z)
- Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z)
- Controlled AutoEncoders to Generate Faces from Voices [30.062970046955577]
We propose a framework to morph a target face in response to a given voice in a way that facial features are implicitly guided by learned voice-face correlation.
We evaluate the framework on the VoxCeleb and VGGFace datasets through human-subject studies and face retrieval.
arXiv Detail & Related papers (2021-07-16T16:04:29Z)
- VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement.
We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training.
Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations; a minimal sketch of the VQ step appears after this list.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
- Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces.
We operate on raw face images, using only a single photo as an identity reference.
Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
- MakeItTalk: Speaker-Aware Talking-Head Animation [49.77977246535329]
We present a method that generates expressive talking heads from a single facial image with audio as the only input.
Based on this intermediate representation, our method is able to synthesize photorealistic videos of entire talking heads with full range of motion.
arXiv Detail & Related papers (2020-04-27T17:56:15Z)
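The VQMIVC entry above attributes its disentanglement to vector quantization of content features plus a mutual-information penalty between representations. The following is a minimal VQ-VAE-style quantizer with a straight-through gradient estimator, given as a hedged illustration of the VQ step only: the codebook size, commitment weight, and tensor shapes are assumptions, and VQMIVC's mutual-information estimator is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Minimal VQ layer with straight-through gradients (illustrative only)."""

    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta  # weight of the commitment term

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) continuous content features.
        dist = torch.cdist(z, self.codebook.weight.unsqueeze(0))  # (batch, time, num_codes)
        idx = dist.argmin(dim=-1)                                 # nearest code per frame
        z_q = self.codebook(idx)                                  # quantized features
        # VQ-VAE codebook and commitment losses, then a straight-through copy
        # so gradients flow back to the encoder through z.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()
        return z_q, idx, loss

if __name__ == "__main__":
    vq = VectorQuantizer()
    feats = torch.randn(2, 100, 64)   # 2 utterances, 100 frames of content features
    z_q, codes, vq_loss = vq(feats)
    print(z_q.shape, codes.shape, vq_loss.item())
```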