Related papers: Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion

URL: http://arxiv.org/abs/2409.00700v1
Date: Sun, 1 Sep 2024 11:51:18 GMT
Title: Seeing Your Speech Style: A Novel Zero-Shot Identity-Disentanglement Face-based Voice Conversion
Authors: Yan Rong, Li Liu
Abstract summary: Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the speaker's voice identity information, and (2) inadequacy in decoupling content and speaker identity information from the audio input. We present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations.
Score: 5.483488375189695
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Face-based Voice Conversion (FVC) is a novel task that leverages facial images to generate the target speaker's voice style. Previous work has two shortcomings: (1) suffering from obtaining facial embeddings that are well-aligned with the speaker's voice identity information, and (2) inadequacy in decoupling content and speaker identity information from the audio input. To address these issues, we present a novel FVC method, Identity-Disentanglement Face-based Voice Conversion (ID-FaceVC), which overcomes the above two limitations. More precisely, we propose an Identity-Aware Query-based Contrastive Learning (IAQ-CL) module to extract speaker-specific facial features, and a Mutual Information-based Dual Decoupling (MIDD) module to purify content features from audio, ensuring clear and high-quality voice conversion. Besides, unlike prior works, our method can accept either audio or text inputs, offering controllable speech generation with adjustable emotional tone and speed. Extensive experiments demonstrate that ID-FaceVC achieves state-of-the-art performance across various metrics, with qualitative and user study results confirming its effectiveness in naturalness, similarity, and diversity. Project website with audio samples and code can be found at https://id-facevc.github.io.

Related papers

Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge. We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
Mask-Free Audio-driven Talking Face Generation for Enhanced Visual Quality and Identity Preservation [54.52905471078152]
We propose a mask-free talking face generation approach while maintaining the 2D-based face editing task. We transform the input images to have closed mouths, using a two-step landmark-based approach trained in an unpaired manner.
arXiv Detail & Related papers (2025-07-28T16:03:36Z)
Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation [14.036076647627553]
Given a face image and text to speak, we generate talking face animation and its corresponding speech. We propose a novel framework, Face2VoiceSync, with several novel contributions. Experiments show Face2VoiceSync achieves both visual and audio state-of-the-art performance on a single 40GB GPU.
arXiv Detail & Related papers (2025-07-25T12:49:06Z)
MuteSwap: Visual-informed Silent Video Identity Conversion [18.395223784732806]
We introduce Silent Face-based Voice Conversion (SFVC). SFVC generates intelligible speech and converts identity using only visual cues. MuteSwap is a novel framework that employs contrastive learning to align cross-modality identities.
arXiv Detail & Related papers (2025-07-01T07:13:34Z)
Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis [52.25128289155576]
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS) where the voice can be generated from a face image. We aim to mitigate the following three challenges in face-driven TTS systems. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
arXiv Detail & Related papers (2025-05-25T04:43:17Z)
MAVFlow: Preserving Paralinguistic Elements with Conditional Flow Matching for Zero-Shot AV2AV Multilingual Translation [23.892686638994043]
We propose a conditional flow matching (CFM) model for zero-shot audio-visual translation. By leveraging multi-modal guidance with CFM, our model robustly preserves speaker-specific characteristics. We empirically demonstrate the inclusion of high-quality mel-spectrograms conditioned on facial information.
arXiv Detail & Related papers (2025-03-14T02:48:43Z)
FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process of people hearing speech, extracting meaningful cues, and creating dynamically audio-consistent talking faces from a single audio. Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency. We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
Cooperative Dual Attention for Audio-Visual Speech Enhancement with Facial Cues [80.53407593586411]
We focus on leveraging facial cues beyond the lip region for robust Audio-Visual Speech Enhancement (AVSE). We propose a Dual Attention Cooperative Framework, DualAVSE, to ignore speech-unrelated information, capture speech-related information with facial cues, and dynamically integrate it with the audio signal for AVSE.
arXiv Detail & Related papers (2023-11-24T04:30:31Z)
Face-Driven Zero-Shot Voice Conversion with Memory-based Face-Voice Alignment [33.55724004790504]
This paper presents a novel task, zero-shot voice conversion based on face images (zero-shot FaceVC). To address this task, we propose a face-voice memory-based zero-shot FaceVC method. We demonstrate the superiority of our proposed method on the zero-shot FaceVC task.
arXiv Detail & Related papers (2023-09-18T04:08:02Z)
Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text. This task has broad application prospects in the digital human industry but has not been technically achieved yet. We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
DeID-VC: Speaker De-identification via Zero-shot Pseudo Voice Conversion [0.0]
DeID-VC is a speaker de-identification system that converts a real speaker to pseudo speakers. With the help of PSG, DeID-VC can assign unique pseudo speakers at speaker level or even at utterance level.
arXiv Detail & Related papers (2022-09-09T21:13:08Z)
Speech Representation Disentanglement with Adversarial Mutual Information Learning for One-shot Voice Conversion [42.43123253495082]
One-shot voice conversion (VC) with only a single target speaker's speech for reference has become a hot research topic. We employ random resampling for pitch and content encoder and use the variational contrastive log-ratio upper bound of mutual information to disentangle speech components. Experiments on the VCTK dataset show the model achieves state-of-the-art performance for one-shot VC in terms of naturalness and intelligibility.
arXiv Detail & Related papers (2022-08-18T10:36:27Z)
V2C: Visual Voice Cloning [55.55301826567474]
We propose a new task named Visual Voice Cloning (V2C). V2C seeks to convert a paragraph of text to a speech with both desired voice specified by a reference audio and desired emotion specified by a reference video. Our dataset contains 10,217 animated movie clips covering a large variety of genres.
arXiv Detail & Related papers (2021-11-25T03:35:18Z)
Controlled AutoEncoders to Generate Faces from Voices [30.062970046955577]
We propose a framework to morph a target face in response to a given voice in a way that facial features are implicitly guided by learned voice-face correlation. We evaluate the framework on VoxCeleb and VGGFace datasets through human subjects and face retrieval.
arXiv Detail & Related papers (2021-07-16T16:04:29Z)
VQMIVC: Vector Quantization and Mutual Information-Based Unsupervised Speech Representation Disentanglement for One-shot Voice Conversion [54.29557210925752]
One-shot voice conversion can be effectively achieved by speech representation disentanglement. We employ vector quantization (VQ) for content encoding and introduce mutual information (MI) as the correlation metric during training. Experimental results reflect the superiority of the proposed method in learning effective disentangled speech representations.
arXiv Detail & Related papers (2021-06-18T13:50:38Z)
Pose-Controllable Talking Face Generation by Implicitly Modularized Audio-Visual Representation [96.66010515343106]
We propose a clean yet effective framework to generate pose-controllable talking faces. We operate on raw face images, using only a single photo as an identity reference. Our model has multiple advanced capabilities including extreme view robustness and talking face frontalization.
arXiv Detail & Related papers (2021-04-22T15:10:26Z)
VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers. Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video. It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)

This list is automatically generated from the titles and abstracts of the papers on this site.