Zero-shot personalized lip-to-speech synthesis with face image based
voice control
- URL: http://arxiv.org/abs/2305.14359v1
- Date: Tue, 9 May 2023 02:37:29 GMT
- Title: Zero-shot personalized lip-to-speech synthesis with face image based
voice control
- Authors: Zheng-Yan Sheng, Yang Ai, Zhen-Hua Ling
- Abstract summary: Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
- Score: 41.17483247506426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech
from talking face images, has witnessed significant progress with various
models and training strategies in a series of independent studies. However,
existing studies cannot achieve voice control under the zero-shot condition,
because extra speaker embeddings must be extracted from natural reference
speech, which is unavailable when only the silent video of an unseen speaker is
given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis
method, in which face images control speaker identities. A variational
autoencoder is adopted to disentangle the speaker identity and linguistic
content representations, which enables speaker embeddings to control the voice
characteristics of synthetic speech for unseen speakers. Furthermore, we
propose associated cross-modal representation learning to improve the
voice-control ability of face-based speaker embeddings (FSE). Extensive
experiments verify the effectiveness of the proposed method, whose synthetic
utterances are more natural and better match the personality of the input video
than those of the compared methods. To the best of our knowledge, this paper
makes the first attempt at zero-shot personalized Lip2Speech synthesis that uses
a face image rather than reference audio to control voice characteristics.
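The pipeline described in the abstract can be pictured with a short sketch: a variational content encoder for the lip frames, a face-based speaker embedding (FSE) branch, and a decoder that combines the two. The code below is a minimal illustrative sketch and not the authors' implementation; the module choices, feature dimensions, and the use of pre-extracted 512-d visual features are assumptions. In the actual method, the associated cross-modal representation learning would additionally tie the FSE to speaker representations derived from real speech.

```python
# Minimal sketch of a face-conditioned Lip2Speech model as described above.
# NOT the authors' implementation; modules, sizes, and inputs are assumptions.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """Encodes a silent lip-frame sequence into a variational content latent."""
    def __init__(self, frame_dim=512, latent_dim=128):
        super().__init__()
        self.rnn = nn.GRU(frame_dim, 256, batch_first=True)
        self.to_mu = nn.Linear(256, latent_dim)
        self.to_logvar = nn.Linear(256, latent_dim)

    def forward(self, lip_frames):                       # (B, T, frame_dim)
        h, _ = self.rnn(lip_frames)                      # (B, T, 256)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterize
        return z, mu, logvar

class FaceSpeakerEncoder(nn.Module):
    """Predicts a face-based speaker embedding (FSE) from a face image feature."""
    def __init__(self, face_dim=512, spk_dim=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(face_dim, 256), nn.ReLU(),
                                  nn.Linear(256, spk_dim))

    def forward(self, face_feat):                        # (B, face_dim)
        return self.proj(face_feat)                      # (B, spk_dim)

class MelDecoder(nn.Module):
    """Decodes content latents plus the speaker embedding into a mel-spectrogram."""
    def __init__(self, latent_dim=128, spk_dim=64, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(latent_dim + spk_dim, 256, batch_first=True)
        self.out = nn.Linear(256, n_mels)

    def forward(self, z, spk_emb):                       # z: (B, T, latent_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, z.size(1), -1)
        h, _ = self.rnn(torch.cat([z, spk], dim=-1))
        return self.out(h)                               # (B, T, n_mels)

# Toy forward pass: 2 silent videos of 75 frames with 512-d visual features.
lips, face = torch.randn(2, 75, 512), torch.randn(2, 512)
z, mu, logvar = ContentEncoder()(lips)
mel = MelDecoder()(z, FaceSpeakerEncoder()(face))
kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # VAE regularizer
print(mel.shape, kl.item())                              # torch.Size([2, 75, 80])
```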
Related papers
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- ACE-VC: Adaptive and Controllable Voice Conversion using Explicitly Disentangled Self-supervised Speech Representations [12.20522794248598]
We propose a zero-shot voice conversion method using speech representations trained with self-supervised learning.
We develop a multi-task model to decompose a speech utterance into features such as linguistic content, speaker characteristics, and speaking style.
Next, we develop a synthesis model with pitch and duration predictors that can effectively reconstruct the speech signal from its representation.
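As a rough illustration of this decompose-then-resynthesize pattern (not the ACE-VC code; the self-supervised feature size, the simple linear heads, and the omission of the duration predictor are assumptions):

```python
# Sketch: split self-supervised speech features into content / speaker / style,
# then resynthesize with a pitch predictor. Illustrative assumptions throughout.
import torch
import torch.nn as nn

class Disentangler(nn.Module):
    """Multi-task heads that split an SSL feature sequence into factors."""
    def __init__(self, feat_dim=768, content_dim=256, spk_dim=192, style_dim=32):
        super().__init__()
        self.content_head = nn.Linear(feat_dim, content_dim)   # frame-level
        self.speaker_head = nn.Linear(feat_dim, spk_dim)       # utterance-level
        self.style_head = nn.Linear(feat_dim, style_dim)       # utterance-level

    def forward(self, ssl_feats):                              # (B, T, feat_dim)
        content = self.content_head(ssl_feats)                 # (B, T, content_dim)
        speaker = self.speaker_head(ssl_feats.mean(dim=1))     # (B, spk_dim)
        style = self.style_head(ssl_feats.mean(dim=1))         # (B, style_dim)
        return content, speaker, style

class Synthesizer(nn.Module):
    """Predicts per-frame pitch and a mel-spectrogram from the factors."""
    def __init__(self, content_dim=256, spk_dim=192, style_dim=32, n_mels=80):
        super().__init__()
        in_dim = content_dim + spk_dim + style_dim
        self.pitch_predictor = nn.Linear(in_dim, 1)
        self.decoder = nn.GRU(in_dim + 1, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, content, speaker, style):
        T = content.size(1)
        cond = torch.cat([speaker, style], dim=-1).unsqueeze(1).expand(-1, T, -1)
        x = torch.cat([content, cond], dim=-1)
        pitch = self.pitch_predictor(x)                        # (B, T, 1)
        h, _ = self.decoder(torch.cat([x, pitch], dim=-1))
        return self.to_mel(h), pitch

# Voice-conversion sketch: content from the source, speaker/style from the target.
src, tgt = torch.randn(1, 200, 768), torch.randn(1, 180, 768)
dis, syn = Disentangler(), Synthesizer()
content_src, _, _ = dis(src)
_, speaker_tgt, style_tgt = dis(tgt)
mel, pitch = syn(content_src, speaker_tgt, style_tgt)
print(mel.shape, pitch.shape)                                  # (1, 200, 80) (1, 200, 1)
```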
arXiv Detail & Related papers (2023-02-16T08:10:41Z)
- Show Me Your Face, And I'll Tell You How You Speak [0.0]
We explore the task of lip to speech synthesis, i.e., learning to generate speech given only the lip movements of a speaker.
We present a novel method "Lip2Speech", with key design choices to achieve accurate lip to speech synthesis in unconstrained scenarios.
arXiv Detail & Related papers (2022-06-28T13:52:47Z)
- VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection [32.65865343643458]
Recent studies have shown impressive performance on synthesizing speech from silent talking face videos.
We introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video.
The proposed framework has the advantage of synthesizing speech with the correct content even when the silent talking face video of an unseen subject is given.
arXiv Detail & Related papers (2022-06-15T11:29:58Z)
- Audio-Visual Speech Codecs: Rethinking Audio-Visual Speech Enhancement by Re-Synthesis [67.73554826428762]
We propose a novel audio-visual speech enhancement framework for high-fidelity telecommunications in AR/VR.
Our approach leverages audio-visual speech cues to generate the codes of a neural speech codec, enabling efficient synthesis of clean, realistic speech from noisy signals.
arXiv Detail & Related papers (2022-03-31T17:57:10Z)
- VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of the lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)
- AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z)
- Facetron: Multi-speaker Face-to-Speech Model based on Cross-modal Latent Representations [22.14238843571225]
We propose an effective method to synthesize speaker-specific speech waveforms by conditioning on videos of an individual's face.
The linguistic features are extracted from lip movements using a lip-reading model, and the speaker characteristic features are predicted from face images.
We show the superiority of our proposed model over conventional methods in terms of both objective and subjective evaluation results.
arXiv Detail & Related papers (2021-07-26T07:36:02Z)
- VisualVoice: Audio-Visual Speech Separation with Cross-Modal Consistency [111.55430893354769]
Given a video, the goal is to extract the speech associated with a face in spite of simultaneous background sounds and/or other human speakers.
Our approach jointly learns audio-visual speech separation and cross-modal speaker embeddings from unlabeled video.
It yields state-of-the-art results on five benchmark datasets for audio-visual speech separation and enhancement.
arXiv Detail & Related papers (2021-01-08T18:25:24Z)
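For intuition, a mask-based separator conditioned on a visual speaker embedding, with a cross-modal consistency term, might look like the sketch below. This is an illustrative assumption in the spirit of the VisualVoice summary above, not the released architecture.

```python
# Sketch: audio-visual speech separation with a cross-modal speaker embedding
# consistency term. Illustrative assumptions, not the VisualVoice codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AVSeparator(nn.Module):
    def __init__(self, n_freq=257, face_dim=512, emb_dim=128):
        super().__init__()
        self.face_to_emb = nn.Linear(face_dim, emb_dim)        # visual speaker embedding
        self.mix_rnn = nn.GRU(n_freq + emb_dim, 256, batch_first=True)
        self.mask_head = nn.Linear(256, n_freq)                # per-frame magnitude mask
        self.audio_to_emb = nn.Linear(n_freq, emb_dim)         # audio speaker embedding

    def forward(self, mix_mag, face_feat):        # (B, T, n_freq), (B, face_dim)
        v_emb = self.face_to_emb(face_feat)                    # (B, emb_dim)
        cond = v_emb.unsqueeze(1).expand(-1, mix_mag.size(1), -1)
        h, _ = self.mix_rnn(torch.cat([mix_mag, cond], dim=-1))
        mask = torch.sigmoid(self.mask_head(h))
        sep_mag = mask * mix_mag                               # separated magnitudes
        a_emb = self.audio_to_emb(sep_mag.mean(dim=1))         # (B, emb_dim)
        # Cross-modal consistency: the separated voice and the target face
        # should map to similar speaker embeddings.
        consistency = 1 - F.cosine_similarity(a_emb, v_emb).mean()
        return sep_mag, consistency

sep, loss = AVSeparator()(torch.rand(2, 100, 257), torch.randn(2, 512))
print(sep.shape, loss.item())
```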