AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person
- URL: http://arxiv.org/abs/2108.04325v2
- Date: Wed, 11 Aug 2021 08:19:36 GMT
- Title: AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person
- Authors: Xinsheng Wang, Qicong Xie, Jihua Zhu, Lei Xie, Odette Scharenborg
- Abstract summary: We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
- Score: 21.126759304401627
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automatically generating videos in which synthesized speech is synchronized
with lip movements in a talking head has great potential in many human-computer
interaction scenarios. In this paper, we present an automatic method to
generate synchronized speech and talking-head videos on the basis of text and a
single face image of an arbitrary person as input. In contrast to previous
text-driven talking head generation methods, which can only synthesize the
voice of a specific person, the proposed method is capable of synthesizing
speech for any person, including persons not seen during the training stage. Specifically,
the proposed method decomposes the generation of synchronized speech and
talking head videos into two stages, i.e., a text-to-speech (TTS) stage and a
speech-driven talking head generation stage. The proposed TTS module is a
face-conditioned multi-speaker TTS model that gets the speaker identity
information from face images instead of speech, which allows us to synthesize a
personalized voice on the basis of the input face image. To generate the
talking head videos from the face images, a facial landmark-based method that
can predict both lip movements and head rotations is proposed. Extensive
experiments demonstrate that the proposed method is able to generate
synchronized speech and talking head videos for arbitrary persons and
non-persons. The synthesized speech is consistent with the given face, in that the
timbre of the voice matches the person's appearance in the image, and the
proposed landmark-based talking head method outperforms the state-of-the-art
landmark-based method in generating natural talking head videos.
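The two-stage pipeline described above (a face-conditioned multi-speaker TTS stage followed by a landmark-based, speech-driven talking-head stage) can be illustrated with a minimal PyTorch sketch. All module names, architectures, and dimensions below are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the two-stage AnyoneNet-style pipeline from the abstract:
# (1) face-conditioned multi-speaker TTS, (2) speech-driven landmark prediction.
# Module names, layer sizes, and interfaces are illustrative assumptions only.
import torch
import torch.nn as nn


class FaceSpeakerEncoder(nn.Module):
    """Maps a face image to a speaker-identity embedding (assumed 256-d)."""
    def __init__(self, emb_dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, emb_dim),
        )

    def forward(self, face: torch.Tensor) -> torch.Tensor:
        return self.backbone(face)


class FaceConditionedTTS(nn.Module):
    """Stage 1: multi-speaker TTS that takes phoneme IDs plus a face-derived
    speaker embedding and predicts a mel-spectrogram."""
    def __init__(self, n_phones: int = 100, emb_dim: int = 256, n_mels: int = 80):
        super().__init__()
        self.phone_emb = nn.Embedding(n_phones, emb_dim)
        self.decoder = nn.GRU(emb_dim * 2, 256, batch_first=True)
        self.to_mel = nn.Linear(256, n_mels)

    def forward(self, phones: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        x = self.phone_emb(phones)                            # (B, T, E)
        spk = spk_emb.unsqueeze(1).expand(-1, x.size(1), -1)  # broadcast speaker identity
        out, _ = self.decoder(torch.cat([x, spk], dim=-1))
        return self.to_mel(out)                               # (B, T, n_mels)


class SpeechToLandmarks(nn.Module):
    """Stage 2: predicts per-frame facial landmarks (lip motion and head
    rotation) from mel frames; a renderer would turn landmarks into video."""
    def __init__(self, n_mels: int = 80, n_landmarks: int = 68):
        super().__init__()
        self.rnn = nn.GRU(n_mels, 256, batch_first=True)
        self.to_xy = nn.Linear(256, n_landmarks * 2)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(mel)
        return self.to_xy(h).view(mel.size(0), mel.size(1), -1, 2)


if __name__ == "__main__":
    face = torch.randn(1, 3, 112, 112)           # single face image of an arbitrary person
    phones = torch.randint(0, 100, (1, 40))      # phoneme sequence derived from the input text
    spk_emb = FaceSpeakerEncoder()(face)         # identity taken from the face, not from speech
    mel = FaceConditionedTTS()(phones, spk_emb)  # stage 1: personalized speech
    landmarks = SpeechToLandmarks()(mel)         # stage 2: synchronized lip and head motion
    print(mel.shape, landmarks.shape)
```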
Related papers
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), which designs a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z)
- Zero-shot personalized lip-to-speech synthesis with face image based voice control [41.17483247506426]
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies.
We propose a zero-shot personalized Lip2Speech synthesis method, in which face images control speaker identities.
arXiv Detail & Related papers (2023-05-09T02:37:29Z)
- Imaginary Voice: Face-styled Diffusion Model for Text-to-Speech [33.01930038988336]
We introduce a face-styled diffusion text-to-speech (TTS) model within a unified framework, called Face-TTS.
We jointly train cross-modal biometrics and TTS models to preserve speaker identity between face images and generated speech segments.
Since the biometric information is extracted directly from the face image, our method does not require extra fine-tuning steps to generate speech from unseen and unheard speakers.
arXiv Detail & Related papers (2023-02-27T11:59:28Z)
- VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection [32.65865343643458]
Recent studies have shown impressive performance on synthesizing speech from silent talking face videos.
We introduce speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video.
The proposed framework has the advantage of synthesizing speech with the correct content even when given a silent talking face video of an unseen subject.
arXiv Detail & Related papers (2022-06-15T11:29:58Z)
- VisualTTS: TTS with Accurate Lip-Speech Synchronization for Automatic Voice Over [68.22776506861872]
We formulate a novel task to synthesize speech in sync with a silent pre-recorded video, denoted as automatic voice over (AVO).
A natural solution to AVO is to condition the speech rendering on the temporal progression of lip sequence in the video.
We propose a novel text-to-speech model that is conditioned on visual input, named VisualTTS, for accurate lip-speech synchronization.
arXiv Detail & Related papers (2021-10-07T11:25:25Z)
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [28.157431757281692]
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions.
Our framework consists of a speaker-independent stage and a speaker-specific stage.
Our algorithm achieves high-quality photo-realistic talking-head videos including various facial expressions and head motions according to speech rhythms.
arXiv Detail & Related papers (2021-04-16T09:44:12Z)
- Generating coherent spontaneous speech and gesture from text [21.90157862281996]
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements).
Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data.
We put these two state-of-the-art technologies together in a coherent fashion for the first time.
arXiv Detail & Related papers (2021-01-14T16:02:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information and is not responsible for any consequences arising from its use.