Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- URL: http://arxiv.org/abs/2405.10272v1
- Date: Thu, 16 May 2024 17:29:37 GMT
- Title: Faces that Speak: Jointly Synthesising Talking Face and Speech from Text
- Authors: Youngjoon Jang, Ji-Hoon Kim, Junseok Ahn, Doyeop Kwak, Hong-Sun Yang, Yoon-Cheol Ju, Il-Hwan Kim, Byeong-Yeol Kim, Joon Son Chung
- Abstract summary: We generate natural talking faces and speech jointly from text by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework.
We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity.
Our experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text.
- Score: 22.87082439322244
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues, we introduce a motion sampler based on conditional flow matching, which is capable of high-quality motion code generation in an efficient way. Moreover, we introduce a novel conditioning method for the TTS system, which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge, this is the first effort to build a multimodal synthesis system that can generalise to unseen identities.
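The abstract describes the motion sampler only at a high level. As a rough, self-contained illustration of what a conditional flow-matching sampler can look like, the sketch below trains a velocity field on a straight noise-to-data path and generates a motion code with a few Euler steps. The network architecture, feature dimensions, and conditioning vector are assumptions made for illustration, not details taken from the paper.

```python
# Illustrative sketch of a conditional flow-matching motion sampler.
# Module names, dimensions, and the conditioning signal are hypothetical;
# this is not the authors' released implementation.
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Predicts the flow velocity v(x_t, t, c) for a motion-code vector x_t."""
    def __init__(self, motion_dim=128, cond_dim=256, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(motion_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, motion_dim),
        )

    def forward(self, x_t, t, cond):
        # t is a (batch, 1) tensor of interpolation times in [0, 1].
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def cfm_loss(model, x1, cond):
    """Conditional flow-matching objective with a linear (rectified-flow) path."""
    x0 = torch.randn_like(x1)                      # noise sample
    t = torch.rand(x1.size(0), 1, device=x1.device)
    x_t = (1 - t) * x0 + t * x1                    # point on the straight path
    target = x1 - x0                               # constant velocity of that path
    return ((model(x_t, t, cond) - target) ** 2).mean()

@torch.no_grad()
def sample_motion(model, cond, motion_dim=128, steps=10):
    """Few-step Euler integration of dx/dt = v(x, t, c) from noise to a motion code."""
    x = torch.randn(cond.size(0), motion_dim, device=cond.device)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((cond.size(0), 1), i * dt, device=cond.device)
        x = x + dt * model(x, t, cond)
    return x
```

In this sketch, `cond` stands in for whatever identity- or text-derived features drive the motion; because the learned path is straight, only a handful of integration steps are needed, which is the usual source of the efficiency that flow-matching samplers are credited with.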
Related papers
- Face2VoiceSync: Lightweight Face-Voice Consistency for Text-Driven Talking Face Generation [14.036076647627553]
Given a face image and text to speak, we generate the talking face animation and its corresponding speech. We propose a novel framework, Face2VoiceSync, with several novel contributions. Experiments show Face2VoiceSync achieves state-of-the-art visual and audio performance on a single 40GB GPU.
arXiv Detail & Related papers (2025-07-25T12:49:06Z) - Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis [52.25128289155576]
This paper explores multi-modal controllable Text-to-Speech synthesis (TTS), where the voice can be generated from a face image. We aim to mitigate three challenges in face-driven TTS systems. Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
arXiv Detail & Related papers (2025-05-25T04:43:17Z) - OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication [19.688375369516923]
We introduce an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios.
Our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
arXiv Detail & Related papers (2025-04-03T09:48:13Z) - Cosh-DiT: Co-Speech Gesture Video Synthesis via Hybrid Audio-Visual Diffusion Transformers [58.86974149731874]
Cosh-DiT is a co-speech gesture video synthesis system built on hybrid Diffusion Transformers.
We introduce an audio Diffusion Transformer to synthesize expressive gesture dynamics synchronized with speech rhythms.
For realistic video synthesis conditioned on the generated speech-driven motion, we design a visual Diffusion Transformer.
arXiv Detail & Related papers (2025-03-13T01:36:05Z) - EMO2: End-Effector Guided Audio-Driven Avatar Video Generation [17.816939983301474]
We propose a novel audio-driven talking head method capable of simultaneously generating highly expressive facial expressions and hand gestures.
In the first stage, we generate hand poses directly from audio input, leveraging the strong correlation between audio signals and hand movements.
In the second stage, we employ a diffusion model to synthesize video frames, incorporating the hand poses generated in the first stage to produce realistic facial expressions and body movements.
arXiv Detail & Related papers (2025-01-18T07:51:29Z) - Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D Talking Face Generation [28.71632683090641]
We propose an Audio-Visual Instruction system for expressive talking face generation.
Instead of directly learning facial movements from human speech, our two-stage strategy has LLMs first comprehend the audio information.
This two-stage process, coupled with the incorporation of LLMs, enhances model interpretability and provides users with flexibility to comprehend instructions.
arXiv Detail & Related papers (2024-02-25T15:51:05Z) - Realistic Speech-to-Face Generation with Speech-Conditioned Latent Diffusion Model with Face Prior [13.198105709331617]
We propose a novel speech-to-face generation framework, which leverages a Speech-Conditioned Latent Diffusion Model, called SCLDM.
This is the first work to harness the exceptional modeling capabilities of diffusion models for speech-to-face generation.
We show that our method can produce more realistic face images while preserving the identity of the speaker better than state-of-the-art methods.
arXiv Detail & Related papers (2023-10-05T07:44:49Z) - SpeechX: Neural Codec Language Model as a Versatile Speech Transformer [57.82364057872905]
SpeechX is a versatile speech generation model capable of zero-shot TTS and various speech transformation tasks.
Experimental results show SpeechX's efficacy in various tasks, including zero-shot TTS, noise suppression, target speaker extraction, speech removal, and speech editing with or without background noise.
arXiv Detail & Related papers (2023-08-14T01:01:19Z) - Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z) - Ada-TTA: Towards Adaptive High-Quality Text-to-Talking Avatar Synthesis [66.43223397997559]
We aim to synthesize high-quality talking portrait videos corresponding to the input text.
This task has broad application prospects in the digital human industry but has not been technically achieved yet.
We introduce Adaptive Text-to-Talking Avatar (Ada-TTA), built around a generic zero-shot multi-speaker Text-to-Speech model.
arXiv Detail & Related papers (2023-06-06T08:50:13Z) - VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation [91.39949385661379]
VioLA is a single auto-regressive Transformer decoder-only network that unifies various cross-modal tasks involving speech and text.
We first convert all the speech utterances to discrete tokens using an offline neural encoder.
We further integrate task IDs (TID) and language IDs (LID) into the proposed model to enhance the modeling capability of handling different languages and tasks.
arXiv Detail & Related papers (2023-05-25T14:39:47Z) - UniFLG: Unified Facial Landmark Generator from Text or Speech [5.405714165225471]
This paper proposes a unified facial landmark generator (UniFLG) for talking face generation.
The proposed system exploits end-to-end text-to-speech and feeds its representation to a landmark decoder to generate facial landmarks.
We demonstrate that our system can generate facial landmarks from the speech of speakers for whom no facial video data, or even no speech data, is available.
arXiv Detail & Related papers (2023-02-28T06:05:43Z) - GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis [68.42632589736881]
This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice.
GenerSpeech decomposes the speech variation into the style-agnostic and style-specific parts by introducing two components.
Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses the state-of-the-art models in terms of audio quality and style similarity.
arXiv Detail & Related papers (2022-05-15T08:16:02Z) - AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z) - Generating coherent spontaneous speech and gesture from text [21.90157862281996]
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gestures and head movements).
Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data.
We put these two state-of-the-art technologies together in a coherent fashion for the first time.
arXiv Detail & Related papers (2021-01-14T16:02:21Z)