UniFLG: Unified Facial Landmark Generator from Text or Speech
- URL: http://arxiv.org/abs/2302.14337v2
- Date: Fri, 19 May 2023 02:43:32 GMT
- Title: UniFLG: Unified Facial Landmark Generator from Text or Speech
- Authors: Kentaro Mitsui, Yukiya Hono, Kei Sawada
- Abstract summary: This paper proposes a unified facial landmark generator (UniFLG) for talking face generation.
The proposed system exploits end-to-end text-to-speech to extract a series of latent representations common to text and speech, and feeds them to a landmark decoder to generate facial landmarks.
We demonstrate that our system can generate facial landmarks from speech of speakers without facial video data or even speech data.
- Score: 5.405714165225471
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Talking face generation has been extensively investigated owing to its wide
applicability. The two primary frameworks used for talking face generation
comprise a text-driven framework, which generates synchronized speech and
talking faces from text, and a speech-driven framework, which generates talking
faces from speech. To integrate these frameworks, this paper proposes a unified
facial landmark generator (UniFLG). The proposed system exploits end-to-end
text-to-speech not only for synthesizing speech but also for extracting a
series of latent representations that are common to text and speech, and feeds
it to a landmark decoder to generate facial landmarks. We demonstrate that our
system achieves higher naturalness in both speech synthesis and facial landmark
generation compared to the state-of-the-art text-driven method. We further
demonstrate that our system can generate facial landmarks from speech of
speakers without facial video data or even speech data.
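To make the pipeline above concrete, here is a minimal PyTorch sketch of the text-driven path: an encoder maps phonemes to the latent sequence that text and speech are assumed to share, and a landmark decoder turns that sequence into per-frame facial landmarks. The module names, sizes, and GRU layers are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Maps phoneme IDs to the latent sequence assumed to be shared by text and speech."""
    def __init__(self, n_phonemes=80, d_latent=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_latent)
        self.rnn = nn.GRU(d_latent, d_latent, batch_first=True)

    def forward(self, phonemes):               # (B, T_text) integer phoneme IDs
        z, _ = self.rnn(self.embed(phonemes))  # (B, T_text, d_latent)
        return z

class LandmarkDecoder(nn.Module):
    """Decodes the shared latent sequence into 2D facial landmarks."""
    def __init__(self, d_latent=256, n_landmarks=68):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.rnn = nn.GRU(d_latent, d_latent, batch_first=True)
        self.proj = nn.Linear(d_latent, n_landmarks * 2)

    def forward(self, z):                      # (B, T, d_latent)
        h, _ = self.rnn(z)
        return self.proj(h).unflatten(-1, (self.n_landmarks, 2))  # (B, T, 68, 2)

# Text-driven path: phonemes -> shared latents -> landmarks.
encoder, decoder = TextEncoder(), LandmarkDecoder()
landmarks = decoder(encoder(torch.randint(0, 80, (1, 32))))
print(landmarks.shape)  # torch.Size([1, 32, 68, 2])
```

In the speech-driven path, the same decoder would consume latents extracted from audio by the TTS model's speech encoder, which is what lets the system cover speakers with no facial video data.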
Related papers
- Text2Lip: Progressive Lip-Synced Talking Face Generation from Text via Viseme-Guided Rendering [53.2204901422631]
Text2Lip is a viseme-centric framework that constructs an interpretable phonetic-visual bridge.
We show that Text2Lip outperforms existing approaches in semantic fidelity, visual realism, and modality robustness.
arXiv Detail & Related papers (2025-08-04T12:50:22Z)
- Revival with Voice: Multi-modal Controllable Text-to-Speech Synthesis [52.25128289155576]
This paper explores multi-modal controllable Text-to-Speech Synthesis (TTS), in which the voice can be generated from a face image.
We aim to mitigate three challenges in face-driven TTS systems.
Experimental results validate the proposed model's effectiveness in face-driven voice synthesis.
arXiv Detail & Related papers (2025-05-25T04:43:17Z)
- OmniTalker: Real-Time Text-Driven Talking Head Generation with In-Context Audio-Visual Style Replication [19.688375369516923]
We introduce an end-to-end unified framework that simultaneously generates synchronized speech and talking head videos from text and reference video in real-time zero-shot scenarios.
Our method surpasses existing approaches in generation quality, particularly excelling in style preservation and audio-video synchronization.
arXiv Detail & Related papers (2025-04-03T09:48:13Z)
- MoCha: Towards Movie-Grade Talking Character Synthesis [62.007000023747445]
We introduce Talking Characters, a more realistic task to generate talking character animations directly from speech and text.
Unlike talking head generation, Talking Characters aims to generate the full portrait of one or more characters beyond the facial region.
We propose MoCha, the first model of its kind to generate talking characters.
arXiv Detail & Related papers (2025-03-30T04:22:09Z)
- Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z)
- Faces that Speak: Jointly Synthesising Talking Face and Speech from Text [22.87082439322244]
We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework.
We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios, and (2) ensuring voice consistency despite variations in facial motion for the same identity.
Our experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text.
arXiv Detail & Related papers (2024-05-16T17:29:37Z)
- Face-StyleSpeech: Enhancing Zero-shot Speech Synthesis from Face Images with Improved Face-to-Speech Mapping [37.57813713418656]
We propose Face-StyleSpeech, a zero-shot Text-To-Speech model that generates natural speech conditioned on a face image.
We show that Face-StyleSpeech effectively generates more natural speech from a face image than baselines, even for unseen faces.
arXiv Detail & Related papers (2023-09-25T13:46:00Z)
- ContextSpeech: Expressive and Efficient Text-to-Speech for Paragraph Reading [65.88161811719353]
This work develops a lightweight yet effective Text-to-Speech system, ContextSpeech.
We first design a memory-cached recurrence mechanism to incorporate global text and speech context into sentence encoding.
We construct hierarchically-structured textual semantics to broaden the scope for global context enhancement.
Experiments show that ContextSpeech significantly improves the voice quality and prosody in paragraph reading with competitive model efficiency.
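A rough sketch of how such a memory-cached recurrence for paragraph-level context could look, read in a Transformer-XL-like way (the actual ContextSpeech design may differ): hidden states from previously encoded sentences are cached and attended to when encoding the current sentence.

```python
import torch
import torch.nn as nn

class MemoryCachedEncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4, mem_len=128):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.mem_len = mem_len
        self.memory = None  # cached states from previously encoded sentences

    def forward(self, x):  # x: (B, T, d_model), one sentence of the paragraph
        ctx = x if self.memory is None else torch.cat([self.memory, x], dim=1)
        h, _ = self.attn(x, ctx, ctx)        # queries also attend to the cached context
        x = self.norm1(x + h)
        x = self.norm2(x + self.ff(x))
        self.memory = ctx[:, -self.mem_len:].detach()  # keep the most recent states
        return x

layer = MemoryCachedEncoderLayer()
for sentence in torch.randn(3, 1, 20, 256):  # three consecutive sentences
    encoded = layer(sentence)                # (1, 20, 256), paragraph-aware encoding
```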
arXiv Detail & Related papers (2023-07-03T06:55:03Z)
- Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models [64.14812728562596]
We present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner.
We can easily generate face videos that articulate the provided textual sentences.
arXiv Detail & Related papers (2023-06-28T08:22:53Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
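A minimal sketch of that kind of fusion, assuming the listener's facial frames are pooled into a single conditioning vector added to the phoneme encoding; the pooling and feature sizes are hypothetical, not the VA-TTS baseline itself.

```python
import torch
import torch.nn as nn

class VisualAwareFusion(nn.Module):
    def __init__(self, n_phonemes=80, d_model=256, d_visual=512):
        super().__init__()
        self.phoneme_embed = nn.Embedding(n_phonemes, d_model)
        self.visual_proj = nn.Linear(d_visual, d_model)

    def forward(self, phonemes, listener_feats):
        # phonemes: (B, T) IDs; listener_feats: (B, T_video, d_visual) per-frame features
        listener = self.visual_proj(listener_feats.mean(dim=1))      # pooled listener state
        return self.phoneme_embed(phonemes) + listener.unsqueeze(1)  # (B, T, d_model)

cond = VisualAwareFusion()(torch.randint(0, 80, (1, 24)), torch.randn(1, 30, 512))
# `cond` would then drive a standard acoustic decoder (e.g., mel-spectrogram prediction).
```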
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Identity-Preserving Talking Face Generation with Landmark and Appearance Priors [106.79923577700345]
Existing person-generic methods have difficulty in generating realistic and lip-synced videos.
We propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures.
Our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods.
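The two-stage interface could be sketched as follows: an audio-to-landmark predictor followed by a renderer conditioned on the landmarks and a reference image. Module choices and shapes are placeholders, not the paper's networks.

```python
import torch
import torch.nn as nn

class AudioToLandmarks(nn.Module):
    """Stage 1: predict per-frame facial landmarks from audio features."""
    def __init__(self, d_audio=80, n_landmarks=68):
        super().__init__()
        self.n_landmarks = n_landmarks
        self.rnn = nn.GRU(d_audio, 256, batch_first=True)
        self.out = nn.Linear(256, n_landmarks * 2)

    def forward(self, mel):  # (B, T, 80) mel-spectrogram frames
        h, _ = self.rnn(mel)
        return self.out(h).unflatten(-1, (self.n_landmarks, 2))  # (B, T, 68, 2)

class LandmarksToVideo(nn.Module):
    """Stage 2: render frames conditioned on landmarks and a reference image."""
    def __init__(self, n_landmarks=68, image_size=64):
        super().__init__()
        d_image = 3 * image_size * image_size
        self.image_size = image_size
        self.net = nn.Sequential(nn.Linear(n_landmarks * 2 + d_image, 1024),
                                 nn.ReLU(), nn.Linear(1024, d_image))

    def forward(self, landmarks, reference):  # (B, T, 68, 2), (B, 3, H, W)
        B, T = landmarks.shape[:2]
        ref = reference.flatten(1).unsqueeze(1).expand(B, T, -1)
        x = torch.cat([landmarks.flatten(2), ref], dim=-1)
        return self.net(x).view(B, T, 3, self.image_size, self.image_size)

stage1, stage2 = AudioToLandmarks(), LandmarksToVideo()
video = stage2(stage1(torch.randn(1, 50, 80)), torch.randn(1, 3, 64, 64))
print(video.shape)  # torch.Size([1, 50, 3, 64, 64])
```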
arXiv Detail & Related papers (2023-05-15T01:31:32Z)
- VisageSynTalk: Unseen Speaker Video-to-Speech Synthesis via Speech-Visage Feature Selection [32.65865343643458]
Recent studies have shown impressive performance on synthesizing speech from silent talking face videos.
We introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video.
The proposed framework can synthesize speech with the correct content even when given a silent talking face video of an unseen subject.
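One way to picture the speech-visage selection is as two heads over the same visual features: a time-varying content stream and a pooled identity vector. The sketch below is an assumed reading, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SpeechVisageSelection(nn.Module):
    def __init__(self, d_visual=512, d_content=256, d_identity=128):
        super().__init__()
        self.content_head = nn.GRU(d_visual, d_content, batch_first=True)
        self.identity_head = nn.Linear(d_visual, d_identity)

    def forward(self, visual_feats):  # (B, T, d_visual) features of a face video
        content, _ = self.content_head(visual_feats)             # per-frame "what is said"
        identity = self.identity_head(visual_feats.mean(dim=1))  # time-invariant "who says it"
        return content, identity

content, identity = SpeechVisageSelection()(torch.randn(1, 75, 512))
# A speech decoder would consume `content` for intelligibility and `identity`
# for voice characteristics, which is what enables unseen-speaker synthesis.
```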
arXiv Detail & Related papers (2022-06-15T11:29:58Z)
- AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person [21.126759304401627]
We present an automatic method to generate synchronized speech and talking-head videos on the basis of text and a single face image of an arbitrary person as input.
Experiments demonstrate that the proposed method is able to generate synchronized speech and talking head videos for arbitrary persons and non-persons.
arXiv Detail & Related papers (2021-08-09T19:58:38Z)
- Direct speech-to-speech translation with discrete units [64.19830539866072]
We present a direct speech-to-speech translation (S2ST) model that translates speech from one language to speech in another language without relying on intermediate text generation.
We propose to predict the self-supervised discrete representations learned from an unlabeled speech corpus instead.
When target text transcripts are available, we design a multitask learning framework with joint speech and text training that enables the model to generate dual mode output (speech and text) simultaneously in the same inference pass.
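The discrete-unit target can be illustrated by quantizing frame-level self-supervised features against a k-means codebook; the feature dimensionality and codebook size below are assumptions.

```python
import torch

def quantize_to_units(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """features: (T, D) frame-level speech representations.
    codebook: (K, D) k-means centroids learned on unlabeled target-language speech.
    Returns a (T,) sequence of discrete unit IDs."""
    distances = torch.cdist(features, codebook)  # (T, K) pairwise L2 distances
    return distances.argmin(dim=1)               # nearest centroid per frame

features = torch.randn(200, 768)               # e.g., HuBERT-style frame features
codebook = torch.randn(100, 768)               # K = 100 centroids
units = quantize_to_units(features, codebook)  # (200,) unit sequence
# A sequence-to-sequence model predicts such target-language units from source
# speech, and a unit-to-waveform vocoder synthesizes the translated speech.
```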
arXiv Detail & Related papers (2021-07-12T17:40:43Z)
- Speech2Video: Cross-Modal Distillation for Speech to Video Generation [21.757776580641902]
Speech-to-video generation can spark interesting applications in the entertainment, customer service, and human-computer interaction industries.
The challenge mainly lies in disentangling the distinct visual attributes from audio signals.
We propose a light-weight, cross-modal distillation method to extract disentangled emotional and identity information from unlabelled video inputs.
arXiv Detail & Related papers (2021-07-10T10:27:26Z)
- Generating coherent spontaneous speech and gesture from text [21.90157862281996]
Embodied human communication encompasses both verbal (speech) and non-verbal information (e.g., gesture and head movements).
Recent advances in machine learning have substantially improved the technologies for generating synthetic versions of both of these types of data.
We put these two state-of-the-art technologies together in a coherent fashion for the first time.
arXiv Detail & Related papers (2021-01-14T16:02:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.