VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization
- URL: http://arxiv.org/abs/2412.09892v2
- Date: Wed, 18 Dec 2024 11:12:49 GMT
- Title: VQTalker: Towards Multilingual Talking Avatars through Facial Motion Tokenization
- Authors: Tao Liu, Ziyang Ma, Qi Chen, Feilong Chen, Shuai Fan, Xie Chen, Kai Yu
- Abstract summary: VQTalker is a Vector Quantization-based framework for multilingual talking head generation.
Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units.
VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios.
- Score: 20.728919218746363
- Abstract: We present VQTalker, a Vector Quantization-based framework for multilingual talking head generation that addresses the challenges of lip synchronization and natural motion across diverse languages. Our approach is grounded in the phonetic principle that human speech comprises a finite set of distinct sound units (phonemes) and corresponding visual articulations (visemes), which often share commonalities across languages. We introduce a facial motion tokenizer based on Group Residual Finite Scalar Quantization (GRFSQ), which creates a discretized representation of facial features. This method enables comprehensive capture of facial movements while improving generalization to multiple languages, even with limited training data. Building on this quantized representation, we implement a coarse-to-fine motion generation process that progressively refines facial animations. Extensive experiments demonstrate that VQTalker achieves state-of-the-art performance in both video-driven and speech-driven scenarios, particularly in multilingual settings. Notably, our method achieves high-quality results at a resolution of 512×512 pixels while maintaining a lower bitrate of approximately 11 kbps. Our work opens new possibilities for cross-lingual talking face generation. Synthetic results can be viewed at https://x-lance.github.io/VQTalker.
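To make the tokenization idea concrete, below is a minimal NumPy sketch of a group-residual finite scalar quantizer in the spirit of GRFSQ: motion features are bounded, split into groups, and each group's residual is re-quantized over several stages on progressively finer grids. The function names, group/stage counts, level count, and feature sizes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fsq(x, levels):
    """Finite scalar quantization: round values in [-1, 1] onto a uniform grid of `levels` points."""
    half = (levels - 1) / 2.0
    idx = np.round(np.clip(x, -1.0, 1.0) * half) + half   # integer codes in {0, ..., levels-1}
    return (idx - half) / half, idx.astype(int)            # (dequantized value, code)

def grfsq_encode(features, num_groups=4, num_stages=3, levels=5):
    """Sketch of a group-residual FSQ tokenizer (illustrative, not the authors' code):
    bound the features, split them into groups, and in each group quantize the
    residual left by the previous stage on a progressively finer grid."""
    x = np.tanh(features)                                  # bound motion features to (-1, 1)
    recon, tokens = [], []
    for group in np.array_split(x, num_groups):
        approx = np.zeros_like(group)
        scale = 1.0
        group_tokens = []
        for _ in range(num_stages):
            residual = group - approx
            deq, idx = fsq(residual / scale, levels)       # quantize the rescaled residual
            approx = approx + deq * scale
            scale /= (levels - 1)                          # next stage uses a finer step size
            group_tokens.append(idx)
        recon.append(approx)
        tokens.append(np.stack(group_tokens))
    return np.concatenate(recon), tokens

# Toy per-frame facial-motion feature (32 channels); sizes are illustrative only.
feat = np.random.default_rng(0).normal(size=32).astype(np.float32)
recon, tokens = grfsq_encode(feat)
print("mean |tanh(feat) - recon|:", float(np.abs(np.tanh(feat) - recon).mean()))
```

Under these toy settings each frame yields num_groups × num_stages × channels-per-group codes with log2(levels) bits each, roughly 220 bits per frame or about 5-6 kbps at 25 fps; the paper's reported ~11 kbps presumably reflects its own, different choices of groups, stages, and levels.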
Related papers
- Emotional Face-to-Speech [13.725558939494407]
Existing face-to-speech methods offer great promise in capturing identity characteristics but struggle to generate diverse vocal styles with emotional expression.
We introduce DEmoFace, a novel generative framework that leverages a discrete diffusion transformer (DiT) with curriculum learning.
We develop an enhanced predictor-free guidance to handle diverse conditioning scenarios, enabling multi-conditional generation and disentangling complex attributes effectively.
arXiv Detail & Related papers (2025-02-03T04:48:50Z) - GaussianSpeech: Audio-Driven Gaussian Avatars [76.10163891172192]
We introduce GaussianSpeech, a novel approach that synthesizes high-fidelity animation sequences of photo-realistic, personalized 3D human head avatars from spoken audio.
We propose a compact and efficient 3DGS-based avatar representation that generates expression-dependent color and leverages wrinkle- and perceptually-based losses to synthesize facial details.
arXiv Detail & Related papers (2024-11-27T18:54:08Z) - Speech2UnifiedExpressions: Synchronous Synthesis of Co-Speech Affective Face and Body Expressions from Affordable Inputs [67.27840327499625]
We present a multimodal learning-based method to simultaneously synthesize co-speech facial expressions and upper-body gestures for digital characters.
Our approach learns from sparse face landmarks and upper-body joints, estimated directly from video data, to generate plausible emotive character motions.
arXiv Detail & Related papers (2024-06-26T04:53:11Z) - FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create talking faces whose dynamics are consistent with a single audio input.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce the Controllable Coherent Frame generation, which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models.
arXiv Detail & Related papers (2024-03-04T09:59:48Z) - Can Language Models Learn to Listen? [96.01685069483025]
We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words.
Our approach autoregressively predicts a listener's response as a sequence of facial gestures quantized using a VQ-VAE.
We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study.
arXiv Detail & Related papers (2023-08-21T17:59:02Z) - PaLI-X: On Scaling up a Multilingual Vision and Language Model [166.9837904115951]
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and language model.
Our model achieves new levels of performance on a wide range of varied and complex tasks.
We observe emerging capabilities, such as complex counting and multilingual object detection, tasks that are not explicitly in the training mix.
arXiv Detail & Related papers (2023-05-29T18:58:38Z) - Language-Guided Face Animation by Recurrent StyleGAN-based Generator [87.56260982475564]
We study a novel task, language-guided face animation, that aims to animate a static face image with the help of languages.
We propose a recurrent motion generator that extracts a sequence of semantic and motion cues from the language and feeds them, together with visual information, to a pre-trained StyleGAN to generate high-quality frames.
arXiv Detail & Related papers (2022-08-11T02:57:30Z) - Talking Face Generation with Multilingual TTS [0.8229645116651871]
We propose a system combining a talking face generation system with a text-to-speech system.
Our system can synthesize natural multilingual speech while maintaining the vocal identity of the speaker.
For our demo, we add a translation API to the preprocessing stage and present it in the form of a neural dubber.
arXiv Detail & Related papers (2022-05-13T02:08:35Z) - Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple plausible corresponding listener motions.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z) - Robust One Shot Audio to Video Generation [10.957973845883162]
OneShotA2V is a novel approach that synthesizes a talking-person video of arbitrary length from an audio signal and a single unseen image of a person.
OneShotA2V leverages curriculum learning to learn movements of expressive facial components and hence generates a high-quality talking-head video of the given person.
arXiv Detail & Related papers (2020-12-14T10:50:05Z) - Multi Modal Adaptive Normalization for Audio to Video Generation [18.812696623555855]
We propose a multi-modal adaptive normalization (MAN) based architecture that synthesizes a talking-person video of arbitrary length from an audio signal and a single image of a person.
The architecture uses multi-modal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation map-based layers to learn the movements of expressive facial components.
arXiv Detail & Related papers (2020-12-14T07:39:45Z)
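As a rough illustration of the adaptive-normalization idea the entry above refers to, the sketch below normalizes a visual feature map per channel and then modulates it with a scale and shift predicted from an audio embedding. All shapes, weights, and names here are hypothetical placeholders rather than the paper's architecture.

```python
import numpy as np

def multimodal_adaptive_norm(visual_feat, audio_emb, w_gamma, w_beta, eps=1e-5):
    """Normalize visual features per channel, then modulate them with a
    scale/shift predicted from the audio embedding (illustrative sketch)."""
    mean = visual_feat.mean(axis=(1, 2), keepdims=True)     # per-channel mean over H, W
    std = visual_feat.std(axis=(1, 2), keepdims=True)       # per-channel std over H, W
    normalized = (visual_feat - mean) / (std + eps)
    gamma = 1.0 + audio_emb @ w_gamma                        # audio-driven scale, near 1 initially
    beta = audio_emb @ w_beta                                # audio-driven shift
    return normalized * gamma[:, None, None] + beta[:, None, None]

# Hypothetical shapes: 64 visual channels on an 8x8 grid, 128-dim audio embedding.
rng = np.random.default_rng(0)
vis = rng.normal(size=(64, 8, 8))
aud = rng.normal(size=128)
w_g = rng.normal(size=(128, 64)) * 0.01
w_b = rng.normal(size=(128, 64)) * 0.01
out = multimodal_adaptive_norm(vis, aud, w_g, w_b)
print(out.shape)  # (64, 8, 8)
```

Conditioning the affine parameters on audio, rather than learning them as fixed constants, is what lets the speech signal steer the visual features at every normalization layer.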
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.