AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D
Talking Face Generation
- URL: http://arxiv.org/abs/2402.16124v1
- Date: Sun, 25 Feb 2024 15:51:05 GMT
- Title: AVI-Talking: Learning Audio-Visual Instructions for Expressive 3D
Talking Face Generation
- Authors: Yasheng Sun, Wenqing Chu, Hang Zhou, Kaisiyuan Wang, Hideki Koike
- Abstract summary: We propose an Audio-Visual Instruction system for expressive Talking face generation.
Instead of directly learning facial movements from human speech, our two-stage strategy involves the LLMs first comprehending the audio and generating instructions that imply expressive facial details corresponding to the speech.
This two-stage process, coupled with the incorporation of LLMs, enhances model interpretability and provides users with flexibility to comprehend instructions.
- Score: 28.71632683090641
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While considerable progress has been made in achieving accurate lip
synchronization for 3D speech-driven talking face generation, the task of
incorporating expressive facial detail synthesis aligned with the speaker's
speaking status remains challenging. Our goal is to directly leverage the
inherent style information conveyed by human speech for generating an
expressive talking face that aligns with the speaking status. In this paper, we
propose AVI-Talking, an Audio-Visual Instruction system for expressive Talking
face generation. This system harnesses the robust contextual reasoning and
hallucination capability offered by Large Language Models (LLMs) to instruct
the realistic synthesis of 3D talking faces. Instead of directly learning
facial movements from human speech, our two-stage strategy involves the LLMs
first comprehending audio information and generating instructions implying
expressive facial details seamlessly corresponding to the speech. Subsequently,
a diffusion-based generative network executes these instructions. This
two-stage process, coupled with the incorporation of LLMs, enhances model
interpretability and provides users with flexibility to comprehend instructions
and specify desired operations or modifications. Extensive experiments showcase
the effectiveness of our approach in producing vivid talking faces with
expressive facial movements and consistent emotional status.
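To make the two-stage idea in the abstract concrete, below is a minimal, hypothetical Python sketch of the data flow: an LLM-based stage turns speech into a natural-language visual instruction, and a diffusion-based stage synthesizes 3D facial parameters conditioned on that instruction and the audio. All class and function names here are placeholders for illustration, not the authors' released implementation.
```python
# Sketch of the two-stage AVI-Talking pipeline described in the abstract.
# Every name below (AudioClip, comprehend_audio_with_llm,
# InstructionFollowingDiffusion, ...) is a hypothetical placeholder.

from dataclasses import dataclass
from typing import List


@dataclass
class AudioClip:
    """Raw speech waveform plus sampling rate (placeholder container)."""
    samples: List[float]
    sample_rate: int


def comprehend_audio_with_llm(audio: AudioClip) -> str:
    """Stage 1 (hypothetical): an LLM consumes audio-derived tokens and
    returns a natural-language instruction describing the expressive
    facial details implied by the speech."""
    # In practice this would call a speech encoder plus an LLM; a fixed
    # string is returned here purely to illustrate the interface.
    return "Raise the eyebrows slightly and smile with moderate intensity."


class InstructionFollowingDiffusion:
    """Stage 2 (hypothetical): a diffusion-based generator that produces a
    sequence of 3D face parameters conditioned on the instruction text and
    the original audio (for lip synchronization)."""

    def synthesize(self, instruction: str, audio: AudioClip,
                   num_frames: int) -> List[List[float]]:
        # Placeholder: return zeroed expression coefficients, one vector
        # per output frame, instead of running an actual denoising loop.
        return [[0.0] * 52 for _ in range(num_frames)]


def avi_talking_pipeline(audio: AudioClip, num_frames: int) -> List[List[float]]:
    """End-to-end sketch: audio -> LLM instruction -> diffusion synthesis."""
    instruction = comprehend_audio_with_llm(audio)                 # stage 1
    generator = InstructionFollowingDiffusion()
    return generator.synthesize(instruction, audio, num_frames)    # stage 2


if __name__ == "__main__":
    clip = AudioClip(samples=[0.0] * 16000, sample_rate=16000)
    frames = avi_talking_pipeline(clip, num_frames=30)
    print(f"Generated {len(frames)} frames of facial parameters.")
```
Because the intermediate instruction is plain text, a user can inspect or edit it before stage 2 runs, which is the interpretability and controllability benefit the abstract highlights.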
Related papers
- JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation [24.2065254076207]
We introduce a novel method for joint expression and audio-guided talking face generation.
Our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer.
arXiv Detail & Related papers (2024-09-18T17:18:13Z)
- GSmoothFace: Generalized Smooth Talking Face Generation via Fine Grained 3D Face Guidance [83.43852715997596]
GSmoothFace is a novel two-stage generalized talking face generation model guided by a fine-grained 3D face model.
It can synthesize smooth lip dynamics while preserving the speaker's identity.
Both quantitative and qualitative experiments confirm the superiority of our method in terms of realism, lip synchronization, and visual quality.
arXiv Detail & Related papers (2023-12-12T16:00:55Z)
- Personalized Speech-driven Expressive 3D Facial Animation Synthesis with Style Control [1.8540152959438578]
A realistic facial animation system should account for identity-specific speaking styles and facial idiosyncrasies to achieve a high degree of naturalness and plausibility.
We present a speech-driven expressive 3D facial animation synthesis framework that models identity-specific facial motion as latent representations (called styles).
Our framework is trained in an end-to-end fashion and has a non-autoregressive encoder-decoder architecture with three main components.
arXiv Detail & Related papers (2023-10-25T21:22:28Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It captures the complex one-to-many relationships between speech and 3D face based on diffusion.
At the same time, it achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- Visual-Aware Text-to-Speech [101.89332968344102]
We present a new visual-aware text-to-speech (VA-TTS) task to synthesize speech conditioned on both textual inputs and visual feedback of the listener in face-to-face communication.
We devise a baseline model to fuse phoneme linguistic information and listener visual signals for speech synthesis.
arXiv Detail & Related papers (2023-06-21T05:11:39Z)
- Parametric Implicit Face Representation for Audio-Driven Facial Reenactment [52.33618333954383]
We propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads.
Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models.
Our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers.
arXiv Detail & Related papers (2023-06-13T07:08:22Z)
- FaceXHuBERT: Text-less Speech-driven E(X)pressive 3D Facial Animation Synthesis Using Self-Supervised Speech Representation Learning [0.0]
FaceXHuBERT is a text-less speech-driven 3D facial animation generation method.
It is very robust to background noise and can handle audio recorded in a variety of situations.
Its results are judged superior in animation realism 78% of the time.
arXiv Detail & Related papers (2023-03-09T17:05:19Z)
- Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync with the input audio, without considering the actor's identity-specific speaking style and facial idiosyncrasies.
We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video.
We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
- Joint Audio-Text Model for Expressive Speech-Driven 3D Facial Animation [46.8780140220063]
We present a joint audio-text model to capture contextual information for expressive speech-driven 3D facial animation.
Our hypothesis is that the text features can disambiguate the variations in upper face expressions, which are not strongly correlated with the audio.
We show that the combined acoustic and textual modalities can synthesize realistic facial expressions while maintaining audio-lip synchronization.
arXiv Detail & Related papers (2021-12-04T01:37:22Z)
- Write-a-speaker: Text-based Emotional and Rhythmic Talking-head Generation [28.157431757281692]
We propose a text-based talking-head video generation framework that synthesizes high-fidelity facial expressions and head motions.
Our framework consists of a speaker-independent stage and a speaker-specific stage.
Our algorithm achieves high-quality photo-realistic talking-head videos including various facial expressions and head motions according to speech rhythms.
arXiv Detail & Related papers (2021-04-16T09:44:12Z)