JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation
- URL: http://arxiv.org/abs/2411.09209v3
- Date: Wed, 20 Nov 2024 02:56:02 GMT
- Title: JoyVASA: Portrait and Animal Image Animation with Diffusion-Based Audio-Driven Facial Dynamics and Head Motion Generation
- Authors: Xuyang Cao, Guoxin Wang, Sheng Shi, Jun Zhao, Yang Yao, Jintao Fei, Minyu Gao,
- Abstract summary: JoyVASA is a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation.
We introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations.
In the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity.
- Score: 10.003794924759765
- License:
- Abstract: Audio-driven portrait animation has made significant advances with diffusion-based models, improving video quality and lip-sync accuracy. However, the increasing complexity of these models has led to inefficiencies in training and inference, as well as constraints on video length and inter-frame continuity. In this paper, we propose JoyVASA, a diffusion-based method for generating facial dynamics and head motion in audio-driven facial animation. Specifically, in the first stage, we introduce a decoupled facial representation framework that separates dynamic facial expressions from static 3D facial representations. This decoupling allows the system to generate longer videos by combining any static 3D facial representation with dynamic motion sequences. Then, in the second stage, a diffusion transformer is trained to generate motion sequences directly from audio cues, independent of character identity. Finally, a generator trained in the first stage uses the 3D facial representation and the generated motion sequences as inputs to render high-quality animations. With the decoupled facial representation and the identity-independent motion generation process, JoyVASA extends beyond human portraits to animate animal faces seamlessly. The model is trained on a hybrid dataset of private Chinese and public English data, enabling multilingual support. Experimental results validate the effectiveness of our approach. Future work will focus on improving real-time performance and refining expression control, further expanding the applications in portrait animation. The code is available at: https://github.com/jdh-algo/JoyVASA.
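The two-stage design above lends itself to a simple inference flow: encode the reference image once into a static 3D facial representation, sample an identity-independent motion sequence from the audio-conditioned diffusion transformer, then render frames from the two. The Python sketch below illustrates that flow only; every class, shape, and the simplified denoising update are illustrative assumptions rather than the released JoyVASA implementation (see the linked GitHub repository for the actual code).

```python
"""Illustrative sketch of a JoyVASA-style two-stage inference pipeline.

All classes are hypothetical stand-ins for the components named in the
abstract (static face encoder, audio-conditioned motion diffusion
transformer, renderer); they are NOT the released JoyVASA code.
"""
import numpy as np


class StaticFaceEncoder:
    """Stage 1 (representation): extract a static 3D facial representation from one image."""

    def encode(self, reference_image: np.ndarray) -> np.ndarray:
        # Placeholder: a fixed-size appearance/identity feature vector.
        return np.zeros(256, dtype=np.float32)


class MotionDiffusionTransformer:
    """Stage 2: denoise a motion sequence conditioned on audio features only
    (identity-independent), following a DDPM-style reverse process."""

    def __init__(self, num_steps: int = 50, motion_dim: int = 63):
        self.num_steps = num_steps
        self.motion_dim = motion_dim

    def _predict_noise(self, noisy_motion, audio_features, step):
        # Placeholder for the transformer's per-step noise prediction.
        return np.zeros_like(noisy_motion)

    def sample(self, audio_features: np.ndarray) -> np.ndarray:
        num_frames = audio_features.shape[0]
        motion = np.random.randn(num_frames, self.motion_dim).astype(np.float32)
        for step in reversed(range(self.num_steps)):
            predicted_noise = self._predict_noise(motion, audio_features, step)
            motion = motion - predicted_noise  # simplified update; a real DDPM uses noise schedules
        return motion  # per-frame facial dynamics and head pose


class Renderer:
    """Stage 1 (generator): combine the static representation with generated motion."""

    def render(self, static_face, motion_sequence):
        return [f"frame_{t}" for t in range(len(motion_sequence))]


def animate(reference_image: np.ndarray, audio_features: np.ndarray):
    static_face = StaticFaceEncoder().encode(reference_image)
    motion = MotionDiffusionTransformer().sample(audio_features)  # audio only, no identity
    return Renderer().render(static_face, motion)


if __name__ == "__main__":
    image = np.zeros((512, 512, 3), dtype=np.uint8)   # portrait or animal face
    audio = np.zeros((100, 768), dtype=np.float32)    # e.g. 100 frames of speech features
    print(len(animate(image, audio)), "frames rendered")
```

Because the motion generator never sees the identity features, the same sampled motion sequence can in principle drive any static representation, which is what allows longer videos and animal faces in the abstract's framing.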
Related papers
- MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead.
MMHead consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations.
Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
- Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance [41.692420421029695]
We introduce GNPFA, an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space.
We then use GNPFA to extract high-quality expressions and accurate head poses from a large array of videos.
We propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation.
arXiv Detail & Related papers (2024-01-28T16:17:59Z)
- SAiD: Speech-driven Blendshape Facial Animation with Diffusion [6.4271091365094515]
Speech-driven 3D facial animation is challenging due to the scarcity of large-scale visual-audio datasets.
We propose SAiD, a speech-driven 3D facial animation method built on a diffusion model: a lightweight Transformer-based U-Net with a cross-modality alignment bias between audio and visual features to enhance lip synchronization.
arXiv Detail & Related papers (2023-12-25T04:40:32Z)
- 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing [22.30870274645442]
We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing.
Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
arXiv Detail & Related papers (2023-12-01T19:01:05Z)
- DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis.
It uses diffusion to capture the complex one-to-many relationships between speech and 3D faces.
It achieves more realistic facial animation than state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
- Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio.
To accommodate a fair degree of variation among plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network.
We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
- FaceFormer: Speech-Driven 3D Facial Animation with Transformers [46.8780140220063]
Speech-driven 3D facial animation is challenging due to the complex geometry of human faces and the limited availability of 3D audio-visual data.
We propose a Transformer-based autoregressive model, FaceFormer, which encodes the long-term audio context and autoregressively predicts a sequence of animated 3D face meshes (a minimal sketch of this autoregressive decoding pattern follows this list).
arXiv Detail & Related papers (2021-12-10T04:21:59Z)
- MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face.
Our approach ensures highly accurate lip motion, while also producing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)
- Audio- and Gaze-driven Facial Animation of Codec Avatars [149.0094713268313]
We describe the first approach to animate Codec Avatars in real time using audio and/or eye tracking.
Our goal is to display expressive conversations between individuals that exhibit important social signals.
arXiv Detail & Related papers (2020-08-11T22:28:48Z)
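
Most entries above (JoyVASA, Media2Face, SAiD, 3DiFACE, DF-3DFace) sample motion with a diffusion model, whereas FaceFormer predicts frames autoregressively. The sketch below, referenced from the FaceFormer entry, illustrates that autoregressive decoding pattern in general terms; the module internals, vertex count, and feature sizes are hypothetical stand-ins, not the published model.

```python
"""Illustrative sketch of FaceFormer-style autoregressive decoding:
encode the audio once, then predict one frame of 3D face vertices at a time,
feeding previous predictions back into the decoder. All module internals are
stubs, not the published implementation."""
import numpy as np


class AudioEncoder:
    def encode(self, waveform: np.ndarray, num_frames: int) -> np.ndarray:
        # Placeholder: long-term audio context, one feature vector per output frame.
        return np.zeros((num_frames, 128), dtype=np.float32)


class FaceDecoder:
    def __init__(self, num_vertices: int = 5023):  # vertex count is an assumption
        self.num_vertices = num_vertices

    def step(self, audio_context: np.ndarray, past_frames: list) -> np.ndarray:
        # Placeholder for causal attention over the audio context and past motion.
        return np.zeros((self.num_vertices, 3), dtype=np.float32)


def animate_autoregressive(waveform: np.ndarray, num_frames: int) -> np.ndarray:
    audio_context = AudioEncoder().encode(waveform, num_frames)
    decoder = FaceDecoder()
    frames = []
    for t in range(num_frames):
        # Each new frame conditions on the audio context and all frames generated so far.
        frames.append(decoder.step(audio_context[: t + 1], frames))
    return np.stack(frames)  # (num_frames, num_vertices, 3) vertex positions/offsets


if __name__ == "__main__":
    meshes = animate_autoregressive(np.zeros(16000, dtype=np.float32), num_frames=30)
    print(meshes.shape)
```

The practical trade-off named in the JoyVASA abstract follows from this structure: autoregressive decoding ties generation to sequential per-frame prediction, while diffusion over whole motion sequences decouples sequence length from identity but adds iterative denoising cost.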
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.