Related papers: 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing

URL: http://arxiv.org/abs/2312.00870v1
Date: Fri, 1 Dec 2023 19:01:05 GMT
Title: 3DiFACE: Diffusion-based Speech-driven 3D Facial Animation and Editing
Authors: Balamurugan Thambiraja, Sadegh Aliakbarian, Darren Cosker, Justus Thies
Abstract summary: We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing. Our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.
Score: 22.30870274645442
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: We present 3DiFACE, a novel method for personalized speech-driven 3D facial animation and editing. While existing methods deterministically predict facial animations from speech, they overlook the inherent one-to-many relationship between speech and facial expressions, i.e., there are multiple reasonable facial expression animations matching an audio input. It is especially important in content creation to be able to modify generated motion or to specify keyframes. To enable stochasticity as well as motion editing, we propose a lightweight audio-conditioned diffusion model for 3D facial motion. This diffusion model can be trained on a small 3D motion dataset, maintaining expressive lip motion output. In addition, it can be finetuned for specific subjects, requiring only a short video of the person. Through quantitative and qualitative evaluations, we show that our method outperforms existing state-of-the-art techniques and yields speech-driven animations with greater fidelity and diversity.

Related papers

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model [73.30619724574642]
Speech-driven 3D facial animation aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation. We propose a novel autoregressive diffusion model that processes audio in a streaming manner.
arXiv Detail & Related papers (2025-11-18T07:55:16Z)
3DiFACE: Synthesizing and Editing Holistic 3D Facial Animation [25.71615538597267]
We present 3DiFACE, a novel method for holistic speech-driven 3D facial animation. Our approach produces diverse plausible lip and head motions for a single audio input. We employ a speaking-style personalization and a novel sparsely-guided motion diffusion to enable precise control and editing.
arXiv Detail & Related papers (2025-09-30T13:30:01Z)
Think2Sing: Orchestrating Structured Motion Subtitles for Singing-Driven 3D Head Animation [69.50178144839275]
Singing involves richer emotional nuance, dynamic prosody, and lyric-based semantics. Existing speech-driven approaches often produce oversimplified, emotionally flat, and semantically inconsistent results. Think2Sing generates semantically coherent and temporally consistent 3D head animations conditioned on both lyrics and acoustics.
arXiv Detail & Related papers (2025-09-02T12:59:27Z)
EmoDiffusion: Enhancing Emotional 3D Facial Animation with Latent Diffusion Models [66.67979602235015]
EmoDiffusion is a novel approach that disentangles different emotions in speech to generate rich 3D emotional facial expressions. We capture facial expressions under the guidance of animation experts using LiveLinkFace on an iPhone.
arXiv Detail & Related papers (2025-03-14T02:54:22Z)
ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model [41.35209566957009]
Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. We introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks.
arXiv Detail & Related papers (2025-02-27T17:49:01Z)
MMHead: Towards Fine-grained Multi-modal 3D Facial Animation [68.04052669266174]
We construct a large-scale multi-modal 3D facial animation dataset, MMHead. MMHead consists of 49 hours of 3D facial motion sequences, speech audios, and rich hierarchical text annotations. Based on the MMHead dataset, we establish benchmarks for two new tasks: text-induced 3D talking head animation and text-to-3D facial motion generation.
arXiv Detail & Related papers (2024-10-10T09:37:01Z)
Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation [41.489700112318864]
Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. We introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding. We also propose a novel framework called Mimic to learn disentangled representations of the speaking style and content from facial motions.
arXiv Detail & Related papers (2023-12-18T01:49:42Z)
DiffusionTalker: Personalization and Acceleration for Speech-Driven 3D Face Diffuser [12.576421368393113]
Speech-driven 3D facial animation has been an attractive task in academia and industry. Recent approaches start to consider the non-deterministic fact of speech-driven 3D face animation and employ the diffusion model for the task. We propose DiffusionTalker, a diffusion-based method that utilizes contrastive learning to personalize 3D facial animation and knowledge distillation to accelerate 3D animation generation.
arXiv Detail & Related papers (2023-11-28T07:13:20Z)
AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation [49.4220768835379]
AdaMesh is a novel adaptive speech-driven facial animation approach. It learns the personalized talking style from a reference video of about 10 seconds. It generates vivid facial expressions and head poses.
arXiv Detail & Related papers (2023-10-11T06:56:08Z)
FaceDiffuser: Speech-Driven 3D Facial Animation Synthesis Using Diffusion [0.0]
We present FaceDiffuser, a non-deterministic deep learning model to generate speech-driven facial animations. Our method is based on the diffusion technique and uses the pre-trained large speech representation model HuBERT to encode the audio input. We also introduce a new in-house dataset that is based on a blendshape-based rigged character.
arXiv Detail & Related papers (2023-09-20T13:33:00Z)
DF-3DFace: One-to-Many Speech Synchronized 3D Face Animation with Diffusion [68.85904927374165]
We propose DF-3DFace, a diffusion-driven speech-to-3D face mesh synthesis. It captures the complex one-to-many relationships between speech and 3D face based on diffusion. It simultaneously achieves more realistic facial animation than the state-of-the-art methods.
arXiv Detail & Related papers (2023-08-23T04:14:55Z)
Audio-Driven Talking Face Generation with Diverse yet Realistic Facial Animations [61.65012981435094]
DIRFA is a novel method that can generate talking faces with diverse yet realistic facial animations from the same driving audio. To accommodate fair variation of plausible facial animations for the same audio, we design a transformer-based probabilistic mapping network. We show that DIRFA can generate talking faces with realistic facial animations effectively.
arXiv Detail & Related papers (2023-04-18T12:36:15Z)
Imitator: Personalized Speech-driven 3D Facial Animation [63.57811510502906]
State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies of the target actor. We present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video. We show that our approach produces temporally coherent facial expressions from input audio while preserving the speaking style of the target actors.
arXiv Detail & Related papers (2022-12-30T19:00:02Z)
MeshTalk: 3D Face Animation from Speech using Cross-Modality Disentanglement [142.9900055577252]
We propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. Our approach ensures highly accurate lip motion, while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion.
arXiv Detail & Related papers (2021-04-16T17:05:40Z)

This list is automatically generated from the titles and abstracts of the papers on this site.