Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model
- URL: http://arxiv.org/abs/2502.09533v1
- Date: Thu, 13 Feb 2025 17:50:23 GMT
- Title: Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model
- Authors: Fei Shen, Cong Wang, Junyao Gao, Qin Guo, Jisheng Dang, Jinhui Tang, Tat-Seng Chua
- Abstract summary: We introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages.
- Score: 64.11605839142348
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in conditional diffusion models have shown promise for generating realistic TalkingFace videos, yet challenges persist in achieving consistent head movement, synchronized facial expressions, and accurate lip synchronization over extended generations. To address these, we introduce the \textbf{M}otion-priors \textbf{C}onditional \textbf{D}iffusion \textbf{M}odel (\textbf{MCDM}), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. The model consists of three key elements: (1) an archived-clip motion-prior that incorporates historical frames and a reference frame to preserve identity and context; (2) a present-clip motion-prior diffusion model that captures multimodal causality for accurate predictions of head movements, lip sync, and expressions; and (3) a memory-efficient temporal attention mechanism that mitigates error accumulation by dynamically storing and updating motion features. We also release the \textbf{TalkingFace-Wild} dataset, a multilingual collection of over 200 hours of footage across 10 languages. Experimental results demonstrate the effectiveness of MCDM in maintaining identity and motion continuity for long-term TalkingFace generation. Code, models, and datasets will be publicly available.
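The abstract describes the memory-efficient temporal attention only at a high level. The following is a minimal sketch of one way such a mechanism could be realized, assuming a fixed-size motion-feature bank that is attended over together with the present clip and then updated in FIFO fashion; the class name MotionMemoryAttention, the bank size, and the update rule are illustrative assumptions, not the released MCDM implementation.

```python
import torch
from torch import nn


class MotionMemoryAttention(nn.Module):
    """Illustrative memory-bank temporal attention (hypothetical, not MCDM's code).

    A fixed-size bank of past motion features is attended over together with
    the present clip, so attention cost does not grow with video length.
    """

    def __init__(self, dim: int, bank_size: int = 64, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.bank_size = bank_size
        # Persistent motion-feature bank: shape (1, bank_size, dim), zero-initialized.
        self.register_buffer("bank", torch.zeros(1, bank_size, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) motion features of the present clip.
        bank = self.bank.expand(x.size(0), -1, -1)
        kv = torch.cat([bank, x], dim=1)            # keys/values = memory + current clip
        out, _ = self.attn(query=x, key=kv, value=kv)
        self._update_bank(out.detach())             # store new motion features
        return out

    @torch.no_grad()
    def _update_bank(self, feats: torch.Tensor) -> None:
        # FIFO update: append batch-averaged features, keep the newest `bank_size`.
        pooled = feats.mean(dim=0, keepdim=True)    # (1, frames, dim)
        merged = torch.cat([self.bank, pooled], dim=1)
        self.bank = merged[:, -self.bank_size:]
```

Because the bank has a fixed length, the per-clip attention cost stays constant regardless of how long the generated video grows, which is the property the abstract attributes to the memory-efficient design.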
Related papers
- M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.08520614570288]
We reformulate talking-head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - MoDiT: Learning Highly Consistent 3D Motion Coefficients with Diffusion Transformer for Talking Head Generation [16.202732894319084]
MoDiT is a novel framework that combines the 3D Morphable Model (3DMM) with a Diffusion-based Transformer. Our contributions include: (i) a hierarchical denoising strategy with revised temporal attention and biased self/cross-attention mechanisms, enabling the model to refine lip synchronization; (ii) the integration of 3DMM coefficients to provide explicit spatial constraints, ensuring accurate 3D-informed optical flow prediction.
arXiv Detail & Related papers (2025-07-07T15:13:46Z) - SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios [48.09735396455107]
Hand-Object Interaction (HOI) generation has significant application potential, yet current 3D HOI motion generation approaches rely heavily on predefined 3D object models and lab-captured motion data. We propose a novel framework that combines visual priors and dynamic constraints within a synchronized diffusion process to generate HOI video and motion simultaneously.
arXiv Detail & Related papers (2025-06-03T05:04:29Z) - GENMO: A GENeralist Model for Human MOtion [64.16188966024542]
We present GENMO, a unified Generalist Model for Human Motion that bridges motion estimation and generation in a single framework. Our key insight is to reformulate motion estimation as constrained motion generation, where the output motion must precisely satisfy observed conditioning signals. Our novel architecture handles variable-length motions and mixed multimodal conditions (text, audio, video) at different time intervals, offering flexible control.
arXiv Detail & Related papers (2025-05-02T17:59:55Z) - Auto-Regressive Diffusion for Generating 3D Human-Object Interactions [5.587507490937267]
A key challenge in HOI generation is maintaining interaction consistency in long sequences.
We propose an autoregressive diffusion model (ARDHOI) that predicts the next continuous token.
Our model has been evaluated on the OMOMO and BEHAVE datasets.
arXiv Detail & Related papers (2025-03-21T02:25:59Z) - MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space [40.60429652169086]
Text-conditioned streaming motion generation requires us to predict the next-step human pose based on variable-length historical motions and incoming texts.
Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths.
We propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model.
arXiv Detail & Related papers (2025-03-19T17:32:24Z) - AR-Diffusion: Asynchronous Video Generation with Auto-Regressive Diffusion [19.98565541640125]
We introduce Auto-Regressive Diffusion (AR-Diffusion), a novel model that combines the strengths of auto-regressive and diffusion models for flexible video generation. Inspired by auto-regressive generation, we incorporate a non-decreasing constraint on the corruption timesteps of individual frames. This setup, together with temporal causal attention, enables flexible generation of videos with varying lengths while preserving temporal coherence (a sketch of the timestep constraint follows this list).
arXiv Detail & Related papers (2025-03-10T15:05:59Z) - Removing Averaging: Personalized Lip-Sync Driven Characters Based on Identity Adapter [10.608872317957026]
"lip averaging" phenomenon occurs when a model fails to preserve subtle facial details when dubbing unseen in-the-wild videos.
We propose UnAvgLip, which extracts identity embeddings from reference videos to generate highly faithful facial sequences.
arXiv Detail & Related papers (2025-03-09T02:36:31Z) - Stereo-Talker: Audio-driven 3D Human Synthesis with Prior-Guided Mixture-of-Experts [41.08576055846111]
Stereo-Talker is a novel one-shot audio-driven human video synthesis system.
It generates 3D talking videos with precise lip synchronization, expressive body gestures, temporally consistent photo-realistic quality, and continuous viewpoint control.
arXiv Detail & Related papers (2024-10-31T11:32:33Z) - Text-driven Human Motion Generation with Motion Masked Diffusion Model [23.637853270123045]
Text-driven human motion generation synthesizes human motion sequences conditioned on natural language.
Current diffusion model-based approaches have outstanding performance in the diversity and multimodality of generation.
We propose the Motion Masked Diffusion Model (MMDM), a novel masking mechanism for diffusion-based human motion generation.
arXiv Detail & Related papers (2024-09-29T12:26:24Z) - Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z) - Boosting Consistency in Story Visualization with Rich-Contextual Conditional Diffusion Models [12.907590808274358]
We propose Rich-contextual Conditional Diffusion Models (RCDMs) to enhance the semantic and temporal consistency of story generation.
Unlike autoregressive models, RCDMs can generate consistent stories with a single forward inference.
arXiv Detail & Related papers (2024-07-02T17:58:07Z) - Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model [17.98911328064481]
Co-speech gestures can achieve superior visual effects in human-machine interaction.
We present a novel motion-decoupled framework to generate co-speech gesture videos.
Our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations.
arXiv Detail & Related papers (2024-04-02T11:40:34Z) - Move as You Say, Interact as You Can: Language-guided Human Motion Generation with Scene Affordance [48.986552871497]
We introduce a novel two-stage framework that employs scene affordance as an intermediate representation.
By leveraging scene affordance maps, our method overcomes the difficulty in generating human motion under multimodal condition signals.
Our approach consistently outperforms all baselines on established benchmarks, including HumanML3D and HUMANISE.
arXiv Detail & Related papers (2024-03-26T18:41:07Z) - DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation [72.85685916829321]
DiffSHEG is a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation with arbitrary length.
By enabling the real-time generation of expressive and synchronized motions, DiffSHEG showcases its potential for various applications in the development of digital humans and embodied agents.
arXiv Detail & Related papers (2024-01-09T11:38:18Z) - Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
arXiv Detail & Related papers (2023-08-28T10:40:16Z) - Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation [55.36617538438858]
We propose a novel approach that strengthens the interaction between spatial and temporal perceptions.
We curate a large-scale and open-source video dataset called HD-VG-130M.
arXiv Detail & Related papers (2023-05-18T11:06:15Z)
This list is automatically generated from the titles and abstracts of the papers on this site.