Related papers: DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis

DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis

URL: http://arxiv.org/abs/2510.10650v1
Date: Sun, 12 Oct 2025 15:10:33 GMT
Title: DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
Authors: Peiyin Chen, Zhuowei Yang, Hui Feng, Sheng Jiang, Rui Yan,
Abstract summary: DEMO is a flow-matching generative framework for audio-driven talking-head video synthesis.<n>It delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze.
Score: 15.304037069236536
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Audio-driven talking-head generation has advanced rapidly with diffusion-based generative models, yet producing temporally coherent videos with fine-grained motion control remains challenging. We propose DEMO, a flow-matching generative framework for audio-driven talking-portrait video synthesis that delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze. The core contribution is a motion auto-encoder that builds a structured latent space in which motion factors are independently represented and approximately orthogonalized. On this disentangled motion space, we apply optimal-transport-based flow matching with a transformer predictor to generate temporally smooth motion trajectories conditioned on audio. Extensive experiments across multiple benchmarks show that DEMO outperforms prior methods in video realism, lip-audio synchronization, and motion fidelity. These results demonstrate that combining fine-grained motion disentanglement with flow-based generative modeling provides a powerful new paradigm for controllable talking-head video synthesis.

Related papers

Causal Motion Diffusion Models for Autoregressive Motion Generation [19.61051102039212]
Causal Motion Diffusion Models (CMDM) is a unified framework for autoregressive motion generation.<n> CMDM builds upon a Motion-Language-Aligned Causal VAE (MAC-VAE), which encodes motion sequences into temporally causal latent representations.<n>Experiments on HumanML3D and SnapMoGen demonstrate CMDM outperforms existing diffusion and autoregressive models in both semantic fidelity and temporal smoothness.
arXiv Detail & Related papers (2026-02-26T03:58:25Z)
MotionVerse: A Unified Multimodal Framework for Motion Comprehension, Generation and Editing [53.98607267063729]
MotionVerse is a framework to comprehend, generate, and edit human motion in both single-person and multi-person scenarios.<n>We employ a motion tokenizer with residual quantization, which converts continuous motion sequences into multi-stream discrete tokens.<n>We also introduce a textitDelay Parallel Modeling strategy, which temporally staggers the encoding of residual token streams.
arXiv Detail & Related papers (2025-09-28T04:20:56Z)
MotionFlux: Efficient Text-Guided Motion Generation through Rectified Flow Matching and Preference Alignment [38.42799902378583]
Motion generation is essential for animating virtual characters and embodied agents.<n>TAPO and MotionFLUX form a unified system that outperforms state-of-the-art approaches in both semantic consistency and motion quality.
arXiv Detail & Related papers (2025-08-27T02:45:09Z)
HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis [55.92704600574577]
HM-Talker is a novel framework for generating high-fidelity, temporally coherent talking heads.<n>Explicit cues use Action Units (AUs), anatomically defined facial muscle movements, alongside implicit features to minimize phoneme-viseme misalignment.
arXiv Detail & Related papers (2025-08-14T12:01:52Z)
READ: Real-time and Efficient Asynchronous Diffusion for Audio-driven Talking Head Generation [55.58089937219475]
We propose READ, the first real-time diffusion-transformer-based talking head generation framework.<n>Our approach first learns highly compressed video latent space via a VAE, significantly reducing the token count to speech generation.<n>We show that READ outperforms state-of-the-art methods by generating competitive talking head videos with significantly reduced runtime.
arXiv Detail & Related papers (2025-08-05T13:57:03Z)
M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation [65.48046909056468]
We reformulate talking head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction.<n>M2DAO-Talker achieves state-of-the-art performance, with the 2.43 dB PSNR improvement in generation quality and 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z)
Ditto: Motion-Space Diffusion for Controllable Realtime Talking Head Synthesis [27.43583075023949]
Ditto is a diffusion-based talking head framework that enables fine-grained controls and real-time inference.<n>We show that Ditto generates compelling talking head videos and exhibits superiority in both controllability and real-time performance.
arXiv Detail & Related papers (2024-11-29T07:01:31Z)
LaMD: Latent Motion Diffusion for Image-Conditional Video Generation [63.34574080016687]
latent motion diffusion (LaMD) framework consists of a motion-decomposed video autoencoder and a diffusion-based motion generator.<n>LaMD generates high-quality videos on various benchmark datasets, including BAIR, Landscape, NATOPS, MUG and CATER-GEN.
arXiv Detail & Related papers (2023-04-23T10:32:32Z)
MoFusion: A Framework for Denoising-Diffusion-based Motion Synthesis [73.52948992990191]
MoFusion is a new denoising-diffusion-based framework for high-quality conditional human motion synthesis. We present ways to introduce well-known kinematic losses for motion plausibility within the motion diffusion framework. We demonstrate the effectiveness of MoFusion compared to the state of the art on established benchmarks in the literature.
arXiv Detail & Related papers (2022-12-08T18:59:48Z)

This list is automatically generated from the titles and abstracts of the papers in this site.