HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
- URL: http://arxiv.org/abs/2508.10566v1
- Date: Thu, 14 Aug 2025 12:01:52 GMT
- Title: HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
- Authors: Shiyu Liu, Kui Jiang, Xianming Liu, Hongxun Yao, Xiaocheng Feng,
- Abstract summary: HM-Talker is a novel framework for generating high-fidelity, temporally coherent talking heads. Its explicit cues are Action Units (AUs), which are anatomically defined facial muscle movements, used alongside implicit features to minimize phoneme-viseme misalignment.
- Score: 55.92704600574577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Audio-driven talking head video generation enhances user engagement in human-computer interaction. However, current methods frequently produce videos with motion blur and lip jitter, primarily because they rely on implicit modeling of audio-facial motion correlations, an approach that lacks explicit articulatory priors (i.e., anatomical guidance for speech-related facial movements). To overcome this limitation, we propose HM-Talker, a novel framework for generating high-fidelity, temporally coherent talking heads. HM-Talker leverages a hybrid motion representation that combines implicit and explicit motion cues. The explicit cues are Action Units (AUs), anatomically defined facial muscle movements, used alongside implicit features to minimize phoneme-viseme misalignment. Specifically, our Cross-Modal Disentanglement Module (CMDM) extracts complementary implicit/explicit motion features while predicting AUs directly from audio input aligned to visual cues. To mitigate identity-dependent biases in the explicit features and enhance cross-subject generalization, we introduce the Hybrid Motion Modeling Module (HMMM). This module dynamically merges randomly paired implicit/explicit features, enforcing identity-agnostic learning. Together, these components enable robust lip synchronization across diverse identities, advancing personalized talking head synthesis. Extensive experiments demonstrate HM-Talker's superiority over state-of-the-art methods in visual quality and lip-sync accuracy.
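The paper's implementation is not public, so the following minimal PyTorch sketch only illustrates the hybrid idea described in the abstract: a CMDM-like module predicting implicit features and explicit AU intensities from audio, and an HMMM-like step that randomly re-pairs the two streams across the batch to discourage identity-specific shortcuts. All module names, dimensions, and the pairing scheme below are assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class CrossModalDisentangler(nn.Module):
    """Toy stand-in for the CMDM: maps an audio feature to an implicit
    motion embedding plus an explicit Action Unit (AU) prediction."""
    def __init__(self, audio_dim=128, motion_dim=64, num_aus=17):
        super().__init__()
        self.implicit_head = nn.Sequential(
            nn.Linear(audio_dim, motion_dim), nn.ReLU(),
            nn.Linear(motion_dim, motion_dim))
        self.au_head = nn.Sequential(
            nn.Linear(audio_dim, motion_dim), nn.ReLU(),
            nn.Linear(motion_dim, num_aus), nn.Sigmoid())  # AU intensities in [0, 1]

    def forward(self, audio_feat):
        return self.implicit_head(audio_feat), self.au_head(audio_feat)

def hybrid_motion_mix(implicit, explicit_aus):
    """Toy stand-in for the HMMM: re-pair implicit and explicit features
    across the batch so downstream layers cannot latch onto
    identity-specific correlations between the two streams."""
    perm = torch.randperm(explicit_aus.size(0))
    return torch.cat([implicit, explicit_aus[perm]], dim=-1)

# Dummy usage: a batch of 4 audio frames with 128-dim features.
cmdm = CrossModalDisentangler()
implicit, aus = cmdm(torch.randn(4, 128))
hybrid = hybrid_motion_mix(implicit, aus)
print(hybrid.shape)  # torch.Size([4, 81])
```

Shuffling only the explicit stream is one plausible reading of "randomly paired implicit/explicit features"; the actual pairing strategy in HM-Talker may differ.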
Related papers
- MIBURI: Towards Expressive Interactive Gesture Synthesis [62.45332399212876]
Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions.
Existing solutions for ECAs produce rigid, low-diversity motions that are unsuitable for human-like interaction.
We present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue.
arXiv Detail & Related papers (2026-03-03T18:59:51Z)
- IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer [35.816717494490725]
IMTalker is a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer.
To preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module.
A lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues.
arXiv Detail & Related papers (2025-11-27T07:12:26Z)
- DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis [15.304037069236536]
DEMO is a flow-matching generative framework for audio-driven talking-head video synthesis.
It delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze.
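Both IMTalker and DEMO above are described as flow-matching frameworks. As a point of reference, here is a minimal sketch of the generic conditional flow-matching objective (linear interpolation path, constant-velocity target); the velocity network and the conditioning dimensions are placeholders, not either paper's model.

```python
import torch
import torch.nn as nn

# Placeholder velocity field over 64-dim motion latents, conditioned on
# time and a 32-dim audio embedding (dimensions are assumptions).
velocity_net = nn.Sequential(
    nn.Linear(64 + 1 + 32, 128), nn.SiLU(), nn.Linear(128, 64))

def flow_matching_loss(x1, cond):
    """x1: target motion latents (B, 64); cond: audio conditioning (B, 32)."""
    x0 = torch.randn_like(x1)            # noise endpoint of the path
    t = torch.rand(x1.size(0), 1)        # uniform time in [0, 1]
    x_t = (1 - t) * x0 + t * x1          # linear interpolation path
    target_v = x1 - x0                   # constant velocity along that path
    pred_v = velocity_net(torch.cat([x_t, t, cond], dim=-1))
    return ((pred_v - target_v) ** 2).mean()

loss = flow_matching_loss(torch.randn(8, 64), torch.randn(8, 32))
loss.backward()  # trains the velocity field; sampling then integrates it from noise
```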
arXiv Detail & Related papers (2025-10-12T15:10:33Z)
- SayAnything: Audio-Driven Lip Synchronization with Conditional Video Diffusion [78.77211425667542]
SayAnything is a conditional video diffusion framework that directly synthesizes lip movements from audio input.
Our novel design effectively balances different condition signals in the latent space, enabling precise control over appearance, motion, and region-specific generation.
arXiv Detail & Related papers (2025-02-17T07:29:36Z)
- GoHD: Gaze-oriented and Highly Disentangled Portrait Animation with Rhythmic Poses and Realistic Expression [33.886734972316326]
GoHD is a framework designed to produce highly realistic, expressive, and controllable portrait videos from any reference identity with any motion.
An animation module utilizing latent navigation is introduced to improve generalization across unseen input styles.
A conformer-structured conditional diffusion model is designed to generate prosody-aware head poses.
A two-stage training strategy is devised to decouple frequent, frame-wise lip motion distillation from the generation of other motions that are more temporally dependent but less audio-related.
arXiv Detail & Related papers (2024-12-12T14:12:07Z)
- High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model [89.29655924125461]
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by the landmarks.
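As an illustration of the two-stage idea in this entry (audio to landmarks, then landmark-conditioned alignment), here is a minimal sketch; the layer sizes and the cross-attention conditioner standing in for TalkFormer are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Stage 1: a less ambiguous audio-to-landmark mapping
# (20 hypothetical lip/jaw points as (x, y) pairs -> 40 values).
audio_to_landmarks = nn.Sequential(
    nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 40))

class LandmarkConditioner(nn.Module):
    """Stage 2: align synthesized motion features with landmark motion via
    cross-attention, a generic stand-in for TalkFormer-style conditioning."""
    def __init__(self, feat_dim=64, lmk_dim=40):
        super().__init__()
        self.lmk_proj = nn.Linear(lmk_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, num_heads=4, batch_first=True)

    def forward(self, motion_feat, landmarks):
        # motion_feat: (B, T, feat_dim); landmarks: (B, T, lmk_dim)
        lmk = self.lmk_proj(landmarks)
        aligned, _ = self.attn(query=motion_feat, key=lmk, value=lmk)
        return motion_feat + aligned  # residual alignment with landmark motion

# Dummy usage: 2 clips of 25 frames each.
lmk = audio_to_landmarks(torch.randn(2, 25, 128))
out = LandmarkConditioner()(torch.randn(2, 25, 64), lmk)
print(out.shape)  # torch.Size([2, 25, 64])
```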
arXiv Detail & Related papers (2024-08-10T02:58:28Z)
- FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio [45.71036380866305]
We abstract the process by which people hear speech, extract meaningful cues, and create dynamically audio-consistent talking faces, all from a single audio clip.
Specifically, it involves two critical challenges: one is to effectively decouple identity, content, and emotion from entangled audio, and the other is to maintain intra-video diversity and inter-video consistency.
We introduce Controllable Coherent Frame generation, which flexibly integrates three trainable adapters with frozen Latent Diffusion Models.
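The "trainable adapters with frozen Latent Diffusion Models" pattern is a general one; the sketch below shows the generic frozen-backbone-plus-adapter recipe with placeholder modules, not FaceChain-ImagineID's actual adapters.

```python
import torch
import torch.nn as nn

backbone = nn.Linear(64, 64)           # stand-in for a frozen pretrained LDM block
for p in backbone.parameters():
    p.requires_grad = False            # keep pretrained weights fixed

class Adapter(nn.Module):
    """Small trainable residual module injected alongside the frozen block."""
    def __init__(self, dim=64, bottleneck=8):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

adapter = Adapter()                    # only these weights receive gradients
out = adapter(backbone(torch.randn(2, 64)))
out.sum().backward()
print(backbone.weight.grad is None, adapter.up.weight.grad is not None)  # True True
```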
arXiv Detail & Related papers (2024-03-04T09:59:48Z)
- FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models [85.16273912625022]
We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an audio signal.
To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of human heads.
arXiv Detail & Related papers (2023-12-13T19:01:07Z)
- Listen, Denoise, Action! Audio-Driven Motion Synthesis with Diffusion Models [22.000197530493445]
We show that diffusion models are an excellent fit for synthesising human motion that co-occurs with audio.
We adapt the DiffWave architecture to model 3D pose sequences, putting Conformers in place of dilated convolutions for improved modelling power.
Experiments on gesture and dance generation confirm that the proposed method achieves top-of-the-line motion quality.
arXiv Detail & Related papers (2022-11-17T17:41:00Z)
- StyleTalker: One-shot Style-based Audio-driven Talking Head Video Generation [47.06075725469252]
StyleTalker is an audio-driven talking head generation model.
It can synthesize a video of a talking person from a single reference image.
Our model is able to synthesize talking head videos with impressive perceptual quality.
arXiv Detail & Related papers (2022-08-23T12:49:01Z)
- Learning to Listen: Modeling Non-Deterministic Dyadic Facial Motion [89.01668641930206]
We present a framework for modeling interactional communication in dyadic conversations.
We autoregressively output multiple possibilities of corresponding listener motion.
Our method organically captures the multimodal and non-deterministic nature of nonverbal dyadic interactions.
arXiv Detail & Related papers (2022-04-18T17:58:04Z)
- Let's Face It: Probabilistic Multi-modal Interlocutor-aware Generation of Facial Gestures in Dyadic Settings [11.741529272872219]
To enable more natural face-to-face interactions, conversational agents need to adapt their behavior to their interlocutors.
Most existing gesture-generating systems do not utilize multi-modal cues from the interlocutor when synthesizing non-verbal behavior.
We introduce a probabilistic method to synthesize interlocutor-aware facial gestures in dyadic conversations.
arXiv Detail & Related papers (2020-06-11T14:11:51Z)