IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer
- URL: http://arxiv.org/abs/2511.22167v1
- Date: Thu, 27 Nov 2025 07:12:26 GMT
- Title: IMTalker: Efficient Audio-driven Talking Face Generation with Implicit Motion Transfer
- Authors: Bo Chen, Tao Liu, Qi Chen, Xie Chen, Zilong Zheng
- Abstract summary: IMTalker is a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. To preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module. A lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Talking face generation aims to synthesize realistic speaking portraits from a single image, yet existing methods often rely on explicit optical flow and local warping, which fail to model complex global motions and cause identity drift. We present IMTalker, a novel framework that achieves efficient and high-fidelity talking face generation through implicit motion transfer. The core idea is to replace traditional flow-based warping with a cross-attention mechanism that implicitly models motion discrepancy and identity alignment within a unified latent space, enabling robust global motion rendering. To further preserve speaker identity during cross-identity reenactment, we introduce an identity-adaptive module that projects motion latents into personalized spaces, ensuring clear disentanglement between motion and identity. In addition, a lightweight flow-matching motion generator produces vivid and controllable implicit motion vectors from audio, pose, and gaze cues. Extensive experiments demonstrate that IMTalker surpasses prior methods in motion accuracy, identity preservation, and audio-lip synchronization, achieving state-of-the-art quality with superior efficiency, operating at 40 FPS for video-driven and 42 FPS for audio-driven generation on an RTX 4090 GPU. We will release our code and pre-trained models to facilitate applications and future research.
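To make the core idea concrete, here is a minimal, hypothetical PyTorch sketch of an implicit motion-transfer block: cross-attention lets source identity tokens aggregate global motion from driving motion tokens, and a FiLM-style modulation stands in for the identity-adaptive projection of motion latents. The class name, token shapes, and dimensions are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class ImplicitMotionTransfer(nn.Module):
    """Hypothetical sketch of the two ideas in the abstract: (1) cross-attention
    replacing explicit optical-flow warping, and (2) an identity-adaptive
    projection of motion latents. All names and shapes are illustrative."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        # Identity-adaptive module (assumed FiLM-style): maps a global identity
        # embedding to a per-channel scale/shift applied to the motion tokens.
        self.id_mod = nn.Linear(dim, 2 * dim)

    def forward(self, identity_tokens, motion_tokens, id_embed):
        # identity_tokens: (B, N_id, dim)  appearance latents of the source image
        # motion_tokens:   (B, N_mot, dim) implicit motion vectors (video/audio driven)
        # id_embed:        (B, dim)        identity embedding of the source speaker
        scale, shift = self.id_mod(id_embed).chunk(2, dim=-1)
        motion = motion_tokens * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        q = self.norm_q(identity_tokens)
        kv = self.norm_kv(motion)
        # Every identity token attends to all motion tokens, so global head
        # motion is modeled without local warping fields.
        out, _ = self.attn(q, kv, kv)
        return identity_tokens + out  # residual keeps source appearance intact
```

The residual connection is one plausible way to keep the source appearance dominant while the attention output injects global motion.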
Related papers
- IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation
Implicit methods capture motion semantics directly from the driving video, but suffer from identity leakage and entanglement between motion and appearance. We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. Our method employs a three-stage training strategy to improve training efficiency and ensure high fidelity.
arXiv Detail & Related papers (2026-02-07T11:17:20Z) - MAGIC-Talk: Motion-aware Audio-Driven Talking Face Generation with Customizable Identity Control
MAGIC-Talk is a one-shot diffusion-based framework for customizable talking face generation. ReferenceNet preserves identity and enables fine-grained facial editing via text prompts. AnimateNet enhances motion coherence using structured motion priors.
arXiv Detail & Related papers (2025-10-26T19:49:31Z) - DEMO: Disentangled Motion Latent Flow Matching for Fine-Grained Controllable Talking Portrait Synthesis
DEMO is a flow-matching generative framework for audio-driven talking-head video synthesis. It delivers disentangled, high-fidelity control of lip motion, head pose, and eye gaze.
arXiv Detail & Related papers (2025-10-12T15:10:33Z) - HM-Talker: Hybrid Motion Modeling for High-Fidelity Talking Head Synthesis
HM-Talker is a novel framework for generating high-fidelity, temporally coherent talking heads. It combines explicit cues, Action Units (AUs) encoding anatomically defined facial muscle movements, with implicit features to minimize phoneme-viseme misalignment.
arXiv Detail & Related papers (2025-08-14T12:01:52Z) - M2DAO-Talker: Harmonizing Multi-granular Motion Decoupling and Alternating Optimization for Talking-head Generation
We reformulate talking-head generation into a unified framework comprising video preprocessing, motion representation, and rendering reconstruction. M2DAO-Talker achieves state-of-the-art performance, with a 2.43 dB PSNR improvement in generation quality and a 0.64 gain in user-evaluated video realness.
arXiv Detail & Related papers (2025-07-11T04:48:12Z) - FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics. In the first stage, we employ a clip-level training scheme to establish coherent global motion. In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z) - PortraitTalk: Towards Customizable One-Shot Audio-to-Talking Face Generation
We introduce a novel, customizable one-shot audio-driven talking face generation framework, named PortraitTalk. Our proposed method utilizes a latent diffusion framework consisting of two main components: IdentityNet and AnimateNet. A key innovation of PortraitTalk is the incorporation of text prompts through decoupled cross-attention mechanisms.
arXiv Detail & Related papers (2024-12-10T18:51:31Z) - FLOAT: Generative Motion Latent Flow Matching for Audio-driven Talking Portrait
FLOAT is an audio-driven talking portrait video generation method based on a flow-matching generative model. Our method supports speech-driven emotion enhancement, enabling the natural incorporation of expressive motions.
arXiv Detail & Related papers (2024-12-02T02:50:07Z) - High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model
We propose a novel landmark-based diffusion model for talking face generation.
We first establish a less ambiguous mapping from audio to the landmark motion of the lips and jaw.
Then, we introduce an innovative conditioning module called TalkFormer to align the synthesized motion with the motion represented by landmarks.
arXiv Detail & Related papers (2024-08-10T02:58:28Z) - AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding
The paper introduces AniTalker, a framework designed to generate lifelike talking faces from a single portrait.
AniTalker effectively captures a wide range of facial dynamics, including subtle expressions and head movements.
arXiv Detail & Related papers (2024-05-06T02:32:41Z)
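Several of the papers above (IMTalker itself, DEMO, FLOAT) drive their motion generators with flow matching. The sketch below shows one common conditional flow-matching objective over flattened motion latents; the `velocity_net(x_t, t, cond)` signature, argument layout, and linear interpolation path are illustrative assumptions, not any paper's exact formulation.

```python
import torch
import torch.nn as nn

def flow_matching_loss(velocity_net: nn.Module,
                       motion_target: torch.Tensor,
                       cond: torch.Tensor) -> torch.Tensor:
    """Illustrative rectified-flow-style objective for a motion generator:
    the network predicts the velocity transporting Gaussian noise (t=0) to
    the target motion latents (t=1), conditioned on audio/pose/gaze features
    `cond`. Names and shapes are assumptions, not the authors' API."""
    # motion_target: (B, D) flattened motion vectors; cond: (B, C) condition.
    noise = torch.randn_like(motion_target)
    t = torch.rand(motion_target.size(0), 1, device=motion_target.device)
    # Point on the straight-line path between noise and data at time t.
    x_t = (1.0 - t) * noise + t * motion_target
    target_velocity = motion_target - noise
    # velocity_net could be any module taking (x_t, t, cond), e.g. a small MLP.
    pred = velocity_net(x_t, t, cond)
    return torch.mean((pred - target_velocity) ** 2)
```

At inference time, samples are drawn by integrating the learned velocity field from Gaussian noise to t = 1 with a handful of Euler steps; this cheap sampling is what makes flow-matching motion generators compatible with the real-time frame rates reported above.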