X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
- URL: http://arxiv.org/abs/2507.23143v1
- Date: Wed, 30 Jul 2025 22:46:52 GMT
- Title: X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention
- Authors: Xiaochen Zhao, Hongyi Xu, Guoxian Song, You Xie, Chenxu Zhang, Xiu Li, Linjie Luo, Jinli Suo, Yebin Liu
- Abstract summary: X-NeMo is a zero-shot diffusion-based portrait animation pipeline. It animates a static portrait using facial movements from a driving video of a different individual.
- Score: 52.94097577075215
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We propose X-NeMo, a novel zero-shot diffusion-based portrait animation pipeline that animates a static portrait using facial movements from a driving video of a different individual. Our work first identifies the root causes of the key issues in prior approaches, such as identity leakage and difficulty in capturing subtle and extreme expressions. To address these challenges, we introduce a fully end-to-end training framework that distills a 1D identity-agnostic latent motion descriptor from the driving image, effectively controlling motion through cross-attention during image generation. Our implicit motion descriptor captures expressive facial motion in fine detail, learned end-to-end from a diverse video dataset without reliance on pretrained motion detectors. We further enhance expressiveness and disentangle motion latents from identity cues by supervising their learning with a dual GAN decoder, alongside spatial and color augmentations. By embedding the driving motion into a 1D latent vector and controlling motion via cross-attention rather than additive spatial guidance, our design eliminates the transmission of spatially aligned structural cues from the driving condition to the diffusion backbone, substantially mitigating identity leakage. Extensive experiments demonstrate that X-NeMo surpasses state-of-the-art baselines, producing highly expressive animations with superior identity resemblance. Our code and models are available for research.
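To make the conditioning mechanism concrete, below is a minimal PyTorch sketch of the core idea: spatial feature tokens from the generator attend to a single 1D motion latent via cross-attention, so no spatially aligned structure from the driving frame reaches the backbone. All class names, shapes, and dimensions here are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class MotionCrossAttention(nn.Module):
    """Cross-attention block where image features attend to a 1D motion latent.

    Because the motion code carries no spatial layout, the generator cannot
    copy structure (and hence identity) from the driving frame -- the property
    the paper relies on to mitigate identity leakage. Hypothetical sketch.
    """
    def __init__(self, dim: int, motion_dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, kdim=motion_dim,
                                          vdim=motion_dim, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) spatial feature tokens from the diffusion backbone
        # motion: (B, 1, motion_dim) identity-agnostic 1D motion descriptor
        out, _ = self.attn(query=self.norm(x), key=motion, value=motion)
        return x + out  # residual injection; no additive spatial guidance

# Illustrative usage with made-up shapes.
feats = torch.randn(2, 64 * 64, 320)   # backbone feature tokens
motion = torch.randn(2, 1, 512)        # 1D latent from a motion encoder
block = MotionCrossAttention(dim=320, motion_dim=512)
print(block(feats, motion).shape)      # torch.Size([2, 4096, 320])
```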
Related papers
- FantasyTalking: Realistic Talking Portrait Generation via Coherent Motion Synthesis [12.987186425491242]
We propose a novel framework to generate high-fidelity, coherent talking portraits with controllable motion dynamics.
In the first stage, we employ a clip-level training scheme to establish coherent global motion.
In the second stage, we refine lip movements at the frame level using a lip-tracing mask, ensuring precise synchronization with audio signals.
arXiv Detail & Related papers (2025-04-07T08:56:01Z)
- X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image.
It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv Detail & Related papers (2025-01-17T08:10:53Z)
- AniTalker: Animate Vivid and Diverse Talking Faces through Identity-Decoupled Facial Motion Encoding [24.486705010561067]
The paper introduces AniTalker, a framework designed to generate lifelike talking faces from a single portrait.
AniTalker effectively captures a wide range of facial dynamics, including subtle expressions and head movements.
arXiv Detail & Related papers (2024-05-06T02:32:41Z)
- X-Portrait: Expressive Portrait Animation with Hierarchical Motion Attention [18.211762995744337]
We propose X-Portrait, an innovative conditional diffusion model tailored for generating expressive and temporally coherent portrait animation.
Given a single portrait as appearance reference, we aim to animate it with motion derived from a driving video, capturing both highly dynamic and subtle facial expressions.
Experimental results demonstrate the universal effectiveness of X-Portrait across a diverse range of facial portraits and expressive driving sequences.
arXiv Detail & Related papers (2024-03-23T20:30:28Z)
- AnaMoDiff: 2D Analogical Motion Diffusion via Disentangled Denoising [25.839194626743126]
AnaMoDiff is a novel diffusion-based method for 2D motion analogies.
Our goal is to accurately transfer motions from a 2D driving video onto a source character while preserving its identity, in terms of both appearance and natural movement.
arXiv Detail & Related papers (2024-02-05T22:10:54Z)
- MotionCrafter: One-Shot Motion Customization of Diffusion Models [66.44642854791807]
We introduce MotionCrafter, a one-shot instance-guided motion customization method.
MotionCrafter employs a parallel spatial-temporal architecture that injects the reference motion into the temporal component of the base model.
During training, a frozen base model provides appearance normalization, effectively separating appearance from motion.
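As a rough illustration of what injecting reference motion into only the temporal pathway might look like, here is a hypothetical PyTorch sketch (not MotionCrafter's actual architecture; a frozen spatial backbone, not shown, would keep handling appearance):

```python
import torch
import torch.nn as nn

class TemporalMotionInjection(nn.Module):
    """Toy sketch: motion features modulate attention over the time axis only,
    leaving spatial (appearance) processing untouched. Hypothetical design."""
    def __init__(self, dim: int):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.motion_proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, dim) video features; motion: (B, T, dim) reference motion
        B, T, N, D = x.shape
        # fold spatial tokens into the batch so attention runs over time only
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        m = self.motion_proj(motion).repeat_interleave(N, dim=0)  # (B*N, T, D)
        out, _ = self.temporal_attn(xt + m, xt + m, xt + m)
        return (xt + out).reshape(B, N, T, D).permute(0, 2, 1, 3)
```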
arXiv Detail & Related papers (2023-12-08T16:31:04Z)
- Universal Humanoid Motion Representations for Physics-Based Control [71.46142106079292]
We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control.
We first learn a motion imitator that can imitate all human motion from a large, unstructured motion dataset.
We then create our motion representation by distilling skills directly from the imitator.
arXiv Detail & Related papers (2023-10-06T20:48:43Z)
- Priority-Centric Human Motion Generation in Discrete Latent Space [59.401128190423535]
We introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM) for text-to-motion generation.
M2DM incorporates a global self-attention mechanism and a regularization term to counteract code collapse.
We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token.
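A toy sketch of what a significance-weighted corruption schedule for discrete motion tokens could look like follows; the formula and names are hypothetical, not M2DM's published schedule:

```python
import torch

def priority_mask_schedule(significance: torch.Tensor, t: float) -> torch.Tensor:
    """Hypothetical priority-based corruption for one discrete-diffusion step.

    significance: (B, L) per-token importance in [0, 1); t: diffusion time
    in [0, 1]. Less significant tokens are corrupted earlier, so the model
    learns to recover high-priority motion tokens last.
    """
    # a token's corruption probability ramps up once t passes its priority
    corrupt_prob = (t - significance).clamp(min=0.0) / (1.0 - significance + 1e-6)
    return torch.bernoulli(corrupt_prob.clamp(max=1.0)).bool()  # True = masked

sig = torch.rand(2, 16)                    # made-up token significance scores
mask = priority_mask_schedule(sig, t=0.5)  # tokens to replace with [MASK]
print(mask.float().mean())                 # fraction corrupted at t = 0.5
```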
arXiv Detail & Related papers (2023-08-28T10:40:16Z)
- CodeTalker: Speech-Driven 3D Facial Animation with Discrete Motion Prior [27.989344587876964]
Speech-driven 3D facial animation has been widely studied, yet a gap remains in achieving realism and vividness.
We propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook.
We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively.
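The code-query idea amounts to a nearest-neighbor lookup in a learned codebook. Here is a generic VQ-style sketch in PyTorch (illustrative only, with a random codebook rather than CodeTalker's learned motion prior):

```python
import torch

def query_codebook(features: torch.Tensor, codebook: torch.Tensor) -> torch.Tensor:
    """Map continuous motion features to their nearest code vectors, so
    synthesis happens in a finite proxy space (generic VQ lookup)."""
    # features: (B, T, D), codebook: (K, D)
    dists = torch.cdist(features.flatten(0, 1), codebook)  # (B*T, K) distances
    idx = dists.argmin(dim=-1)                             # nearest code per frame
    return codebook[idx].view_as(features)

feats = torch.randn(2, 30, 64)   # e.g. speech-driven motion features (made up)
codes = torch.randn(256, 64)     # finite codebook of facial-motion codes
quantized = query_codebook(feats, codes)
```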
arXiv Detail & Related papers (2023-01-06T05:04:32Z)
- Animation from Blur: Multi-modal Blur Decomposition with Motion Guidance [83.25826307000717]
We study the challenging problem of recovering detailed motion from a single motion-blurred image.
Existing solutions to this problem estimate a single image sequence without considering the motion ambiguity for each region.
In this paper, we explicitly account for such motion ambiguity, allowing us to generate multiple plausible solutions all in sharp detail.
arXiv Detail & Related papers (2022-07-20T18:05:53Z)