Stable Video-Driven Portraits
- URL: http://arxiv.org/abs/2509.17476v1
- Date: Mon, 22 Sep 2025 08:11:08 GMT
- Title: Stable Video-Driven Portraits
- Authors: Mallikarjun B. R., Fei Yin, Vikram Voleti, Nikita Drobyshev, Maksim Lapin, Aaryaman Vasishta, Varun Jampani
- Abstract summary: Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
- Score: 52.008400639227034
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Early methods relied on 3D morphable models or feature warping techniques and often suffered from limited expressivity, temporal inconsistency, and poor generalization to unseen identities or large pose variations. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. In this work, we propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues. To enable robust training without appearance leakage, we adopt cross-identity supervision. To leverage the strong prior of the pretrained diffusion model, our architecture introduces minimal new parameters, which converge faster and help the model generalize better. We introduce spatial-temporal attention mechanisms that allow inter-frame and intra-frame interactions, effectively capturing subtle motions and reducing temporal artifacts. Our model uses history frames to ensure continuity across segments. At inference, we propose a novel signal fusion strategy that balances motion fidelity with identity preservation. Our approach achieves superior temporal consistency and accurate expression control, enabling high-quality, controllable portrait animation suitable for real-world applications.
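The masked-region control cue described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `mask_facial_regions`, the box format, and the hand-picked box coordinates are all assumptions made for illustration, and a real pipeline would presumably obtain the eye, nose, and mouth regions from a face landmark detector rather than fixed boxes. The sketch only shows the basic idea of keeping the pixels of those regions from a driving frame and blanking everything else.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right), bottom/right exclusive
Frame = List[List[int]]          # grayscale frame as rows of pixel values


def mask_facial_regions(frame: Frame, boxes: Dict[str, Box], fill: int = 0) -> Frame:
    """Keep pixels inside any region box; replace everything else with `fill`."""
    height, width = len(frame), len(frame[0])
    keep = [[False] * width for _ in range(height)]
    for top, left, bottom, right in boxes.values():
        # Clamp each box to the frame so out-of-range boxes are harmless.
        for y in range(max(top, 0), min(bottom, height)):
            for x in range(max(left, 0), min(right, width)):
                keep[y][x] = True
    return [
        [frame[y][x] if keep[y][x] else fill for x in range(width)]
        for y in range(height)
    ]


# Hypothetical 6x6 driving frame and hand-picked boxes (illustrative values only).
driving_frame = [[10 * y + x + 1 for x in range(6)] for y in range(6)]
region_boxes = {
    "eyes": (1, 1, 2, 5),
    "nose": (2, 2, 4, 4),
    "mouth": (4, 1, 5, 5),
}
control_signal = mask_facial_regions(driving_frame, region_boxes)
```

Here `control_signal` retains only the pixels inside the three boxes, so appearance outside those regions (hair, background, face outline) is not passed along with the motion cue.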
Related papers
- IM-Animation: An Implicit Motion Representation for Identity-decoupled Character Animation [58.297199313494]
Implicit methods capture motion semantics directly from driving video, but suffer from identity leakage and entanglement between motion and appearance. We propose a novel implicit motion representation that compresses per-frame motion into compact 1D motion tokens. Our methodology employs a three-stage training strategy to enhance the training efficiency and ensure high fidelity.
arXiv Detail & Related papers (2026-02-07T11:17:20Z)
- High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer [17.388852038062705]
We propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module.
arXiv Detail & Related papers (2025-12-26T07:36:48Z)
- DeX-Portrait: Disentangled and Expressive Portrait Animation via Explicit and Latent Motion Representations [31.845995837468536]
We propose DeX-Portrait, a novel approach capable of generating expressive portrait animation driven by disentangled pose and expression signals. First, we design a powerful motion trainer to learn both pose and expression encoders for extracting precise and decomposed driving signals. Experiments show that our method outperforms state-of-the-art baselines on both animation quality and disentangled controllability.
arXiv Detail & Related papers (2025-12-17T15:23:57Z)
- PersonaLive! Expressive Portrait Image Animation for Live Streaming [53.63615310186964]
PersonaLive is a novel diffusion-based framework towards streaming real-time portrait animation. We first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.
arXiv Detail & Related papers (2025-12-12T03:24:40Z)
- Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video. We verify FRESCO adaptations on two zero-shot tasks of video-to-video translation and text-guided video editing.
arXiv Detail & Related papers (2025-12-03T15:51:11Z)
- TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model [18.910745982208965]
TalkingPose is a novel diffusion-based framework for producing temporally consistent human upper-body animations. We introduce a feedback-driven mechanism built upon image-based diffusion models to ensure continuous motion and enhance temporal coherence. We also introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
arXiv Detail & Related papers (2025-11-30T14:26:24Z)
- X-NeMo: Expressive Neural Motion Reenactment via Disentangled Latent Attention [52.94097577075215]
X-NeMo is a zero-shot diffusion-based portrait animation pipeline. It animates a static portrait using facial movements from a driving video of a different individual.
arXiv Detail & Related papers (2025-07-30T22:46:52Z)
- EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation [58.41979933166173]
EvAnimate is the first method leveraging event streams as robust and precise motion cues for conditional human image animation. High-quality and temporally coherent animations are achieved through a dual-branch architecture. Experiment results show EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
arXiv Detail & Related papers (2025-03-24T11:05:41Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which uses intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- FAAC: Facial Animation Generation with Anchor Frame and Conditional Control for Superior Fidelity and Editability [14.896554342627551]
We introduce a facial animation generation method that enhances both face identity fidelity and editing capabilities.
This approach incorporates the concept of an anchor frame to counteract the degradation of generative ability in original text-to-image models.
Our method's efficacy has been validated on multiple representative DreamBooth and LoRA models.
arXiv Detail & Related papers (2023-12-06T02:55:35Z)
- Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation [27.700371215886683]
Diffusion models have become the mainstream in visual generation research, owing to their robust generative capabilities.
In this paper, we propose a novel framework tailored for character animation.
By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods.
arXiv Detail & Related papers (2023-11-28T12:27:15Z)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model [74.84435399451573]
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving the reference image faithfully, and improving animation fidelity.
arXiv Detail & Related papers (2023-11-27T18:32:31Z)
- High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.