TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model
- URL: http://arxiv.org/abs/2512.00909v1
- Date: Sun, 30 Nov 2025 14:26:24 GMT
- Title: TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model
- Authors: Alireza Javanmardi, Pragati Jaiswal, Tewodros Amberbir Habtegebrial, Christen Millerdurai, Shaoxiang Wang, Alain Pagani, Didier Stricker
- Abstract summary: TalkingPose is a novel diffusion-based framework for producing temporally consistent human upper-body animations. We introduce a feedback-driven mechanism built upon image-based diffusion models to ensure continuous motion and enhance temporal coherence. We also introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
- Score: 18.910745982208965
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advancements in diffusion models have significantly improved the realism and generalizability of character-driven animation, enabling the synthesis of high-quality motion from just a single RGB image and a set of driving poses. Nevertheless, generating temporally coherent long-form content remains challenging. Existing approaches are constrained by computational and memory limitations, as they are typically trained on short video segments, thus performing effectively only over limited frame lengths and hindering their potential for extended coherent generation. To address these constraints, we propose TalkingPose, a novel diffusion-based framework specifically designed for producing long-form, temporally consistent human upper-body animations. TalkingPose leverages driving frames to precisely capture expressive facial and hand movements, transferring these seamlessly to a target actor through a stable diffusion backbone. To ensure continuous motion and enhance temporal coherence, we introduce a feedback-driven mechanism built upon image-based diffusion models. Notably, this mechanism does not incur additional computational costs or require secondary training stages, enabling the generation of animations with unlimited duration. Additionally, we introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
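The abstract describes the feedback-driven mechanism only at a high level, so the sketch below is one plausible reading rather than the authors' implementation: an image-based diffusion model generates the video in fixed-size chunks, and the last generated frame is fed back as conditioning for the next chunk, so per-chunk cost stays constant and generation can run indefinitely. All names here (`FeedbackAnimator`, the `diffusion_model` signature) are hypothetical and introduced purely for illustration.

```python
# Hypothetical sketch of a feedback-guided long-form animation loop.
# Not the authors' code; it assumes the feedback takes the form of
# conditioning each new chunk on the last frame of the previous chunk.

import torch


class FeedbackAnimator:
    def __init__(self, diffusion_model, chunk_size: int = 16):
        # `diffusion_model` is assumed to map (reference image, pose chunk,
        # feedback frames) -> a chunk of animated frames; this signature is
        # an assumption made for the sketch.
        self.model = diffusion_model
        self.chunk_size = chunk_size

    @torch.no_grad()
    def animate(self, reference: torch.Tensor, poses: torch.Tensor) -> torch.Tensor:
        # reference: (C, H, W) target-actor image; poses: (T, C, H, W) driving poses.
        frames = []
        feedback = reference.unsqueeze(0)  # bootstrap feedback from the reference image
        for start in range(0, poses.shape[0], self.chunk_size):
            pose_chunk = poses[start:start + self.chunk_size]
            chunk = self.model(reference, pose_chunk, feedback)  # (t, C, H, W)
            frames.append(chunk)
            feedback = chunk[-1:]  # feed the last generated frame into the next chunk
        return torch.cat(frames, dim=0)  # (T, C, H, W) long-form video
</antml>```

Because each chunk conditions only on a fixed number of feedback frames, memory and compute per step do not grow with total video length, which is consistent with the abstract's claim of unlimited-duration generation without additional training stages.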
Related papers
- High-Fidelity and Long-Duration Human Image Animation with Diffusion Transformer [17.388852038062705]
We propose a diffusion transformer (DiT)-based framework which focuses on generating high-fidelity and long-duration human animation videos. First, we design a set of hybrid implicit guidance signals and a sharpness guidance factor, enabling our framework to additionally incorporate detailed facial and hand features as guidance. Next, we incorporate the time-aware position shift fusion module, modify the input format within the DiT backbone, and refer to this mechanism as the Position Shift Adaptive Module.
arXiv Detail & Related papers (2025-12-26T07:36:48Z)
- Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation [16.692450893925148]
We present a novel streaming framework named Knot Forcing for real-time portrait animation. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences.
arXiv Detail & Related papers (2025-12-25T16:34:56Z)
- PersonaLive! Expressive Portrait Image Animation for Live Streaming [53.63615310186964]
PersonaLive is a novel diffusion-based framework for streaming, real-time portrait animation. We first adopt hybrid implicit signals, namely implicit facial representations and 3D implicit keypoints, to achieve expressive image-level motion control. Experiments demonstrate that PersonaLive achieves state-of-the-art performance with up to 7-22x speedup over prior diffusion-based portrait animation models.
arXiv Detail & Related papers (2025-12-12T03:24:40Z)
- Stable Video-Driven Portraits [52.008400639227034]
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
arXiv Detail & Related papers (2025-09-22T08:11:08Z)
- AvatarSync: Rethinking Talking-Head Animation through Phoneme-Guided Autoregressive Perspective [15.69417162113696]
AvatarSync is an autoregressive framework on phoneme representations that generates realistic talking-head animations from a single reference image. We show that AvatarSync outperforms existing talking-head animation methods in visual fidelity, temporal consistency, and computational efficiency.
arXiv Detail & Related papers (2025-09-15T15:34:02Z)
- HyperMotion: DiT-Based Pose-Guided Human Image Animation of Complex Motions [12.46263584777151]
We introduce the Open-HyperMotionX dataset and HyperMotionX Bench, which provide high-quality human pose annotations and curated video clips. We also propose a simple yet powerful DiT-based video generation baseline and design a spatial low-frequency enhanced RoPE. Our method significantly improves structural stability and appearance consistency in highly dynamic human motion sequences.
arXiv Detail & Related papers (2025-05-29T01:30:46Z)
- EvAnimate: Event-conditioned Image-to-Video Generation for Human Animation [58.41979933166173]
EvAnimate is the first method leveraging event streams as robust and precise motion cues for conditional human image animation. High-quality and temporally coherent animations are achieved through a dual-branch architecture. Experimental results show EvAnimate achieves high temporal fidelity and robust performance in scenarios where traditional video-derived cues fall short.
arXiv Detail & Related papers (2025-03-24T11:05:41Z)
- Long-Term TalkingFace Generation via Motion-Prior Conditional Diffusion Model [64.11605839142348]
We introduce the Motion-priors Conditional Diffusion Model (MCDM), which utilizes both archived and current clip motion priors to enhance motion prediction and ensure temporal consistency. We also release the TalkingFace-Wild dataset, a multilingual collection of over 200 hours of footage across 10 languages.
arXiv Detail & Related papers (2025-02-13T17:50:23Z)
- UniAnimate: Taming Unified Video Diffusion Models for Consistent Human Image Animation [53.16986875759286]
We present a UniAnimate framework to enable efficient and long-term human video generation.
We map the reference image along with the posture guidance and noise video into a common feature space.
We also propose a unified noise input that supports random noised input as well as first frame conditioned input.
arXiv Detail & Related papers (2024-06-03T10:51:10Z)
- HumMUSS: Human Motion Understanding using State Space Models [6.821961232645209]
We propose a novel attention-free model for human motion understanding building upon recent advancements in state space models.
Our model supports both offline and real-time applications.
For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches.
arXiv Detail & Related papers (2024-04-16T19:59:21Z)
- FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation [85.29772293776395]
We introduce FRESCO, which combines intra-frame correspondence with inter-frame correspondence to establish a more robust spatial-temporal constraint.
This enhancement ensures a more consistent transformation of semantically similar content across frames.
Our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video.
arXiv Detail & Related papers (2024-03-19T17:59:18Z)
- MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model [74.84435399451573]
This paper studies the human image animation task, which aims to generate a video of a certain reference identity following a particular motion sequence.
Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion.
We introduce MagicAnimate, a diffusion-based framework that aims at enhancing temporal consistency, preserving the reference image faithfully, and improving animation fidelity.
arXiv Detail & Related papers (2023-11-27T18:32:31Z)