Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation
- URL: http://arxiv.org/abs/2512.21734v2
- Date: Mon, 29 Dec 2025 03:22:01 GMT
- Title: Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation
- Authors: Steven Xiao, Xindi Zhang, Dechao Meng, Qi Wang, Peng Zhang, Bang Zhang
- Abstract summary: We present a novel streaming framework named Knot Forcing for real-time portrait animation. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences.
- Score: 16.692450893925148
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time portrait animation is essential for interactive applications such as virtual assistants and live avatars, requiring high visual fidelity, temporal coherence, ultra-low latency, and responsive control from dynamic inputs like reference images and driving signals. While diffusion-based models achieve strong quality, their non-causal nature hinders streaming deployment. Causal autoregressive video generation approaches enable efficient frame-by-frame generation but suffer from error accumulation, motion discontinuities at chunk boundaries, and degraded long-term consistency. In this work, we present a novel streaming framework named Knot Forcing for real-time portrait animation that addresses these challenges through three key designs: (1) a chunk-wise generation strategy with global identity preservation via cached KV states of the reference image and local temporal modeling using sliding window attention; (2) a temporal knot module that overlaps adjacent chunks and propagates spatio-temporal cues via image-to-video conditioning to smooth inter-chunk motion transitions; and (3) a "running ahead" mechanism that dynamically updates the reference frame's temporal coordinate during inference, keeping its semantic context ahead of the current rollout frame to support long-term coherence. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences, achieving real-time performance with strong visual stability on consumer-grade GPUs.
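The three designs above map directly onto a streaming generation loop. Below is a minimal, self-contained sketch of that structure in NumPy; all names (`denoise_chunk`, `stream`) are hypothetical, the denoiser is a toy stand-in for the actual diffusion model, and the linear blend is a crude substitute for the paper's image-to-video knot conditioning:

```python
# Structural sketch of a Knot-Forcing-style streaming loop (hypothetical names;
# denoise_chunk is a toy stand-in, not the paper's diffusion model).
import numpy as np

CHUNK, OVERLAP, WINDOW = 8, 2, 16  # frames per chunk, knot overlap, attention window

def denoise_chunk(noisy, context, ref_kv, ref_time):
    """Stand-in denoiser: attends to the cached reference KV (global identity)
    and a sliding window of past frames (local temporal context)."""
    return 0.5 * noisy + 0.5 * context.mean(axis=0) + 0.0 * (ref_kv + ref_time)

def stream(ref_frame, num_chunks):
    ref_kv = ref_frame                  # (1) reference KV cached once, reused forever
    history = [ref_frame]               # sliding window of recent frames
    out, t = [], 0
    for _ in range(num_chunks):
        ref_time = t + CHUNK + OVERLAP  # (3) "running ahead": the reference's temporal
                                        # coordinate stays ahead of the rollout frame
        context = np.stack(history[-WINDOW:])
        noisy = np.random.randn(CHUNK, *ref_frame.shape)
        chunk = denoise_chunk(noisy, context, ref_kv, ref_time)
        if out:
            # (2) temporal knot: blend the overlap with the previous chunk's tail
            # so motion stays continuous across the chunk boundary.
            alpha = np.linspace(0.0, 1.0, OVERLAP)[:, None]
            chunk[:OVERLAP] = (1 - alpha) * np.stack(out[-OVERLAP:]) + alpha * chunk[:OVERLAP]
        out.extend(chunk)
        history.extend(chunk)
        t += CHUNK
    return np.stack(out)

print(stream(np.zeros(4), num_chunks=3).shape)  # (24, 4) toy "frames"
```

In the real system the knot would be enforced through conditioning rather than pixel blending, but the loop shape is the same.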
Related papers
- StableWorld: Towards Stable and Consistent Long Interactive Video Generation [45.597087309159456]
We investigate the overlooked challenge of stability and temporal consistency in interactive video generation. We propose a simple yet effective method, StableWorld, built on a Dynamic Frame Eviction Mechanism. StableWorld prevents cumulative drift at its source, leading to more stable and temporally consistent interactive generation.
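A rough idea of what a dynamic frame-eviction buffer could look like; the drift score and API below are placeholders, not StableWorld's actual criterion:

```python
# Toy context buffer with drift-based eviction (placeholder scoring).
import numpy as np

def evict(buffer, anchor, capacity, keep_recent=4):
    """Drop the frames that drifted furthest from the anchor, always keeping
    the most recent ones for motion continuity (assumes capacity > keep_recent)."""
    if len(buffer) <= capacity:
        return buffer
    old, recent = buffer[:-keep_recent], buffer[-keep_recent:]
    drift = [float(np.linalg.norm(f - anchor)) for f in old]    # placeholder score
    keep = sorted(np.argsort(drift)[: capacity - keep_recent])  # least-drifted first
    return [old[i] for i in keep] + recent

buf = [np.random.randn(4) for _ in range(12)]
buf = evict(buf, anchor=buf[0], capacity=8)  # context stays bounded as generation runs
```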
arXiv Detail & Related papers (2026-01-21T18:59:02Z)
- Screen, Match, and Cache: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation [44.20260674331104]
FrameCache is a training-free three-stage framework consisting of Screen, Match, and Cache. Experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability.
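As a rough illustration of a training-free three-stage pipeline of this kind (all functions and metrics below are hypothetical placeholders, not FrameCache's actual ones):

```python
# Hypothetical Screen / Match / Cache pipeline with placeholder metrics.
import numpy as np

def screen(frames, k=8):
    """Screen: keep the k most informative candidates (variance as a crude proxy)."""
    return sorted(frames, key=lambda f: float(np.var(f)), reverse=True)[:k]

def match(cache, query):
    """Match: retrieve the cached frame closest to the current query frame."""
    return min(cache, key=lambda f: float(np.linalg.norm(f - query)))

cache = screen([np.random.randn(8, 8) for _ in range(32)])  # Cache: stored once
ref = match(cache, np.random.randn(8, 8))                   # reused at every step
```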
arXiv Detail & Related papers (2025-12-13T08:45:03Z)
- Live Avatar: Streaming Real-time Audio-Driven Avatar Generation with Infinite Length [57.458450695137664]
We present Live Avatar, an algorithm-system co-designed framework for efficient, high-fidelity, and infinite-length avatar generation. Live Avatar is the first to achieve practical, real-time, high-fidelity avatar generation at this scale.
arXiv Detail & Related papers (2025-12-04T11:11:24Z)
- TalkingPose: Efficient Face and Gesture Animation with Feedback-guided Diffusion Model [18.910745982208965]
TalkingPose is a novel diffusion-based framework for producing temporally consistent human upper-body animations. We introduce a feedback-driven mechanism built upon image-based diffusion models to ensure continuous motion and enhance temporal coherence. We also introduce a comprehensive, large-scale dataset to serve as a new benchmark for human upper-body animation.
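One way such a feedback loop can be arranged, with a toy stand-in for the image-diffusion sampler (hypothetical names; not TalkingPose's implementation):

```python
# Feedback-guided rollout: the last output conditions the next generation step.
import numpy as np

def generate(cond, feedback, w=0.3):
    frame = cond + np.random.randn(*cond.shape)  # stand-in for a diffusion sampler
    return (1 - w) * frame + w * feedback        # feedback pulls toward continuity

prev = np.zeros((8, 8))
for cond in [np.random.randn(8, 8) for _ in range(5)]:  # driving signals
    prev = generate(cond, prev)
```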
arXiv Detail & Related papers (2025-11-30T14:26:24Z)
- MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not support on-the-fly inference. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z)
- Lookahead Anchoring: Preserving Character Identity in Audio-Driven Human Animation [75.71558917038838]
Lookahead Anchoring prevents identity drift during temporal autoregressive generation. It transforms keyframes from fixed boundaries into directional beacons. It also enables self-key-framing, where the reference image serves as the lookahead target.
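The idea resembles conditioning on a target placed ahead of the current frame; a toy sketch under that reading (hypothetical names, stand-in generator):

```python
# Lookahead-style rollout: the reference is injected at a future temporal
# coordinate, acting as a beacon the generation moves toward.
import numpy as np

def rollout(model, ref, steps, horizon=16):
    frames, prev = [], ref
    for t in range(steps):
        anchor_t = t + horizon             # anchor stays ahead, never a hard boundary
        prev = model(prev, ref, anchor_t)  # condition on (reference, future coordinate)
        frames.append(prev)
    return frames

toy_model = lambda prev, ref, at: 0.9 * prev + 0.1 * ref  # stand-in; ignores `at`
out = rollout(toy_model, np.ones(4), steps=8)
```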
arXiv Detail & Related papers (2025-10-27T17:50:19Z)
- LILAC: Long-sequence Incremental Low-latency Arbitrary Motion Stylization via Streaming VAE-Diffusion with Causal Decoding [5.946860384629338]
LILAC builds upon a recent high-performing offline framework for arbitrary motion stylization and extends it to an online setting through a latent-space streaming architecture with a sliding-window causal design. This architecture enables real-time arbitrary stylization without relying on future frames or modifying the diffusion model architecture.
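A minimal sketch of sliding-window causal decoding over a latent stream (the `stream_stylize` stand-in is hypothetical; LILAC's VAE-diffusion stack is not reproduced here):

```python
# Sliding-window causal decoding: each output sees only a bounded past window.
from collections import deque
import numpy as np

def stream_stylize(latents, window=8):
    ctx = deque(maxlen=window)      # past latents only; no future frames
    for z in latents:
        ctx.append(z)
        yield np.mean(ctx, axis=0)  # stand-in for the causal decoder

outs = list(stream_stylize(np.random.randn(20, 4)))  # processes the stream online
```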
arXiv Detail & Related papers (2025-10-17T07:45:43Z)
- Towards Robust and Generalizable Continuous Space-Time Video Super-Resolution with Events [71.2439653098351]
Continuous space-time video super-resolution (STVSR) has garnered increasing interest for its capability to reconstruct high-resolution and high-frame-rate videos at arbitrary temporal scales. We present EvEnhancer, a novel approach that marries the unique properties of high temporal resolution and high dynamic range encapsulated in event streams. Our method achieves state-of-the-art performance on both synthetic and real-world datasets, while maintaining generalizability at OOD scales.
arXiv Detail & Related papers (2025-10-04T15:23:07Z)
- Stable Video-Driven Portraits [52.008400639227034]
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
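A sketch of how masked facial regions might be turned into a control signal (hypothetical masks and function; the paper's conditioning pathway is more involved):

```python
# Keep only the eyes/nose/mouth pixels of the driving frame as motion cues.
import numpy as np

def motion_cue(driving_frame, region_masks):
    m = np.zeros_like(driving_frame, dtype=bool)
    for region in region_masks:  # boolean masks, e.g. eyes, nose, mouth
        m |= region
    return np.where(m, driving_frame, 0.0)

frame = np.random.randn(64, 64)
eyes = np.zeros((64, 64), dtype=bool); eyes[20:28, 16:48] = True
cue = motion_cue(frame, [eyes])  # everything outside the masks is zeroed
```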
arXiv Detail & Related papers (2025-09-22T08:11:08Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
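A toy version of attending along point tracks: features are gathered at each track's per-frame locations and pooled over time (simple averaging stands in for real attention; shapes are assumptions):

```python
# Gather features along point tracks, then pool temporally per track.
import numpy as np

def tracktention(feats, tracks):
    """feats: (T, H, W, C) feature maps; tracks: (N, T, 2) integer (y, x) points."""
    T = feats.shape[0]
    gathered = np.stack(
        [feats[t, tracks[:, t, 0], tracks[:, t, 1]] for t in range(T)], axis=1
    )                             # (N, T, C): one feature per track per frame
    return gathered.mean(axis=1)  # (N, C): averaging stands in for attention

pooled = tracktention(np.random.randn(5, 32, 32, 16), np.random.randint(0, 32, (10, 5, 2)))
```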
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
- STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
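Roughly, prompting of this kind prepends learnable tokens per frame and across frames; a shape-level sketch (hypothetical layout, not the STOP implementation, whose prompts are generated dynamically):

```python
# Prepend a spatial prompt token per frame and a temporal prompt token per sequence.
import numpy as np

def add_prompts(tokens, spatial_prompt, temporal_prompt):
    """tokens: (T, P, C); prompts: (1, C) each."""
    T = tokens.shape[0]
    sp = np.repeat(spatial_prompt[None], T, axis=0)   # (T, 1, C) intra-frame prompt
    tp = np.repeat(temporal_prompt[None], T, axis=0)  # (T, 1, C) inter-frame prompt
    return np.concatenate([tp, sp, tokens], axis=1)   # (T, P + 2, C)

out = add_prompts(np.zeros((4, 49, 64)), np.zeros((1, 64)), np.zeros((1, 64)))
```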
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
- Intrinsic Temporal Regularization for High-resolution Human Video Synthesis [59.54483950973432]
Temporal consistency is crucial for extending image processing pipelines to the video domain. We propose an effective intrinsic temporal regularization scheme, where an intrinsic confidence map is estimated via the frame generator to regulate motion estimation. We apply our intrinsic temporal regularization to a single-image generator, leading to a powerful "INTERnet" capable of generating $512\times512$ resolution human action videos.
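The regularizer can be read as a warping loss whose per-pixel weight is the estimated confidence; a minimal sketch under that reading (hypothetical function, placeholder inputs):

```python
# Confidence-weighted temporal consistency loss between consecutive frames.
import numpy as np

def intrinsic_temporal_loss(frame_t, warped_prev, confidence):
    """confidence in [0, 1], same spatial shape as the frames; low-confidence
    (e.g. occluded) regions contribute less to the warping error."""
    return float(np.mean(confidence * (frame_t - warped_prev) ** 2))
```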
arXiv Detail & Related papers (2020-12-11T05:29:45Z)