Screen, Match, and Cache: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
- URL: http://arxiv.org/abs/2601.22160v1
- Date: Sat, 13 Dec 2025 08:45:03 GMT
- Title: Screen, Match, and Cache: A Training-Free Causality-Consistent Reference Frame Framework for Human Animation
- Authors: Jianan Wang, Nailei Hei, Li He, Huanzhen Wang, Aoxing Li, Haofen Wang, Yan Wang, Wenqiang Zhang
- Abstract summary: FrameCache is a training-free three-stage framework consisting of Screen, Cache, and Match. Experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability.
- Score: 44.20260674331104
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human animation aims to generate temporally coherent and visually consistent videos over long sequences, yet modeling long-range dependencies while preserving frame quality remains challenging. Inspired by the human ability to leverage past observations for interpreting ongoing actions, we propose FrameCache, a training-free three-stage framework consisting of Screen, Cache, and Match. In the Screen stage, a multi-dimensional, quality-aware mechanism with adaptive thresholds dynamically selects informative frames; the Cache stage maintains a reference pool using a dynamic replacement-hit strategy, preserving both diversity and relevance; and the Match stage extracts behavioral features to perform motion-consistent reference matching for coherent animation guidance. Extensive experiments on standard benchmarks demonstrate that FrameCache consistently improves temporal coherence and visual stability while integrating seamlessly with diverse baselines. Despite these encouraging results, further analysis reveals that its effectiveness depends on baseline temporal reasoning and real-synthetic consistency, motivating future work on compatibility conditions and adaptive cache mechanisms. Code will be made publicly available.
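The paper's code is not yet released; the following is a minimal sketch of how the three stages could fit together, assuming a gradient-energy quality score, a least-hit eviction rule, and cosine-similarity matching on downsampled intensity features. All of these choices, and the `FrameCache` class itself, are illustrative stand-ins rather than the authors' implementation.

```python
# Minimal sketch of a Screen/Cache/Match reference-frame pipeline.
# Quality scores, the adaptive threshold, and the replacement rule are
# illustrative assumptions; the paper's actual criteria may differ.
import numpy as np

class FrameCache:
    def __init__(self, capacity=8, momentum=0.9):
        self.capacity = capacity
        self.momentum = momentum
        self.threshold = 0.0          # adaptive screening threshold
        self.pool = []                # list of [feature, frame, hits]

    def _quality(self, frame):
        # Stand-in quality score: gradient energy (sharpness proxy).
        gy, gx = np.gradient(frame.astype(np.float32).mean(axis=-1))
        return float(np.mean(gx**2 + gy**2))

    def _feature(self, frame):
        # Stand-in behavioral feature: coarse downsampled intensity map.
        f = frame.astype(np.float32).mean(axis=-1)[::16, ::16].ravel()
        return f / (np.linalg.norm(f) + 1e-8)

    def screen(self, frame):
        # Screen: keep frames whose quality beats a running threshold.
        q = self._quality(frame)
        self.threshold = self.momentum * self.threshold + (1 - self.momentum) * q
        return q >= self.threshold

    def cache(self, frame):
        # Cache: insert screened frames; evict the least-hit reference
        # once the pool is full (a simple replacement-hit policy).
        if len(self.pool) >= self.capacity:
            self.pool.sort(key=lambda e: e[2])
            self.pool.pop(0)
        self.pool.append([self._feature(frame), frame, 0])

    def match(self, frame):
        # Match: return the cached reference most similar to the
        # current frame's behavioral feature (cosine similarity).
        if not self.pool:
            return None
        feat = self._feature(frame)
        best = max(self.pool, key=lambda e: float(feat @ e[0]))
        best[2] += 1                  # record the cache hit
        return best[1]
```

Per generated frame, a host animation pipeline would call `screen`, then `cache` for frames that pass, and `match` to retrieve the guidance reference fed to the backbone.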
Related papers
- Knot Forcing: Taming Autoregressive Video Diffusion Models for Real-time Infinite Interactive Portrait Animation [16.692450893925148]
We present a novel streaming framework named Knot Forcing for real-time portrait animation. Knot Forcing enables high-fidelity, temporally consistent, and interactive portrait animation over infinite sequences.
arXiv Detail & Related papers (2025-12-25T16:34:56Z)
- Learning Spatio-Temporal Feature Representations for Video-Based Gaze Estimation [50.05866669110754]
Video-based gaze estimation methods aim to capture the inherently temporal dynamics of human eye gaze from multiple image frames. We propose the Spatio-Temporal Gaze Network (ST-Gaze), a model that combines a CNN backbone with dedicated channel attention and self-attention modules. We show that ST-Gaze achieves state-of-the-art performance both with and without person-specific adaptation.
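From the one-line description alone, a plausible shape of the model is: per-frame CNN features, a channel-attention gate, then temporal self-attention across frames. The PyTorch sketch below is an assumption-laden reading of that description (toy backbone, dimensions, and a two-angle gaze head), not the ST-Gaze architecture.

```python
# Illustrative sketch: per-frame CNN features, channel attention, then
# temporal self-attention over frames. All modules and dimensions are
# assumptions made for the example, not the ST-Gaze release.
import torch
import torch.nn as nn

class GazeNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.backbone = nn.Sequential(          # tiny CNN stand-in
            nn.Conv2d(3, dim, 7, stride=4, padding=3), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.channel_attn = nn.Sequential(      # squeeze-excite style gate
            nn.Linear(dim, dim // 4), nn.ReLU(),
            nn.Linear(dim // 4, dim), nn.Sigmoid())
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True),
            num_layers=2)
        self.head = nn.Linear(dim, 2)           # yaw/pitch gaze angles

    def forward(self, clip):                    # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        f = self.backbone(clip.flatten(0, 1)).flatten(1)  # (B*T, dim)
        f = f * self.channel_attn(f)                      # channel gating
        f = self.temporal(f.view(b, t, -1))               # frame mixing
        return self.head(f[:, -1])                        # last-frame gaze

gaze = GazeNet()(torch.randn(2, 8, 3, 64, 64))            # -> (2, 2)
```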
arXiv Detail & Related papers (2025-12-19T15:15:58Z)
- Zero-Shot Video Translation and Editing with Frame Spatial-Temporal Correspondence [81.82643953694485]
We present FRESCO, which integrates intra-frame correspondence with inter-frame correspondence to formulate a more robust spatial-temporal constraint. Our method goes beyond attention guidance to explicitly optimize features, achieving high spatial-temporal consistency with the input video. We validate FRESCO on two zero-shot tasks: video-to-video translation and text-guided video editing.
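As a rough illustration of "explicitly optimizing features" under intra- and inter-frame correspondence, one could combine a within-frame smoothness term with a flow-warped consistency term, as in the stand-in loss below; `flow_warp` is a hypothetical helper, and neither term is FRESCO's actual formulation.

```python
# Stand-in spatial-temporal feature loss: an intra-frame term (nearby
# pixels within a frame should agree) plus an inter-frame term (features
# should follow motion across frames). Assumptions throughout.
import torch
import torch.nn.functional as F

def spatial_temporal_loss(feats, flow_warp):
    """feats: (T, C, H, W) decoder features; flow_warp(prev, t) is a
    hypothetical helper that warps frame t-1 features to frame t."""
    intra = 0.0
    for f in feats:                      # within-frame smoothness
        intra = intra + (f[:, :, 1:] - f[:, :, :-1]).abs().mean() \
                      + (f[:, 1:, :] - f[:, :-1, :]).abs().mean()
    inter = 0.0
    for t in range(1, len(feats)):       # cross-frame consistency
        inter = inter + F.mse_loss(feats[t], flow_warp(feats[t - 1], t))
    return intra + inter

# At each denoising step one would take a few gradient steps on the
# features with a loss like this before continuing sampling.
```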
arXiv Detail & Related papers (2025-12-03T15:51:11Z)
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract ...
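The frame-interleaved loop might look roughly like the sketch below: generate text, detect a frame request, fetch the requested frames, and continue. The `<frames a-b>` tag, the `model.generate` API, and `video.describe` are all hypothetical stand-ins for whatever interface FrameMind actually uses.

```python
# Hypothetical control loop for frame-interleaved reasoning: the model
# alternates textual reasoning turns with tool calls that fetch frames.
import re

def ficot_answer(model, video, question, max_turns=4):
    context = [f"Question: {question}"]
    for _ in range(max_turns):
        step = model.generate(context)                 # reasoning turn
        context.append(step)
        if "<answer>" in step:                         # done reasoning
            return step
        m = re.search(r"<frames (\d+)-(\d+)>", step)   # tool request
        if m:                                          # perception turn
            t0, t1 = int(m.group(1)), int(m.group(2))
            context.append(video.describe(t0, t1))     # frame evidence
    return model.generate(context + ["Give the final answer."])
```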
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
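One way to read "attending along point tracks" is: sample features at tracked points, apply temporal attention per track, and write the mixed features back. The layer below sketches that under assumed shapes and a crude nearest-pixel write-back; it is not the paper's Tracktention Layer.

```python
# Rough sketch of attention along point tracks. Shapes, the sampling
# scheme, and the write-back are assumptions made for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TrackAttention(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, feats, tracks):
        # feats: (T, C, H, W); tracks: (T, N, 2) in [-1, 1] xy coords
        T, C, H, W = feats.shape
        grid = tracks.unsqueeze(2)                   # (T, N, 1, 2)
        samp = F.grid_sample(feats, grid, align_corners=False)
        samp = samp.squeeze(-1).permute(2, 0, 1)     # (N, T, C) per track
        mixed, _ = self.attn(samp, samp, samp)       # attend over time
        out = feats.clone()
        # Nearest-pixel splat of the temporally mixed track features.
        xs = ((tracks[..., 0] + 1) / 2 * (W - 1)).round().long().clamp(0, W - 1)
        ys = ((tracks[..., 1] + 1) / 2 * (H - 1)).round().long().clamp(0, H - 1)
        for t in range(T):
            out[t, :, ys[t], xs[t]] = mixed[:, t].T  # (C, N) write-back
        return out
```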
arXiv Detail & Related papers (2025-03-25T17:58:48Z)
- STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules: intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
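A heavily simplified reading of the two modules would add learnable prompts to patch tokens within each frame and prepend prompt tokens to the frame sequence. Note that STOP's prompts are described as dynamic (input-conditioned), which the static parameters below do not capture; the class, shapes, and pooling are all assumptions.

```python
# Simplified (static) stand-in for intra-frame spatial prompting plus
# inter-frame temporal prompting over a frozen video encoder's tokens.
import torch
import torch.nn as nn

class STPrompt(nn.Module):
    def __init__(self, dim=768, patches=196, t_prompts=4):
        super().__init__()
        self.spatial = nn.Parameter(torch.zeros(1, 1, patches, dim))
        self.temporal = nn.Parameter(torch.zeros(1, t_prompts, dim))

    def forward(self, tokens):
        # tokens: (B, T, P, D) patch tokens; P must equal `patches`.
        b = tokens.shape[0]
        tokens = tokens + self.spatial               # intra-frame prompts
        frame_feats = tokens.mean(dim=2)             # (B, T, D) per frame
        prompts = self.temporal.expand(b, -1, -1)    # inter-frame prompts
        return torch.cat([prompts, frame_feats], 1)  # to temporal layers
```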
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion at inference time.
We employ compact models for real-time execution; to narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
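A common per-frame distillation recipe, which may or may not match the paper's specific designs, combines a temperature-scaled KL term against the teacher's per-pixel class distributions with standard cross-entropy on ground-truth labels:

```python
# Generic dense-prediction distillation loss: the compact student mimics
# the large teacher's per-pixel distributions on each frame. Temperature
# and weighting are conventional choices, not the paper's exact design.
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    # logits: (B, classes, H, W); labels: (B, H, W) class indices
    kd = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean") * T * T
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1 - alpha) * ce
```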
arXiv Detail & Related papers (2020-02-26T12:24:32Z)