Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models
- URL: http://arxiv.org/abs/2504.12626v3
- Date: Tue, 14 Oct 2025 23:28:39 GMT
- Title: Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models
- Authors: Lvmin Zhang, Shengqu Cai, Muyang Li, Gordon Wetzstein, Maneesh Agrawala
- Abstract summary: We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length. We show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules.
- Score: 63.99949971803903
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length, with more important frames having longer contexts. The frame importance can be measured using time proximity, feature similarity, or hybrid metrics. The packing method allows for inference with thousands of frames and training with relatively large batch sizes. We also present drift prevention methods to address observation bias (error accumulation), including early-established endpoints, adjusted sampling orders, and discrete history representation. Ablation studies validate the effectiveness of the anti-drifting methods in both single-directional video streaming and bi-directional video generation. Finally, we show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules.
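To make the packing idea concrete, the sketch below allocates a fixed token budget across past frames by time-proximity importance, squeezing older frames into fewer tokens. It is not the authors' implementation: the geometric decay schedule, the average-pooling stand-in for coarser patchification, the endpoint-first ordering, and all function names are assumptions made purely for illustration.

```python
# Minimal sketch of frame-context packing plus an endpoint-first sampling
# order. Not the authors' released code; the geometric schedule, the
# average-pooling stand-in for coarser patchification, and all names are
# illustrative assumptions.
import numpy as np

def pack_frame_context(frames, max_tokens, decay=2.0):
    """Fit past frames into a fixed token budget, favoring recent frames.

    frames     : list of (H, W, C) arrays, ordered oldest -> newest.
    max_tokens : fixed context length (in tokens) the model accepts.
    decay      : how fast per-frame budgets shrink toward the past.
    Returns a list of (tokens_i, C) arrays; total length ~= max_tokens.
    """
    n = len(frames)
    # Importance by time proximity: the newest frame has weight 1, and
    # each older frame has 1/decay of the next-newer frame's weight.
    weights = np.array([decay ** (i - (n - 1)) for i in range(n)])
    budgets = np.maximum(1, np.floor(max_tokens * weights / weights.sum())).astype(int)

    packed = []
    for frame, budget in zip(frames, budgets):
        h, w, c = frame.shape
        tokens = frame.reshape(h * w, c)
        if len(tokens) > budget:
            # Coarser "patchify": average groups of tokens so the frame
            # fits its budget (stands in for a larger patch kernel).
            group = -(-len(tokens) // budget)          # ceil division
            pad = (-len(tokens)) % group
            tokens = np.pad(tokens, ((0, pad), (0, 0)), mode="edge")
            tokens = tokens.reshape(-1, group, c).mean(axis=1)
        packed.append(tokens)
    return packed

def endpoint_first_order(num_sections):
    """One possible anti-drift schedule (an assumption for illustration):
    generate the last section first, then fill the remaining sections from
    oldest to newest, so every later step is anchored by an
    already-established endpoint instead of compounding its own errors."""
    return [num_sections - 1] + list(range(num_sections - 1))
```

With, for example, 16 latent frames of shape (30, 40, 16) and max_tokens=1536, the newest frame keeps roughly half the budget while the oldest frames shrink to a token or two each, so the total context stays bounded however long the history grows.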
Related papers
- Beyond Boundary Frames: Audio-Visual Semantic Guidance for Context-Aware Video Interpolation [14.00347197658315]
BBF is a context-aware video frame interpolation framework guided by audio/visual semantics. We show that BBF outperforms specialized state-of-the-art methods on both generic and audio-visual synchronized tasks.
arXiv Detail & Related papers (2025-12-03T09:22:13Z)
- Generative Inbetweening through Frame-wise Conditions-Driven Video Generation [63.43583844248389]
Generative inbetweening aims to generate intermediate frame sequences by utilizing two key frames as input. We propose a Frame-wise Conditions-driven Video Generation (FCVG) method that significantly enhances the temporal stability of interpolated video frames. Our FCVG demonstrates the capability to generate temporally stable videos using both linear and non-linear curves.
arXiv Detail & Related papers (2024-12-16T13:19:41Z)
- Continuous Video Process: Modeling Videos as Continuous Multi-Dimensional Processes for Video Prediction [43.16308241800144]
We introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. We establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101.
arXiv Detail & Related papers (2024-12-06T10:34:50Z)
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation.
We introduce a novel, bidirectional sampling strategy to address these off-manifold issues without requiring extensive re-noising or fine-tuning.
Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
- ZeroSmooth: Training-free Diffuser Adaptation for High Frame Rate Video Generation [81.90265212988844]
We propose a training-free video interpolation method for generative video models in a plug-and-play manner.
We transform a video model into a self-cascaded video diffusion model with the designed hidden state correction modules.
Our training-free method is even comparable to trained models supported by huge compute resources and large-scale datasets.
arXiv Detail & Related papers (2024-06-03T00:31:13Z)
- Aggregating Nearest Sharp Features via Hybrid Transformers for Video Deblurring [70.06559269075352]
We propose a video deblurring method that leverages both neighboring frames and existing sharp frames using hybrid Transformers for feature aggregation. To aggregate nearest sharp features from detected sharp frames, we utilize a global Transformer with multi-scale matching capability. Our proposed method outperforms state-of-the-art video deblurring methods as well as event-driven video deblurring methods in terms of quantitative metrics and visual quality.
arXiv Detail & Related papers (2023-09-13T16:12:11Z)
- Rerender A Video: Zero-Shot Text-Guided Video-to-Video Translation [93.18163456287164]
This paper proposes a novel text-guided video-to-video translation framework to adapt image models to videos.
Our framework achieves global style and local texture temporal consistency at a low cost.
arXiv Detail & Related papers (2023-06-13T17:52:23Z)
- TTVFI: Learning Trajectory-Aware Transformer for Video Frame Interpolation [50.49396123016185]
Video frame interpolation (VFI) aims to synthesize an intermediate frame between two consecutive frames.
We propose a novel Trajectory-aware Transformer for Video Frame Interpolation (TTVFI). Our method outperforms other state-of-the-art methods on four widely-used VFI benchmarks.
arXiv Detail & Related papers (2022-07-19T03:37:49Z)
- Optimizing Video Prediction via Video Frame Interpolation [53.16726447796844]
We present a new optimization framework for video prediction via video frame interpolation, inspired by the photo-realistic results of video frame interpolation. Our framework is based on optimization with a pretrained differentiable video frame interpolation module without the need for a training dataset.
Our approach outperforms other video prediction methods that require a large amount of training data or extra semantic information.
arXiv Detail & Related papers (2022-06-27T17:03:46Z)
- Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction.
We train the model in a manner where we randomly and independently mask all the past frames or all the future frames.
Our approach yields SOTA results across standard video prediction benchmarks, with computation times measured in 1-12 days.
arXiv Detail & Related papers (2022-05-19T20:58:05Z)
- SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning [40.556222166309524]
We present SwinBERT, an end-to-end transformer-based model for video captioning.
Our method adopts a video transformer to encode spatial-temporal representations that can adapt to variable lengths of video input.
Based on this model architecture, we show that video captioning can benefit significantly from more densely sampled video frames.
arXiv Detail & Related papers (2021-11-25T18:02:12Z)
- PDWN: Pyramid Deformable Warping Network for Video Interpolation [11.62213584807003]
We propose a lightweight but effective model, the Pyramid Deformable Warping Network (PDWN).
PDWN uses a pyramid structure to generate DConv offsets of the unknown middle frame with respect to the known frames through coarse-to-fine successive refinements.
Our method achieves better or on-par accuracy compared to state-of-the-art models on multiple datasets.
arXiv Detail & Related papers (2021-04-04T02:08:57Z)
- Deep Sketch-guided Cartoon Video Inbetweening [24.00033622396297]
We propose a framework to produce cartoon videos by fetching the color information from two inputs while following the animated motion guided by a user sketch.
By explicitly considering the correspondence between frames and the sketch, we can achieve higher quality results than other image synthesis methods.
arXiv Detail & Related papers (2020-08-10T14:22:04Z)
- Efficient Semantic Video Segmentation with Per-frame Inference [117.97423110566963]
In this work, we perform efficient semantic video segmentation in a per-frame fashion at inference time.
We employ compact models for real-time execution. To narrow the performance gap between compact models and large models, new knowledge distillation methods are designed.
arXiv Detail & Related papers (2020-02-26T12:24:32Z)