Pretraining Frame Preservation in Autoregressive Video Memory Compression
- URL: http://arxiv.org/abs/2512.23851v2
- Date: Sun, 04 Jan 2026 13:49:37 GMT
- Title: Pretraining Frame Preservation in Autoregressive Video Memory Compression
- Authors: Lvmin Zhang, Shengqu Cai, Muyang Li, Chong Zeng, Beijia Lu, Anyi Rao, Song Han, Gordon Wetzstein, Maneesh Agrawala
- Abstract summary: We present PFP, a neural network structure to compress long videos into short contexts. The baseline model can compress a 20-second video into a context of about 5k length, from which random frames can be retrieved with perceptually preserved appearances. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
- Score: 65.4614111198843
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We present PFP, a neural network structure that compresses long videos into short contexts, with an explicit pretraining objective to preserve the high-frequency details of single frames at arbitrary temporal positions. The baseline model can compress a 20-second video into a context of about 5k length, from which random frames can be retrieved with perceptually preserved appearances. Such pretrained models can be directly fine-tuned as memory encoders for autoregressive video models, enabling long-history memory with low context cost and relatively low fidelity loss. We evaluate the framework with ablative settings and discuss the trade-offs of possible neural architecture designs.
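As a rough illustration of the pretraining objective described above (not the paper's actual architecture), the sketch below compresses a stack of frames into a short learned context and trains the model to reconstruct one randomly chosen frame from that context plus its timestamp. All module names, dimensions, and the L1 stand-in for the perceptual loss are assumptions.

```python
# Minimal sketch of a PFP-style pretraining objective (hypothetical names and
# shapes throughout). Idea: compress T frames into a short learned context,
# then reconstruct one randomly chosen frame from the context and its timestamp.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FrameContextCompressor(nn.Module):
    def __init__(self, frame_dim=3 * 64 * 64, ctx_len=512, d_model=256):
        super().__init__()
        self.embed = nn.Linear(frame_dim, d_model)           # per-frame embedding
        self.ctx_queries = nn.Parameter(torch.randn(ctx_len, d_model))
        self.compress = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.time_embed = nn.Linear(1, d_model)              # query timestamp
        self.retrieve = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.decode = nn.Linear(d_model, frame_dim)          # back to pixels

    def forward(self, frames, t_query):
        # frames: (B, T, frame_dim); t_query: (B, 1) in [0, 1]
        B = frames.shape[0]
        tokens = self.embed(frames)                          # (B, T, d)
        q = self.ctx_queries.expand(B, -1, -1)               # (B, ctx_len, d)
        ctx, _ = self.compress(q, tokens, tokens)            # short context
        tq = self.time_embed(t_query).unsqueeze(1)           # (B, 1, d)
        out, _ = self.retrieve(tq, ctx, ctx)                 # query the context
        return self.decode(out.squeeze(1))                   # (B, frame_dim)


model = FrameContextCompressor()
frames = torch.randn(2, 120, 3 * 64 * 64)                    # 2 clips, 120 frames
t = torch.rand(2, 1)                                         # random temporal positions
idx = (t.squeeze(1) * 119).long()
target = frames[torch.arange(2), idx]                        # the frame to preserve
recon = model(frames, t)
loss = F.l1_loss(recon, target)   # the paper uses perceptual terms; L1 stands in
loss.backward()
```

Sampling the query timestamp uniformly during training is what pushes the compressed context to preserve frames at arbitrary temporal positions rather than only recent ones.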
Related papers
- Autoregressive Video Autoencoder with Decoupled Temporal and Spatial Context [8.458436768725212]
Video autoencoders compress videos into compact latent representations for efficient reconstruction. We propose Autoregressive Video Autoencoder (ARVAE), which compresses and reconstructs each frame conditioned on its predecessor in an autoregressive manner. ARVAE achieves superior reconstruction quality with extremely lightweight models and small-scale training data.
arXiv Detail & Related papers (2025-12-12T05:40:01Z)
- Episodic Memory Representation for Long-form Video Understanding [52.33907540905242]
Large Video Language Models excel at general video understanding but struggle with long-form context window limits. We introduce Video-EM, a training-free framework inspired by the principles of human memory. Video-EM achieves performance gains of 4-9 percent over respective baselines while utilizing fewer frames.
arXiv Detail & Related papers (2025-08-13T04:33:07Z)
- FRAME: Pre-Training Video Feature Representations via Anticipation and Memory [55.046881477209695]
FRAME is a self-supervised video frame encoder tailored for dense video understanding. It learns to predict current and future DINO patch features from past and present RGB frames. It consistently outperforms image encoders and existing self-supervised video models.
arXiv Detail & Related papers (2025-06-05T19:44:47Z)
- Frame Context Packing and Drift Prevention in Next-Frame-Prediction Video Diffusion Models [63.99949971803903]
We present a neural network structure, FramePack, to train next-frame (or next-frame-section) prediction models for video generation. FramePack compresses input frame contexts with frame-wise importance so that more frames can be encoded within a fixed context length. We show that existing video diffusion models can be finetuned with FramePack, and analyze the differences between different packing schedules (a schematic sketch of such a schedule appears after this list).
arXiv Detail & Related papers (2025-04-17T04:02:31Z)
- Long-Context Autoregressive Video Modeling with Next-Frame Prediction [17.710915002557996]
Long-context video modeling is essential for enabling generative models to function as world simulators. While training directly on long videos is a natural solution, the rapid growth of vision tokens makes it computationally prohibitive. We propose Frame AutoRegressive (FAR), which models temporal dependencies between continuous frames, converges faster than video diffusion transformers, and outperforms token-level autoregressive models.
arXiv Detail & Related papers (2025-03-25T03:38:06Z)
- UAR-NVC: A Unified AutoRegressive Framework for Memory-Efficient Neural Video Compression [32.46672370851282]
Implicit Neural Representations (INRs) have demonstrated significant potential in video compression by representing videos as neural networks. We present a novel understanding of INR models from an autoregressive (AR) perspective and introduce a Unified AutoRegressive Framework for memory-efficient Neural Video Compression (UAR-NVC). UAR-NVC integrates timeline-based and INR-based neural video compression under a unified autoregressive paradigm.
arXiv Detail & Related papers (2025-03-04T15:54:57Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. We expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- Beyond Short Clips: End-to-End Video-Level Learning with Collaborative Memories [56.91664227337115]
We introduce a collaborative memory mechanism that encodes information across multiple sampled clips of a video at each training iteration. This enables the learning of long-range dependencies beyond a single clip. Our proposed framework is end-to-end trainable and significantly improves the accuracy of video classification at a negligible computational overhead.
arXiv Detail & Related papers (2021-04-02T18:59:09Z)
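The frame-wise importance packing described in the FramePack entry above can be made concrete with a small sketch. The geometric schedule, token counts, and function name below are illustrative assumptions, not the paper's actual schedule; the point is only that per-frame token budgets shrink with frame age so the total context stays roughly bounded.

```python
# Hypothetical sketch of a frame-wise packing schedule in the spirit of
# FramePack: older frames get progressively stronger compression so that the
# total token count stays within a roughly fixed context budget.
def packing_schedule(num_frames: int, base_tokens: int = 1536,
                     shrink: int = 2, min_tokens: int = 16) -> list[int]:
    """Tokens allotted per frame, newest frame first."""
    tokens = []
    for age in range(num_frames):  # age 0 = most recent frame
        alloc = max(base_tokens // (shrink ** age), min_tokens)
        tokens.append(alloc)
    return tokens


schedule = packing_schedule(num_frames=12)
print(schedule)       # [1536, 768, 384, ..., 16]
print(sum(schedule))  # stays near 2 * base_tokens, plus the min-token floor
```

With shrink=2 the allocations form a geometric series, so even a long history costs only about twice the token budget of the newest frame, aside from the min-token floor that very old frames accumulate.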