Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
- URL: http://arxiv.org/abs/2503.10704v1
- Date: Wed, 12 Mar 2025 15:32:44 GMT
- Title: Error Analyses of Auto-Regressive Video Diffusion Models: A Unified Framework
- Authors: Jing Wang, Fengzhuo Zhang, Xiaoli Li, Vincent Y. F. Tan, Tianyu Pang, Chao Du, Aixin Sun, Zhuoran Yang
- Abstract summary: A variety of Auto-Regressive Video Diffusion Models (ARVDMs) have achieved remarkable successes in generating realistic long-form videos. We develop theoretical underpinnings for these models and use our insights to improve the performance of existing models.
- Score: 127.61297651993561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: A variety of Auto-Regressive Video Diffusion Models (ARVDMs) have achieved remarkable successes in generating realistic long-form videos. However, theoretical analyses of these models remain scant. In this work, we develop theoretical underpinnings for these models and use our insights to improve the performance of existing models. We first develop Meta-ARVDM, a unified framework of ARVDMs that subsumes most existing methods. Using Meta-ARVDM, we analyze the KL-divergence between the videos generated by Meta-ARVDM and the true videos. Our analysis uncovers two important phenomena inherent to ARVDMs -- error accumulation and the memory bottleneck. By deriving an information-theoretic impossibility result, we show that the memory bottleneck phenomenon cannot be avoided. To mitigate the memory bottleneck, we design various network structures that explicitly use more past frames. By compressing the frames, we also achieve a significantly improved trade-off between mitigating the memory bottleneck and inference efficiency. Experimental results on DMLab and Minecraft validate the efficacy of our methods. Our experiments also demonstrate a Pareto frontier between error accumulation and the memory bottleneck across different methods.
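As a reading aid, the error-accumulation claim can be written as a schematic chunk-wise bound. The display below is an illustrative sketch in our own notation (R autoregression steps; per-step score, discretization, and initialization errors; a conditioning window of c past frames), not the paper's exact theorem; in particular, the conditional mutual-information term is only one natural way to formalize the information lost by dropping older frames.

```latex
% Schematic only: per-step errors sum over the R autoregression steps
% (error accumulation), and a mutual-information term captures what the
% model cannot recover because it conditions on just the last c frames
% (memory bottleneck). Our notation, not the paper's exact statement.
\begin{align*}
\mathrm{KL}\bigl(q(x^{1:R}) \,\|\, p(x^{1:R})\bigr)
\;\lesssim\; \sum_{r=1}^{R}
  \Bigl( \varepsilon^{(r)}_{\mathrm{score}}
       + \varepsilon^{(r)}_{\mathrm{disc}}
       + \varepsilon^{(r)}_{\mathrm{init}} \Bigr)
\;+\; \sum_{r=1}^{R}
  I\bigl( x^{1:r-c-1};\, x^{r} \,\big|\, x^{r-c:r-1} \bigr)
\end{align*}
```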
Related papers
- Temporal-Consistent Video Restoration with Pre-trained Diffusion Models [51.47188802535954]
Video restoration (VR) aims to recover high-quality videos from degraded ones.
Recent zero-shot VR methods using pre-trained diffusion models (DMs) suffer from approximation errors during reverse diffusion and insufficient temporal consistency.
We present a novel Maximum a Posteriori (MAP) framework that directly parameterizes video frames in the seed space of DMs, eliminating approximation errors.
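A minimal sketch of the seed-space idea, under our own assumptions: rather than correcting each reverse-diffusion step, treat the initial noise (the seed) as the free variable and maximize a posterior that combines data fidelity with a Gaussian prior on the seed. The names `decode_with_dm` and `degrade`, the loss weights, and the use of plain gradient-based optimization are all hypothetical, not the paper's API.

```python
import torch

def map_restore(y_degraded, decode_with_dm, degrade, steps=200, lr=0.05, prior_w=1e-3):
    """Hedged sketch of MAP restoration in a diffusion model's seed space.

    y_degraded:     observed low-quality video, shape (T, C, H, W)
    decode_with_dm: deterministic sampler mapping a seed z to a clean video
                    (e.g., a DDIM-style decoder); assumed differentiable here
    degrade:        known degradation operator applied to candidate frames
    """
    z = torch.randn_like(y_degraded, requires_grad=True)  # seed = optimization variable
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        x = decode_with_dm(z)                               # candidate clean video
        fidelity = (degrade(x) - y_degraded).pow(2).mean()  # data term
        prior = prior_w * z.pow(2).mean()                   # Gaussian prior on the seed
        (fidelity + prior).backward()
        opt.step()
    return decode_with_dm(z.detach())
```

The appeal of this formulation is that every candidate video is produced by the unmodified sampler itself, which is how parameterizing in seed space can avoid per-step approximation errors.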
arXiv Detail & Related papers (2025-03-19T03:41:56Z)
- One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step.
To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration.
Our method achieves strong performance on both full- and no-reference metrics.
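The single-step idea is easy to contrast with a conventional sampler; the snippet below is a hedged sketch (the fixed timestep `t_star`, the model signature, and the placeholder multi-step update rule are our assumptions, not the paper's code).

```python
import torch

@torch.no_grad()
def deblur_multi_step(model, x, timesteps):
    # Conventional sampler: one network call per denoising step.
    for t in reversed(timesteps):
        x = model(x, torch.tensor([t]))  # placeholder update rule
    return x

@torch.no_grad()
def deblur_one_step(model, x_blurry):
    # One-step variant: the distilled model maps the blurry input
    # (treated as the noisy state) straight to a clean estimate.
    t_star = torch.tensor([999])         # illustrative fixed timestep
    return model(x_blurry, t_star)
```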
arXiv Detail & Related papers (2025-03-09T09:39:57Z)
- VACT: A Video Automatic Causal Testing System and a Benchmark [55.53300306960048]
VACT is an **automated** framework for modeling, evaluating, and measuring the causal understanding of VGMs in real-world scenarios.
We introduce multi-level causal evaluation metrics to provide a detailed analysis of the causal performance of VGMs.
arXiv Detail & Related papers (2025-03-08T10:54:42Z)
- Ca2-VDM: Efficient Autoregressive Video Diffusion Model with Causal Generation and Cache Sharing [66.66090399385304]
Ca2-VDM is an efficient autoregressive VDM with Causal generation and Cache sharing.
For causal generation, it introduces unidirectional feature computation, which ensures that the cache of conditional frames can be precomputed in previous autoregression steps.
For cache sharing, it shares the cache across all denoising steps to avoid the huge cache storage cost.
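A hedged sketch of how the two mechanisms compose, with hypothetical names: because attention is unidirectional, the key/value tensors of already-generated frames never change, so they can be computed once per autoregression step (causal generation) and then reused unchanged at every denoising step of the next chunk (cache sharing).

```python
import torch

class CachedCausalAttention(torch.nn.Module):
    """Hedged sketch of causal generation + cache sharing (our names, not the
    paper's code). Past-frame K/V are computed once per autoregression step
    and reused across ALL denoising steps of the next chunk."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)
        self.kv_cache = None  # (keys, values) for already-generated frames

    def update_cache(self, past_tokens):
        # Called once per autoregression step, after a chunk is finalized.
        k = v = past_tokens
        if self.kv_cache is None:
            self.kv_cache = (k, v)
        else:
            self.kv_cache = (torch.cat([self.kv_cache[0], k], dim=1),
                             torch.cat([self.kv_cache[1], v], dim=1))

    def forward(self, new_tokens):
        # Reused unchanged at every denoising step of the current chunk:
        # new tokens attend to the shared cache plus themselves, and the
        # cached frames are never recomputed.
        if self.kv_cache is None:        # no conditioning frames yet
            k = v = new_tokens
        else:
            k = torch.cat([self.kv_cache[0], new_tokens], dim=1)
            v = torch.cat([self.kv_cache[1], new_tokens], dim=1)
        out, _ = self.attn(new_tokens, k, v)
        return out
```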
arXiv Detail & Related papers (2024-11-25T13:33:41Z)
- Investigating Memorization in Video Diffusion Models [58.70363256771246]
Diffusion models, widely used for image and video generation, face a significant limitation: the risk of memorizing and reproducing training data during inference.
We first formally define the two types of memorization in VDMs (content memorization and motion memorization) in a practical way.
We then introduce new metrics specifically designed to separately assess content and motion memorization in VDMs.
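Illustrative only: the paper defines its own metrics, but the content/motion split can be pictured as comparing raw appearance versus comparing temporal differences. The cosine-similarity proxies below are our assumptions, not the paper's definitions.

```python
import torch

def content_similarity(gen_video, train_video):
    # Content proxy: appearance match between aligned frames, shape (T, C, H, W).
    # A real metric would use learned features and nearest-neighbour search
    # over the full training set rather than raw pixels.
    g = gen_video.flatten(1)
    t = train_video.flatten(1)
    return torch.nn.functional.cosine_similarity(g, t, dim=1).mean()

def motion_similarity(gen_video, train_video):
    # Motion proxy: compare temporal differences instead of raw frames, so a
    # video can score high here (copied motion) while its appearance differs,
    # and vice versa -- which is what lets the two be assessed separately.
    g_motion = gen_video[1:] - gen_video[:-1]
    t_motion = train_video[1:] - train_video[:-1]
    return torch.nn.functional.cosine_similarity(
        g_motion.flatten(1), t_motion.flatten(1), dim=1).mean()
```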
arXiv Detail & Related papers (2024-10-29T02:34:06Z)
- Patch Spatio-Temporal Relation Prediction for Video Anomaly Detection [19.643936110623653]
Video Anomaly Detection (VAD) aims to identify abnormalities within a specific context and timeframe.
Recent deep learning-based VAD models have shown promising results by generating high-resolution frames.
We propose a self-supervised learning approach for VAD through an inter-patch relationship prediction task.
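A minimal sketch of an inter-patch relation pretext task as we read it: sample a reference patch and a neighbouring patch and ask the model to predict their relative position; patches whose relations the trained model mispredicts would then flag anomalies. The 9-way spatial label space is our simplification (the paper's task is spatio-temporal).

```python
import torch

def sample_patch_pair(video, patch=16):
    """video: (T, C, H, W), with H and W assumed > 3 * patch.
    Returns two patches from the same frame plus a label encoding their
    relative spatial position on an illustrative 3x3 neighbour grid."""
    T, C, H, W = video.shape
    t = torch.randint(0, T, (1,)).item()
    y = torch.randint(patch, H - 2 * patch, (1,)).item()
    x = torch.randint(patch, W - 2 * patch, (1,)).item()
    dy, dx = torch.randint(-1, 2, (2,)).tolist()   # neighbour offset in {-1, 0, 1}
    p1 = video[t, :, y:y + patch, x:x + patch]
    p2 = video[t, :, y + dy * patch:y + (dy + 1) * patch,
                     x + dx * patch:x + (dx + 1) * patch]
    label = (dy + 1) * 3 + (dx + 1)                # one of 9 relative positions
    return p1, p2, label
```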
arXiv Detail & Related papers (2024-03-28T03:07:16Z)
- Model Inversion Attack via Dynamic Memory Learning [41.742953947551364]
Model Inversion (MI) attacks aim to recover the private training data from the target model.
Recent advances in generative adversarial models have rendered them particularly effective in MI attacks.
We propose a novel Dynamic Memory Model Inversion Attack (DMMIA) to leverage historically learned knowledge.
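A hedged sketch of the "historically learned knowledge" idea: keep a memory bank of features from earlier reconstructions and score new candidates against it as an extra term in the inversion objective. The class and loss form below are our assumptions, not the paper's formulation.

```python
import torch

class DynamicMemory:
    """Illustrative memory bank for a model-inversion attack: stores features
    of past reconstructions and scores new candidates against them."""

    def __init__(self, max_items=256):
        self.bank = []                 # detached feature vectors, each (D,)
        self.max_items = max_items

    def add(self, feat):
        self.bank.append(feat.detach())
        self.bank = self.bank[-self.max_items:]    # keep the most recent items

    def memory_score(self, feat):
        if not self.bank:
            return feat.new_zeros(())
        mem = torch.stack(self.bank)               # (N, D)
        sims = torch.nn.functional.cosine_similarity(mem, feat.unsqueeze(0), dim=1)
        return sims.max()                          # similarity to closest memory

# Sketch of one attack step: maximize target-class confidence while adding
# the memory term as an extra signal (its sign and weight are design choices):
#   loss = -log_prob_target(x) + lam * memory.memory_score(feature_extractor(x))
```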
arXiv Detail & Related papers (2023-08-24T02:32:59Z)
- Unsupervised Video Anomaly Detection with Diffusion Models Conditioned on Compact Motion Representations [17.816344808780965]
The unsupervised video anomaly detection (VAD) problem involves classifying each frame in a video as normal or abnormal, without any access to labels.
To accomplish this, the proposed method employs conditional diffusion models, where the input data consists of features extracted from a pre-trained network.
Our method utilizes a data-driven threshold and considers a high reconstruction error as an indicator of anomalous events.
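The scoring rule in this summary is simple enough to sketch directly; the mean-plus-k-sigma threshold is one common data-driven choice and is our assumption rather than the paper's exact rule.

```python
import numpy as np

def detect_anomalies(recon_errors, k=1.5):
    """recon_errors: per-frame reconstruction errors from the conditional
    diffusion model, shape (num_frames,). Returns a boolean anomaly mask.
    The threshold is data-driven: estimated from the errors themselves
    (illustrative mean + k*std rule; no labels required)."""
    errors = np.asarray(recon_errors, dtype=np.float64)
    threshold = errors.mean() + k * errors.std()
    return errors > threshold

# Example: the frame the model reconstructs poorly is the one flagged.
mask = detect_anomalies([0.10, 0.12, 0.11, 0.45, 0.13])  # only 0.45 exceeds
```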
arXiv Detail & Related papers (2023-07-04T07:36:48Z)
- Multi-level Memory-augmented Appearance-Motion Correspondence Framework for Video Anomaly Detection [1.9511777443446219]
We propose a multi-level memory-augmented appearance-motion correspondence framework.
The latent correspondence between appearance and motion is explored via appearance-motion semantics alignment and semantics replacement training.
Our framework outperforms the state-of-the-art methods, achieving AUCs of 99.6%, 93.8%, and 76.3% on UCSD Ped2, CUHK Avenue, and ShanghaiTech datasets.
arXiv Detail & Related papers (2023-03-09T08:43:06Z)
- SSMTL++: Revisiting Self-Supervised Multi-Task Learning for Video Anomaly Detection [108.57862846523858]
We revisit the self-supervised multi-task learning framework, proposing several updates to the original method.
We modernize the 3D convolutional backbone by introducing multi-head self-attention modules.
In our attempt to further improve the model, we study additional self-supervised learning tasks, such as predicting segmentation maps.
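A minimal sketch of what "modernizing a 3D convolutional backbone with multi-head self-attention" can look like: flatten the spatio-temporal feature map into tokens, attend, and fold back. Channel counts, placement, and the residual/norm arrangement are our choices, not the paper's exact architecture.

```python
import torch

class AttnAugmented3DBlock(torch.nn.Module):
    """Illustrative sketch: a 3D conv block followed by multi-head
    self-attention over its spatio-temporal tokens."""

    def __init__(self, channels=64, heads=4):
        super().__init__()
        self.conv = torch.nn.Conv3d(channels, channels, kernel_size=3, padding=1)
        self.attn = torch.nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = torch.nn.LayerNorm(channels)

    def forward(self, x):                      # x: (B, C, T, H, W)
        x = torch.relu(self.conv(x))
        B, C, T, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, T*H*W, C) token sequence
        attn_out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attn_out)  # residual connection + norm
        return tokens.transpose(1, 2).reshape(B, C, T, H, W)

# Usage: AttnAugmented3DBlock()(torch.randn(2, 64, 8, 16, 16))  # same shape out
```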
arXiv Detail & Related papers (2022-07-16T19:25:41Z)
- Large-scale Robustness Analysis of Video Action Recognition Models [10.017292176162302]
We study the robustness of six state-of-the-art action recognition models against 90 different perturbations.
The study reveals several interesting findings: 1) transformer-based models are consistently more robust than CNN-based models; 2) pretraining improves robustness more for transformer-based models than for CNN-based models; and 3) all of the studied models are robust to temporal perturbations on all datasets except SSv2.
arXiv Detail & Related papers (2022-07-04T13:29:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.