Video Diffusion Models with Local-Global Context Guidance
- URL: http://arxiv.org/abs/2306.02562v1
- Date: Mon, 5 Jun 2023 03:32:27 GMT
- Title: Video Diffusion Models with Local-Global Context Guidance
- Authors: Siyuan Yang, Lu Zhang, Yu Liu, Zhizhuo Jiang and You He
- Abstract summary: We propose a Local-Global Context guided Video Diffusion model (LGC-VD) to capture multi-perception conditions for producing high-quality videos.
Our experiments demonstrate that the proposed method achieves favorable performance on video prediction, interpolation, and unconditional video generation.
- Score: 17.040535240422088
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have emerged as a powerful paradigm in video synthesis tasks
including prediction, generation, and interpolation. Due to the limitation of
the computational budget, existing methods usually implement conditional
diffusion models with an autoregressive inference pipeline, in which the future
fragment is predicted based on the distribution of adjacent past frames.
However, conditions drawn from only a few previous frames cannot capture
global temporal coherence, leading to inconsistent or even implausible results
in long-term video prediction. In this paper, we propose a Local-Global Context
guided Video Diffusion model (LGC-VD) to capture multi-perception conditions
for producing high-quality videos in both conditional/unconditional settings.
In LGC-VD, the UNet is built from stacked residual blocks with self-attention
units, avoiding the heavy computational cost of 3D convolutions.
We construct a local-global context guidance strategy to capture the
multi-perceptual embedding of the past fragment to boost the consistency of
future prediction. Furthermore, we propose a two-stage training strategy to
alleviate the effect of noisy frames for more stable predictions. Our
experiments demonstrate that the proposed method achieves favorable performance
on video prediction, interpolation, and unconditional video generation. We
release code at https://github.com/exisas/LGC-VD.
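To make the conditioning idea concrete, the sketch below shows one way a denoiser could be fed both per-frame (local) and fragment-level (global) context extracted from past frames. This is a minimal illustrative sketch in PyTorch, not the released LGC-VD code; the module names (LocalGlobalGuidedDenoiser, local_encoder, global_encoder, denoiser) and the fusion interface are assumptions for exposition only.

```python
# Minimal sketch of local-global context guidance for a conditional video
# diffusion step. NOT the released LGC-VD implementation; the encoder and
# denoiser interfaces below are hypothetical placeholders.
import torch
import torch.nn as nn


class LocalGlobalGuidedDenoiser(nn.Module):
    """Wraps a 2D-UNet-style denoiser so each denoising step sees both
    fine-grained per-frame features and a fragment-level summary."""

    def __init__(self, local_encoder: nn.Module, global_encoder: nn.Module,
                 denoiser: nn.Module):
        super().__init__()
        self.local_encoder = local_encoder    # (B, T, C, H, W) -> per-frame feature maps
        self.global_encoder = global_encoder  # (B, T, C, H, W) -> one embedding per fragment
        self.denoiser = denoiser              # stacked residual blocks + self-attention

    def forward(self, noisy_future: torch.Tensor, t: torch.Tensor,
                past_frames: torch.Tensor) -> torch.Tensor:
        local_ctx = self.local_encoder(past_frames)    # short-range, detail-preserving cues
        global_ctx = self.global_encoder(past_frames)  # long-range, coherence-preserving summary
        # The denoiser is expected to fuse both conditions (e.g. via
        # cross-attention or feature concatenation) while predicting the
        # noise residual for the future fragment at timestep t.
        return self.denoiser(noisy_future, t, local_ctx=local_ctx, global_ctx=global_ctx)
```

The split mirrors the abstract's claim: the local path preserves detail from adjacent past frames, while the global path carries fragment-level temporal coherence that a short conditioning window alone cannot provide.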
Related papers
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods make significant breakthroughs by adapting Stable Diffusion for this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z) - Zero-Shot Video Semantic Segmentation based on Pre-Trained Diffusion Models [96.97910688908956]
We introduce the first zero-shot approach for Video Semantic Segmentation (VSS) based on pre-trained diffusion models.
We propose a framework tailored for VSS based on pre-trained image and video diffusion models.
Experiments show that our proposed approach outperforms existing zero-shot image semantic segmentation approaches.
arXiv Detail & Related papers (2024-05-27T08:39:38Z) - Spatial Decomposition and Temporal Fusion based Inter Prediction for
Learned Video Compression [59.632286735304156]
We propose a spatial decomposition and temporal fusion based inter prediction for learned video compression.
With the SDD-based motion model and long short-term temporal fusion, our proposed learned video compression scheme can obtain more accurate inter prediction contexts.
arXiv Detail & Related papers (2024-01-29T03:30:21Z) - Video-GroundingDINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding [108.79026216923984]
Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query.
This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task.
arXiv Detail & Related papers (2023-12-31T13:53:37Z) - Transform-Equivariant Consistency Learning for Temporal Sentence
Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be predicted consistently.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z) - Efficient Video Prediction via Sparsely Conditioned Flow Matching [24.32740918613266]
We introduce a novel generative model for video prediction based on latent flow matching.
We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER.
arXiv Detail & Related papers (2022-11-26T14:18:50Z) - A unified model for continuous conditional video prediction [14.685237010856953]
Conditional video prediction tasks are normally solved by task-related models.
Almost all conditional video prediction models can only achieve discrete prediction.
In this paper, we propose a unified model that addresses these two issues at the same time.
arXiv Detail & Related papers (2022-10-11T22:26:59Z) - HARP: Autoregressive Latent Video Prediction with High-Fidelity Image
Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z) - Masked Conditional Video Diffusion for Prediction, Generation, and
Interpolation [14.631523634811392]
Masked Conditional Video Diffusion (MCVD) is a general-purpose framework for video prediction.
We train the model in a manner where we randomly and independently mask all the past frames or all the future frames.
Our approach yields SOTA results across standard video prediction benchmarks, with computation times measured in 1-12 days.
arXiv Detail & Related papers (2022-05-19T20:58:05Z) - Versatile Learned Video Compression [26.976302025254043]
We propose a versatile learned video compression (VLVC) framework that uses one model to support all possible prediction modes.
Specifically, to realize versatile compression, we first build a motion compensation module that applies multiple 3D motion vector fields.
We show that the flow prediction module can largely reduce the transmission cost of voxel flows.
arXiv Detail & Related papers (2021-11-05T10:50:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.