Efficient Video Prediction via Sparsely Conditioned Flow Matching
- URL: http://arxiv.org/abs/2211.14575v2
- Date: Thu, 24 Aug 2023 19:28:10 GMT
- Title: Efficient Video Prediction via Sparsely Conditioned Flow Matching
- Authors: Aram Davtyan, Sepehr Sameni, Paolo Favaro
- Abstract summary: We introduce a novel generative model for video prediction based on latent flow matching.
We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER.
- Score: 24.32740918613266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel generative model for video prediction based on latent
flow matching, an efficient alternative to diffusion-based models. In contrast
to prior work, we keep the high costs of modeling the past during training and
inference at bay by conditioning only on a small random set of past frames at
each integration step of the image generation process. Moreover, to enable the
generation of high-resolution videos and to speed up the training, we work in
the latent space of a pretrained VQGAN. Finally, we propose to approximate the
initial condition of the flow ODE with the previous noisy frame. This allows us
to reduce the number of integration steps and hence to speed up sampling at
inference time. We call our model Random frame conditioned flow Integration for
VidEo pRediction, or, in short, RIVER. We show that RIVER achieves superior or
on-par performance relative to prior work on common video prediction
benchmarks, while requiring an order of magnitude fewer computational
resources.
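Read as pseudocode, the abstract names three efficiency levers: a randomly chosen past-frame condition at each ODE step, a pretrained VQGAN latent space, and a warm start from the noised previous frame. Below is a minimal, hypothetical sketch of such a sampler under a linear (optimal-transport) probability path; `v_theta`, its signature, and the conditioning layout are illustrative assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def predict_next_frame(v_theta, past_latents, n_steps=10, t0=0.5):
    """Euler-integrate the flow ODE from t0 to 1 to produce the next latent frame.

    past_latents: list of VQGAN latent frames observed so far, each [C, H, W].
    (Sketch only; the real RIVER conditioning and schedule may differ.)
    """
    prev = past_latents[-1]
    noise = torch.randn_like(prev)
    # Warm start: approximate the state at time t0 by noising the previous
    # frame, per the abstract's "previous noisy frame" initial condition.
    x = (1.0 - t0) * noise + t0 * prev
    ts = torch.linspace(t0, 1.0, n_steps + 1)
    for i in range(n_steps):
        dt = ts[i + 1] - ts[i]
        # Condition on ONE randomly chosen past frame per integration step,
        # so the cost of modeling the past stays flat in the history length.
        k = int(torch.randint(len(past_latents), (1,)))
        v = v_theta(x, ts[i], past_latents[k], len(past_latents) - k)
        x = x + dt * v  # Euler step along the learned vector field
    return x  # a latent; decode with the VQGAN decoder to get pixels

if __name__ == "__main__":
    # Toy check with a dummy vector field that drifts toward the context frame.
    dummy_v = lambda x, t, ctx, offset: ctx - x
    history = [torch.randn(4, 16, 16) for _ in range(8)]
    print(predict_next_frame(dummy_v, history).shape)  # torch.Size([4, 16, 16])
```

A training-side counterpart would regress `v_theta` toward `x1 - x0` along interpolants `x_t = (1 - t) * x0 + t * x1`; that objective is standard conditional flow matching and is assumed here, not quoted from the paper.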
Related papers
- Fast constrained sampling in pre-trained diffusion models [77.21486516041391]
Diffusion models have dominated the field of large, generative image models.
We propose an algorithm for fast constrained sampling in large pre-trained diffusion models.
arXiv Detail & Related papers (2024-10-24T14:52:38Z)
- A Simple Early Exiting Framework for Accelerated Sampling in Diffusion Models [14.859580045688487]
A practical bottleneck of diffusion models is their sampling speed.
We propose a novel framework capable of adaptively allocating compute required for the score estimation.
We show that our method can significantly improve the sampling throughput of diffusion models without compromising image quality.
arXiv Detail & Related papers (2024-08-12T05:33:45Z)
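The early-exiting entry above only names the idea; as a rough, hypothetical illustration (the paper's actual exit criterion and architecture are not specified here), one can gate residual blocks of the score network on how much they still change the estimate:

```python
import torch
import torch.nn as nn

class EarlyExitScoreNet(nn.Module):
    """Generic early-exit sketch: refine the score estimate block by block and
    stop once the refinement becomes negligible, saving the remaining compute
    at that sampling step. Illustrative only, not the cited paper's design."""

    def __init__(self, dim=64, n_blocks=6, threshold=1e-2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
            for _ in range(n_blocks)
        )
        self.threshold = threshold

    def forward(self, h):
        for block in self.blocks:
            update = block(h)
            h = h + update  # residual refinement of the running estimate
            # Adaptive compute: exit early when this block changed little.
            if update.norm() / (h.norm() + 1e-8) < self.threshold:
                break
        return h

print(EarlyExitScoreNet()(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```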
- Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model [31.70050311326183]
Diffusion models tend to generate videos with less motion than expected.
We address this issue from both inference and training aspects.
Our methods outperform baselines by producing higher motion scores with lower errors.
arXiv Detail & Related papers (2024-06-22T04:56:16Z)
- Video Interpolation with Diffusion Models [54.06746595879689]
We present VIDIM, a generative model for video, which creates short videos given a start and end frame.
VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video.
arXiv Detail & Related papers (2024-04-01T15:59:32Z)
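As a hedged sketch of the two-stage cascade the VIDIM blurb describes, with `base_model` and `sr_model` as stand-ins for the two diffusion samplers (names and shapes are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def cascade_interpolate(base_model, sr_model, start, end, n_frames=9):
    """Two-stage cascade in the spirit of the VIDIM blurb (sketch only).

    base_model(start, end, n_frames) -> low-res video [T, C, h, w]
    sr_model(low_res_video, start, end) -> high-res video [T, C, H, W]
    The cascade lets each stage solve an easier problem: full trajectory at
    low resolution first, then super-resolution conditioned on that result.
    """
    # Stage 1: sample the trajectory at low resolution, conditioned on
    # downsampled boundary frames.
    lo_start = F.interpolate(start[None], scale_factor=0.25)[0]
    lo_end = F.interpolate(end[None], scale_factor=0.25)[0]
    low_res = base_model(lo_start, lo_end, n_frames)
    # Stage 2: super-resolve, conditioned on the low-res video and the
    # original-resolution start/end frames.
    return sr_model(low_res, start, end)

if __name__ == "__main__":
    # Toy stand-ins: linear interpolation + nearest-neighbor upsampling.
    base = lambda s, e, n: torch.stack([s + (e - s) * i / (n - 1) for i in range(n)])
    sr = lambda lo, s, e: F.interpolate(lo, size=s.shape[-2:])
    start, end = torch.randn(3, 64, 64), torch.randn(3, 64, 64)
    print(cascade_interpolate(base, sr, start, end).shape)  # [9, 3, 64, 64]
```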
- Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5x training speed-up and requires only 0.002M additional tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improves sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, achieving a 10x speedup in computation compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
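The guidance recipe behind the Guided Flows entry above mirrors classifier-free guidance, applied to the flow's vector field rather than a score; a minimal sketch, with the signature and null-condition handling as assumptions:

```python
import torch

def guided_velocity(v_theta, x, t, cond, w=2.0):
    """Classifier-free-style guidance on a flow's vector field (sketch).

    w = 1 recovers the conditional field; w > 1 pushes samples toward the
    condition, mirroring classifier-free guidance in diffusion models.
    """
    v_cond = v_theta(x, t, cond)
    v_uncond = v_theta(x, t, None)  # None stands in for a learned null token
    return v_uncond + w * (v_cond - v_uncond)

if __name__ == "__main__":
    # Toy field: drift toward the condition if present, else toward zero.
    dummy_v = lambda x, t, c: -x if c is None else c - x
    print(guided_velocity(dummy_v, torch.zeros(2), 0.5, torch.ones(2)))  # [2., 2.]
```

The guided field can be plugged into an ODE integrator such as the Euler loop sketched after the abstract above.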
- CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks.
However, their widespread adoption is hindered by the high computational cost, which limits their real-time application.
We introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
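A generic sketch of the autoregressive-latent paradigm the HARP blurb describes (the tokenizer, sampler, and shapes here are illustrative assumptions, not the paper's code):

```python
import torch

@torch.no_grad()
def predict_frame_tokens(transformer, history_tokens, tokens_per_frame=256,
                         temperature=1.0):
    """Sample one frame's worth of discrete VQ token ids, one at a time,
    given the flattened token history; a pretrained high-fidelity image
    generator then decodes the tokens into the predicted frame."""
    seq = history_tokens  # [1, T * tokens_per_frame] token ids
    for _ in range(tokens_per_frame):
        logits = transformer(seq)[:, -1, :] / temperature  # next-token logits
        nxt = torch.multinomial(torch.softmax(logits, -1), num_samples=1)
        seq = torch.cat([seq, nxt], dim=1)
    return seq[:, -tokens_per_frame:]  # token ids of the predicted frame

if __name__ == "__main__":
    vocab = 512
    dummy_lm = lambda s: torch.randn(s.shape[0], s.shape[1], vocab)  # stand-in
    hist = torch.randint(vocab, (1, 3 * 16))
    print(predict_frame_tokens(dummy_lm, hist, tokens_per_frame=16).shape)  # [1, 16]
```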
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high-fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.