Efficient Video Prediction via Sparsely Conditioned Flow Matching
- URL: http://arxiv.org/abs/2211.14575v2
- Date: Thu, 24 Aug 2023 19:28:10 GMT
- Title: Efficient Video Prediction via Sparsely Conditioned Flow Matching
- Authors: Aram Davtyan, Sepehr Sameni, Paolo Favaro
- Abstract summary: We introduce a novel generative model for video prediction based on latent flow matching.
We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER.
- Score: 24.32740918613266
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce a novel generative model for video prediction based on latent
flow matching, an efficient alternative to diffusion-based models. In contrast
to prior work, we keep the high costs of modeling the past during training and
inference at bay by conditioning only on a small random set of past frames at
each integration step of the image generation process. Moreover, to enable the
generation of high-resolution videos and to speed up the training, we work in
the latent space of a pretrained VQGAN. Finally, we propose to approximate the
initial condition of the flow ODE with the previous noisy frame. This allows us
to reduce the number of integration steps and hence to speed up sampling at
inference time. We call our model Random frame conditioned flow Integration for
VidEo pRediction, or, in short, RIVER. We show that RIVER achieves superior or
on par performance compared to prior work on common video prediction
benchmarks, while requiring an order of magnitude fewer computational
resources.
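The abstract describes two key mechanisms: regressing a velocity field conditioned on a small random subset of past latent frames, and integrating the resulting flow ODE with few steps by warm-starting from the previous noisy frame. A minimal numpy sketch of both ideas follows; all names, shapes, and the dummy conditioning interface are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def flow_matching_loss(model, latents, num_cond=2, rng=None):
    """One hypothetical training step on a batch of latent video frames.

    latents: array of shape (B, T, C, H, W), e.g. VQGAN latents.
    The velocity field model(x_t, t, context) is regressed toward the
    straight-line target (x1 - x0) between noise x0 and the target frame x1,
    conditioned only on a small random set of past frames (sparse conditioning).
    """
    rng = np.random.default_rng() if rng is None else rng
    B, T, C, H, W = latents.shape
    x1 = latents[:, -1]                         # frame to predict
    x0 = rng.standard_normal(x1.shape)          # noise sample
    t = rng.random((B, 1, 1, 1))                # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                 # linear interpolation path
    # Condition on a few randomly chosen past frames, not the full history.
    cond_idx = rng.choice(T - 1, size=num_cond, replace=False)
    context = latents[:, cond_idx]
    v_pred = model(x_t, t.reshape(B), context)
    return np.mean((v_pred - (x1 - x0)) ** 2)

def euler_sample(model, context, shape, steps=10, x_init=None, rng=None):
    """Integrate the flow ODE dx/dt = v(x, t, context) with Euler steps.

    Passing a previous noisy frame as x_init approximates the ODE's initial
    condition, which is what lets a small number of steps suffice.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape) if x_init is None else x_init
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * model(x, t, context)
    return x
```

In this sketch `model` stands in for the learned velocity network; at inference, sampling each new frame conditions on a random subset of already generated frames, so cost stays constant in video length.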
Related papers
- Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction [23.493813870675197]
Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities.
Current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene.
We introduce a novel approach, the Temporal Residual Guided Diffusion Framework, which effectively leverages both temporal and frequency-based event priors.
arXiv Detail & Related papers (2024-07-15T11:48:57Z)
- Live2Diff: Live Stream Translation via Uni-directional Attention in Video Diffusion Models [64.2445487645478]
Large Language Models have shown remarkable efficacy in generating streaming data such as text and audio.
We present Live2Diff, the first attempt at designing a video diffusion model with uni-directional temporal attention, specifically targeting live streaming video translation.
arXiv Detail & Related papers (2024-07-11T17:34:51Z)
- Video Interpolation with Diffusion Models [54.06746595879689]
We present VIDIM, a generative model for video, which creates short videos given a start and end frame.
VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video.
arXiv Detail & Related papers (2024-04-01T15:59:32Z)
- Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z)
- Guided Flows for Generative Modeling and Decision Making [55.42634941614435]
We show that Guided Flows significantly improve sample quality in conditional image generation and zero-shot text-to-speech synthesis.
Notably, we are the first to apply flow models for plan generation in the offline reinforcement learning setting, with a substantial speedup in computation compared to diffusion models.
arXiv Detail & Related papers (2023-11-22T15:07:59Z)
- CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation [49.3016007471979]
Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks.
However, their widespread adoption is hindered by the high computational cost, which limits their real-time application.
We introduce a novel method dubbed CoDi, that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs.
arXiv Detail & Related papers (2023-10-02T17:59:18Z)
- HARP: Autoregressive Latent Video Prediction with High-Fidelity Image Generator [90.74663948713615]
We train an autoregressive latent video prediction model capable of predicting high-fidelity future frames.
We produce high-resolution (256x256) videos with minimal modification to existing models.
arXiv Detail & Related papers (2022-09-15T08:41:57Z)
- Video Diffusion Models [47.99413440461512]
Generating temporally coherent high fidelity video is an important milestone in generative modeling research.
We propose a diffusion model for video generation that shows very promising initial results.
We present the first results on a large text-conditioned video generation task, as well as state-of-the-art results on an established unconditional video generation benchmark.
arXiv Detail & Related papers (2022-04-07T14:08:02Z)
- Transformation-based Adversarial Video Prediction on Large-Scale Data [19.281817081571408]
We focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence.
We first improve the state of the art by performing a systematic empirical study of discriminator decompositions.
We then propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features.
arXiv Detail & Related papers (2020-03-09T10:52:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.