FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
- URL: http://arxiv.org/abs/2502.05179v1
- Date: Fri, 07 Feb 2025 18:59:59 GMT
- Title: FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
- Authors: Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
- Abstract summary: DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale.
High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs).
We propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality.
- Score: 61.61415607972597
- License:
- Abstract: DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands, especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process that uses large model capacity and sufficient NFEs to enhance computational efficiency. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
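For intuition, here is a minimal sketch of how such a two-stage pipeline could be wired up. The callables (`stage1`, `stage2`, `upsample`), the tensor shapes, and the plain Euler integration of a rectified-flow ODE are illustrative assumptions, not the paper's exact architecture or sampler:

```python
import torch

@torch.no_grad()
def flashvideo_style_pipeline(prompt_emb, stage1, stage2, upsample,
                              nfe1=50, nfe2=4):
    """Hedged sketch of a two-stage low-to-high-resolution pipeline.

    Hypothetical callables (not from the paper's code):
      - stage1(x, t, prompt_emb): large text-to-video DiT predicting a
        flow velocity at low resolution (many NFEs, prompt fidelity).
      - stage2(x, t, prompt_emb): smaller model that transports the
        coarse video toward a detailed one (few NFEs).
      - upsample(video): naive spatial upsampling to target resolution.
    """
    # Stage 1: generate a low-resolution video from Gaussian noise by
    # integrating the flow ODE with nfe1 Euler steps.
    x = torch.randn(1, 16, 3, 270, 480)            # (B, frames, C, H, W)
    for i in range(nfe1):
        t = torch.full((1,), i / nfe1)
        x = x + stage1(x, t, prompt_emb) / nfe1    # Euler step, dt = 1/nfe1

    # Stage 2: start the second flow *from the upsampled coarse video*
    # rather than from fresh noise, so only a few steps add fine detail.
    x = upsample(x)                                # e.g. 270p -> 1080p
    for i in range(nfe2):
        t = torch.full((1,), i / nfe2)
        x = x + stage2(x, t, prompt_emb) / nfe2    # Euler step, dt = 1/nfe2
    return x
```

The point the sketch makes is the allocation: the expensive, high-NFE loop runs only at low resolution, while the high-resolution loop starts from the upsampled stage-1 output and therefore needs only minimal NFEs.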
Related papers
- Visual Autoregressive Modeling for Image Super-Resolution [14.935662351654601]
We propose a novel visual autoregressive modeling framework for image super-resolution (ISR) in the form of next-scale prediction.
We collect large-scale data and design a training process to obtain robust generative priors.
arXiv Detail & Related papers (2025-01-31T09:53:47Z) - DiffVSR: Enhancing Real-World Video Super-Resolution with Diffusion Models for Advanced Visual Quality and Temporal Consistency [25.756755602342942]
We present DiffVSR, a diffusion-based framework for real-world video super-resolution.
For intra-sequence coherence, we develop a multi-scale temporal attention module and temporal-enhanced VAE decoder.
We propose a progressive learning strategy that transitions from simple to complex degradations, enabling robust optimization.
arXiv Detail & Related papers (2025-01-17T10:53:03Z) - STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution [42.859188375578604]
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods.
These models struggle to maintain temporal consistency, as they are trained on static images.
We introduce a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency.
arXiv Detail & Related papers (2025-01-06T12:36:21Z) - Multimodal Instruction Tuning with Hybrid State Space Models [25.921044010033267]
Long context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models.
We propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications.
Our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models.
arXiv Detail & Related papers (2024-11-13T18:19:51Z) - Video Interpolation with Diffusion Models [54.06746595879689]
We present VIDIM, a generative model for video interpolation, which creates short videos given a start and end frame.
VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution output (see the cascade sketch after this list).
arXiv Detail & Related papers (2024-04-01T15:59:32Z) - Make a Cheap Scaling: A Self-Cascade Diffusion Model for Higher-Resolution Adaptation [112.08287900261898]
This paper proposes a novel self-cascade diffusion model for rapid adaptation to higher-resolution image and video generation.
Our approach achieves a 5X training speed-up and requires only an additional 0.002M tuning parameters.
Experiments demonstrate that our approach can quickly adapt to higher resolution image and video synthesis by fine-tuning for just 10k steps, with virtually no additional inference time.
arXiv Detail & Related papers (2024-02-16T07:48:35Z) - I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z) - LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z) - Interpretable Detail-Fidelity Attention Network for Single Image Super-Resolution [89.1947690981471]
We propose a purposeful and interpretable detail-fidelity attention network that progressively processes smooth regions and details in a divide-and-conquer manner.
Particularly, we propose Hessian filtering for interpretable feature representation, which is well suited to detail inference.
Experiments demonstrate that the proposed method achieves superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2020-09-28T08:31:23Z)
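For contrast with FlashVideo's flow-matching second stage, below is an equally hedged sketch of the conventional cascade used by several papers above (e.g. VIDIM, I2VGen-XL, LAVIE), where the high-resolution stage denoises from fresh noise and is only conditioned on the low-resolution result. `base_model`, `sr_model`, `sampler`, and `upsample` are hypothetical placeholders:

```python
import torch

@torch.no_grad()
def cascaded_generation(prompt_emb, base_model, sr_model, sampler, upsample):
    """Hedged sketch of a generic two-stage cascade (not FlashVideo):
    both stages run a full denoising loop from pure Gaussian noise."""
    # Stage 1: full denoising loop at low resolution.
    lowres = sampler(base_model, shape=(1, 16, 3, 64, 64),
                     cond={"text": prompt_emb})

    # Stage 2: another full denoising loop from fresh noise, with the
    # upsampled low-resolution video as an extra conditioning signal.
    highres = sampler(sr_model, shape=(1, 16, 3, 512, 512),
                      cond={"text": prompt_emb,
                            "lowres": upsample(lowres)})
    return highres
```

Because both stages here run full denoising loops, the NFE cost at high resolution is much larger than in the flow-matching sketch above, which is precisely the cost FlashVideo's second stage is designed to avoid.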
This list is automatically generated from the titles and abstracts of the papers on this site.