FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
- URL: http://arxiv.org/abs/2502.05179v3
- Date: Fri, 14 Mar 2025 02:41:07 GMT
- Title: FlashVideo: Flowing Fidelity to Detail for Efficient High-Resolution Video Generation
- Authors: Shilong Zhang, Wenbo Li, Shoufa Chen, Chongjian Ge, Peize Sun, Yida Zhang, Yi Jiang, Zehuan Yuan, Binyue Peng, Ping Luo
- Abstract summary: DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). We propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality.
- Score: 61.61415607972597
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: DiT diffusion models have achieved great success in text-to-video generation, leveraging their scalability in model capacity and data scale. High content and motion fidelity aligned with text prompts, however, often require large model parameters and a substantial number of function evaluations (NFEs). Realistic and visually appealing details are typically reflected in high-resolution outputs, further amplifying computational demands, especially for single-stage DiT models. To address these challenges, we propose a novel two-stage framework, FlashVideo, which strategically allocates model capacity and NFEs across stages to balance generation fidelity and quality. In the first stage, prompt fidelity is prioritized through a low-resolution generation process that uses large model parameters and sufficient NFEs; generating at low resolution keeps this stage computationally efficient. The second stage establishes flow matching between low and high resolutions, effectively generating fine details with minimal NFEs. Quantitative and visual results demonstrate that FlashVideo achieves state-of-the-art high-resolution video generation with superior computational efficiency. Additionally, the two-stage design enables users to preview the initial output and adjust the prompt accordingly before committing to full-resolution generation, thereby significantly reducing computational costs and wait times as well as enhancing commercial viability.
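The phrase "flow matching between low and high resolutions" suggests, as a hedged reading (the paper's exact coupling and parameterization may differ), a standard conditional flow-matching objective on a straight path from an upsampled low-resolution sample $x_0$ to a high-resolution sample $x_1$:

$x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\mathrm{FM}} = \mathbb{E}_{t,\,x_0,\,x_1}\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2,$

so that inference only needs to integrate the learned velocity $v_\theta$ for a handful of steps. Building on that reading, the sketch below shows how the two stages could compose at inference time; every name here (flashvideo_style_sample, stage1_sample, stage2_refine) is an illustrative assumption, not the authors' released API.

```python
# Hypothetical sketch of a FlashVideo-style two-stage sampler.
# Assumptions: stage1_sample(prompt, steps) returns a (B, C, T, H, W) low-res
# video tensor; stage2_refine(video, prompt, steps) maps a coarse high-res
# video to a detailed one using a few flow-matching integration steps.
import torch
import torch.nn.functional as F

def flashvideo_style_sample(stage1_sample, stage2_refine, prompt,
                            nfe_stage1=50, nfe_stage2=4, scale=4):
    # Stage 1: spend a large model and many NFEs at low resolution, so content
    # and motion stay faithful to the prompt while compute remains affordable.
    low = stage1_sample(prompt, nfe_stage1)

    # Natural preview point: a user can inspect `low` and revise the prompt
    # before committing to full-resolution generation.

    # Stage 2: naive spatial upsample (time axis untouched), then flow from the
    # coarse video toward the high-resolution distribution in very few NFEs.
    coarse = F.interpolate(low, scale_factor=(1, scale, scale), mode="trilinear")
    return stage2_refine(coarse, prompt, nfe_stage2)

if __name__ == "__main__":
    # Dummy stand-ins so the control flow runs end to end.
    stage1 = lambda prompt, steps: torch.randn(1, 3, 8, 32, 32)
    stage2 = lambda video, prompt, steps: video  # identity "refiner"
    print(flashvideo_style_sample(stage1, stage2, "a red fox in snow").shape)
    # torch.Size([1, 3, 8, 128, 128])
```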
Related papers
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos.
These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- MVQA: Mamba with Unified Sampling for Efficient Video Quality Assessment [24.053542031123985]
We present MVQA, a Mamba-based model designed for efficient video quality assessment (VQA).
Its unified sampling scheme, USDS, combines semantic patch sampling from low-resolution videos with distortion patch sampling from original-resolution videos.
Experiments show that the proposed MVQA, equipped with USDS, achieves comparable performance to state-of-the-art methods.
arXiv Detail & Related papers (2025-04-22T16:08:23Z)
- HiFlow: Training-free High-Resolution Image Generation with Flow-Aligned Guidance [70.69373563281324]
HiFlow is a training-free and model-agnostic framework to unlock the resolution potential of pre-trained flow models.
HiFlow substantially elevates the quality of high-resolution image synthesis of T2I models.
arXiv Detail & Related papers (2025-04-08T17:30:40Z)
- Training-free Diffusion Acceleration with Bottleneck Sampling [37.9135035506567]
Bottleneck Sampling is a training-free framework that leverages low-resolution priors to reduce computational overhead while preserving output fidelity.
It accelerates inference by up to 3× for image generation and 2.5× for video generation, all while maintaining output quality comparable to the standard full-resolution sampling process.
arXiv Detail & Related papers (2025-03-24T17:59:02Z)
- Multimodal Instruction Tuning with Hybrid State Space Models [25.921044010033267]
Long context is crucial for enhancing the recognition and understanding capabilities of multimodal large language models.
We propose a novel approach using a hybrid transformer-MAMBA model to efficiently handle long contexts in multimodal applications.
Our model enhances inference efficiency for high-resolution images and high-frame-rate videos by about 4 times compared to current models.
arXiv Detail & Related papers (2024-11-13T18:19:51Z)
- DiffuseHigh: Training-free Progressive High-Resolution Image Synthesis through Structure Guidance [11.44012694656102]
Large-scale generative models, such as text-to-image diffusion models, have garnered widespread attention across diverse domains.
Existing large-scale diffusion models are confined to generating images of up to 1K resolution.
We propose a novel progressive approach that fully utilizes generated low-resolution images to guide the generation of higher-resolution images.
arXiv Detail & Related papers (2024-06-26T16:10:31Z)
- Video Interpolation with Diffusion Models [54.06746595879689]
We present VIDIM, a generative model for video, which creates short videos given a start and end frame.
VIDIM uses cascaded diffusion models to first generate the target video at low resolution, and then generate the high-resolution video conditioned on the low-resolution generated video.
arXiv Detail & Related papers (2024-04-01T15:59:32Z)
- I2VGen-XL: High-Quality Image-to-Video Synthesis via Cascaded Diffusion Models [54.99771394322512]
Video synthesis has recently made remarkable strides, benefiting from the rapid development of diffusion models.
It still encounters challenges in terms of semantic accuracy, clarity, and spatio-temporal continuity.
We propose a cascaded I2VGen-XL approach that enhances model performance by decoupling these two factors.
I2VGen-XL can simultaneously enhance the semantic accuracy, continuity of details and clarity of generated videos.
arXiv Detail & Related papers (2023-11-07T17:16:06Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
- Interpretable Detail-Fidelity Attention Network for Single Image Super-Resolution [89.1947690981471]
We propose a purposeful and interpretable detail-fidelity attention network that progressively processes smooth regions and details in a divide-and-conquer manner.
In particular, we propose a Hessian filtering for interpretable feature representation that is well suited to detail inference.
Experiments demonstrate that the proposed methods achieve superior performance over state-of-the-art methods.
arXiv Detail & Related papers (2020-09-28T08:31:23Z)