UltraGen: High-Resolution Video Generation with Hierarchical Attention
- URL: http://arxiv.org/abs/2510.18775v1
- Date: Tue, 21 Oct 2025 16:23:21 GMT
- Title: UltraGen: High-Resolution Video Generation with Hierarchical Attention
- Authors: Teng Hu, Jiangning Zhang, Zihan Su, Ran Yi, et al.
- Abstract summary: UltraGen is a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis.
We show that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time.
- Score: 62.99161115650818
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video generation have made it possible to produce visually compelling videos, with wide-ranging applications in content creation, entertainment, and virtual reality. However, most existing diffusion transformer based video generation models are limited to low-resolution outputs (<=720P) due to the quadratic computational complexity of the attention mechanism with respect to the output width and height. This computational bottleneck makes native high-resolution video generation (1080P/2K/4K) impractical for both training and inference. To address this challenge, we present UltraGen, a novel video generation framework that enables i) efficient and ii) end-to-end native high-resolution video synthesis. Specifically, UltraGen features a hierarchical dual-branch attention architecture based on global-local attention decomposition, which decouples full attention into a local attention branch for high-fidelity regional content and a global attention branch for overall semantic consistency. We further propose a spatially compressed global modeling strategy to efficiently learn global dependencies, and a hierarchical cross-window local attention mechanism to reduce computational costs while enhancing information flow across different local windows. Extensive experiments demonstrate that UltraGen can effectively scale pre-trained low-resolution video models to 1080P and even 4K resolution for the first time, outperforming existing state-of-the-art methods and super-resolution based two-stage pipelines in both qualitative and quantitative evaluations.
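The abstract describes the core mechanism concretely enough to sketch. Below is a minimal, single-frame PyTorch sketch of a dual-branch global-local attention layer in the spirit described above: a local branch runs full attention inside non-overlapping windows, while a global branch lets every token attend to a spatially compressed copy of the token map. The class name and the `window` and `pool` hyperparameters are illustrative assumptions, attention is single-head for brevity, and the paper's hierarchical cross-window mechanism is not reproduced.

```python
# Minimal sketch of a dual-branch global-local attention layer.
# Hypothetical names and sizes; not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalAttention(nn.Module):
    def __init__(self, dim: int, window: int = 8, pool: int = 4):
        super().__init__()
        self.window = window                      # side length of each local window
        self.pool = pool                          # spatial compression factor (global branch)
        self.qkv_local = nn.Linear(dim, 3 * dim)
        self.q_global = nn.Linear(dim, dim)
        self.kv_global = nn.Linear(dim, 2 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) token map for one frame (temporal axis omitted for brevity;
        # H and W are assumed divisible by `window` and `pool`)
        B, H, W, C = x.shape
        w = self.window

        # Local branch: full attention inside non-overlapping w x w windows.
        xw = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        xw = xw.reshape(-1, w * w, C)                     # (B * nWin, w*w, C)
        q, k, v = self.qkv_local(xw).chunk(3, dim=-1)
        local = F.scaled_dot_product_attention(q, k, v)   # cost ~ nWin * (w*w)^2
        local = local.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        local = local.reshape(B, H, W, C)

        # Global branch: all tokens query a spatially compressed token map.
        pooled = F.avg_pool2d(x.permute(0, 3, 1, 2), self.pool)  # (B, C, H/p, W/p)
        pooled = pooled.flatten(2).transpose(1, 2)               # (B, HW/p^2, C)
        qg = self.q_global(x.reshape(B, H * W, C))
        kg, vg = self.kv_global(pooled).chunk(2, dim=-1)
        glob = F.scaled_dot_product_attention(qg, kg, vg)        # cost ~ HW * HW/p^2
        glob = glob.reshape(B, H, W, C)

        return self.proj(local + glob)  # fuse the two branches
```

With window side w and pooling factor p, the local branch costs O(HW * w^2) per frame and the global branch O((HW)^2 / p^2), versus O((HW)^2) for full attention, which is where the savings at 1080P-4K token counts would come from.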
Related papers
- SemanticGen: Video Generation in Semantic Space [60.49729308406981]
State-of-the-art video generative models typically learn the distribution of video latents in the VAE space and map them to pixels using a VAE decoder.
We introduce SemanticGen, a novel solution to generate videos in the semantic space.
Our method is also effective and computationally efficient when extended to long video generation.
arXiv Detail & Related papers (2025-12-23T18:59:56Z)
- Scale-DiT: Ultra-High-Resolution Image Generation with Hierarchical Local Attention [50.391914489898774]
Scale-DiT is a new diffusion framework that introduces hierarchical local attention with low-resolution global guidance.
A lightweight LoRA adaptation bridges global and local pathways during denoising, ensuring consistency across structure and detail.
Experiments demonstrate that Scale-DiT achieves more than $2\times$ faster inference and lower memory usage.
arXiv Detail & Related papers (2025-10-18T03:15:26Z)
- SuperGen: An Efficient Ultra-high-resolution Video Generation System with Sketching and Tiling [27.96742776792205]
SuperGen is an efficient tile-based framework for ultra-high-resolution video generation.
It supports a wide range of resolutions without additional training effort.
SuperGen incorporates a tile-tailored, adaptive, region-aware caching strategy.
arXiv Detail & Related papers (2025-08-25T07:49:17Z)
- CineScale: Free Lunch in High-Resolution Cinematic Visual Generation [42.81729840016782]
We propose CineScale, a novel inference paradigm to enable higher-resolution visual generation.
Our approach enables 8K image generation without any fine-tuning, and achieves 4K video generation with only minimal LoRA fine-tuning.
arXiv Detail & Related papers (2025-08-21T17:59:57Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos.
These innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- Turbo2K: Towards Ultra-Efficient and High-Quality 2K Video Synthesis [50.77548592888096]
Demand for 2K video synthesis is rising with increasing consumer expectations for ultra-clear visuals.
Turbo2K is an efficient framework for generating detail-rich 2K videos.
arXiv Detail & Related papers (2025-04-20T03:30:59Z)
- Hierarchical Patch Diffusion Models for High-Resolution Video Generation [50.42746357450949]
We develop deep context fusion, which propagates context information from low-scale to high-scale patches in a hierarchical manner.
We also propose adaptive computation, which allocates more network capacity and computation towards coarse image details.
The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation.
arXiv Detail & Related papers (2024-06-12T01:12:53Z)
- An Efficient Recurrent Adversarial Framework for Unsupervised Real-Time Video Enhancement [132.60976158877608]
We propose an efficient adversarial video enhancement framework that learns directly from unpaired video examples.
In particular, our framework introduces new recurrent cells that consist of interleaved local and global modules for implicit integration of spatial and temporal information.
The proposed design allows our recurrent cells to efficiently propagate temporal information across frames and reduces the need for high-complexity networks.
arXiv Detail & Related papers (2020-12-24T00:03:29Z)
- HRVGAN: High Resolution Video Generation using Spatio-Temporal GAN [0.0]
We propose a novel deep generative network architecture designed specifically for high-resolution video synthesis.
Our approach integrates key concepts from Wasserstein Generative Adversarial Networks (WGANs).
Our training objective combines a pixel-wise mean squared error loss with an adversarial loss to balance frame-level accuracy and video realism (a sketch of such a combined objective follows this list).
arXiv Detail & Related papers (2020-08-17T20:45:59Z)
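As referenced in the HRVGAN entry above, its training objective combines a pixel-wise mean squared error loss with an adversarial loss. A minimal sketch of one such combination, assuming a WGAN-style critic and a hypothetical weight `lambda_adv` (not the paper's exact formulation):

```python
# Minimal sketch of a combined pixel-wise + adversarial generator loss,
# as described for HRVGAN above. The weight and the WGAN-style critic
# term are assumptions for illustration.
import torch
import torch.nn.functional as F

def generator_loss(fake: torch.Tensor,
                   real: torch.Tensor,
                   critic_fake: torch.Tensor,
                   lambda_adv: float = 0.01) -> torch.Tensor:
    pixel = F.mse_loss(fake, real)           # frame-level accuracy
    adversarial = -critic_fake.mean()        # WGAN critic score on generated videos
    return pixel + lambda_adv * adversarial  # balance fidelity and realism
```

The relative weight controls the trade-off the entry describes: the MSE term anchors per-frame accuracy while the critic term pushes toward realistic video statistics.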
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.