LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
- URL: http://arxiv.org/abs/2602.11564v1
- Date: Thu, 12 Feb 2026 04:35:16 GMT
- Title: LUVE : Latent-Cascaded Ultra-High-Resolution Video Generation with Dual Frequency Experts
- Authors: Chen Zhao, Jiawei Chen, Hongyu Li, Zhuoliang Kang, Shilin Lu, Xiaoming Wei, Kai Zhang, Jian Yang, Ying Tai
- Abstract summary: LUVE is a framework for ultra-high-resolution (UHR) video generation. It employs a three-stage architecture for motion-consistent latent synthesis, video latent upsampling, and high-resolution content refinement. LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component.
- Score: 48.279914726380376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances in video diffusion models have significantly improved visual quality, yet ultra-high-resolution (UHR) video generation remains a formidable challenge due to the compounded difficulties of motion modeling, semantic planning, and detail synthesis. To address these limitations, we propose \textbf{LUVE}, a \textbf{L}atent-cascaded \textbf{U}HR \textbf{V}ideo generation framework built upon dual frequency \textbf{E}xperts. LUVE employs a three-stage architecture comprising low-resolution motion generation for motion-consistent latent synthesis, video latent upsampling that performs resolution upsampling directly in the latent space to mitigate memory and computational overhead, and high-resolution content refinement that integrates low-frequency and high-frequency experts to jointly enhance semantic coherence and fine-grained detail generation. Extensive experiments demonstrate that LUVE achieves superior photorealism and content fidelity in UHR video generation, and comprehensive ablation studies further validate the effectiveness of each component. The project is available at \href{https://unicornanrocinu.github.io/LUVE_web/}{https://unicornanrocinu.github.io/LUVE_web/}.
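As a rough illustration of how such a latent cascade fits together, the sketch below wires the three stages in PyTorch. The module interfaces, the pixel-shuffle latent upsampler, and the additive fusion of the two experts are assumptions for illustration, not the authors' released code:

```python
# Hypothetical sketch of LUVE's three-stage latent cascade; module names,
# latent shapes, and the expert-fusion rule are assumptions.
import torch
import torch.nn as nn


class LatentUpsampler(nn.Module):
    """Stage 2 (assumed): pixel-shuffle spatial upsampling directly on video latents."""

    def __init__(self, channels: int, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.proj = nn.Conv3d(channels, channels * scale * scale, kernel_size=1)

    def forward(self, z: torch.Tensor) -> torch.Tensor:  # z: (B, C, T, H, W)
        b, c, t, h, w = z.shape
        s = self.scale
        z = self.proj(z).view(b, c, s, s, t, h, w)
        z = z.permute(0, 1, 4, 5, 2, 6, 3)  # (B, C, T, H, s, W, s)
        return z.reshape(b, c, t, h * s, w * s)


def luve_cascade(prompt_emb, motion_model, upsampler, low_freq_expert, high_freq_expert):
    z_lr = motion_model(prompt_emb)            # Stage 1: motion-consistent LR latents
    z_hr = upsampler(z_lr)                     # Stage 2: upsampling in latent space
    z_sem = low_freq_expert(z_hr, prompt_emb)  # Stage 3a: semantic coherence
    z_det = high_freq_expert(z_hr, prompt_emb) # Stage 3b: fine-grained detail
    return z_sem + z_det                       # assumed additive fusion of experts
```

Here `motion_model` and the two experts stand in for the paper's diffusion components; only their assumed calling convention is shown.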
Related papers
- One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion [57.824020826432815]
We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images. We design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process.
arXiv Detail & Related papers (2026-01-20T17:11:55Z)
- Semantic and Temporal Integration in Latent Diffusion Space for High-Fidelity Video Super-Resolution [20.151571582095468]
We propose Semantic and Temporal Guided Video Super-Resolution (SeTe-VSR). Our approach achieves a seamless balance between recovering intricate details and ensuring temporal coherence. Our method not only preserves highly realistic visual content but also significantly enhances fidelity.
arXiv Detail & Related papers (2025-08-01T09:47:35Z)
- Latent Wavelet Diffusion For Ultra-High-Resolution Image Synthesis [56.311477476580926]
We present Latent Wavelet Diffusion (LWD), a lightweight training framework that significantly improves detail and texture fidelity in ultra-high-resolution (2K-4K) image synthesis. LWD introduces a novel, frequency-aware masking strategy derived from wavelet energy maps, which dynamically focuses the training process on detail-rich regions of the latent space.
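The abstract does not give the masking rule; a minimal sketch of the idea, assuming a Haar wavelet, per-channel energy pooling, and a quantile threshold, could look like this:

```python
# Sketch of a frequency-aware mask from wavelet energy maps, in the spirit
# of LWD; the wavelet choice, pooling, and threshold are assumptions.
import numpy as np
import pywt


def wavelet_energy_mask(latent: np.ndarray, quantile: float = 0.7) -> np.ndarray:
    """latent: (C, H, W). Returns an (H, W) binary mask over detail-rich regions."""
    c, h, w = latent.shape
    energy = np.zeros((h, w))
    for channel in latent:
        # Single-level 2D DWT: approximation + (horizontal, vertical, diagonal) details.
        _, (cH, cV, cD) = pywt.dwt2(channel, "haar")
        detail = cH ** 2 + cV ** 2 + cD ** 2  # high-frequency band energy
        # Nearest-neighbour upsample the half-resolution energy map back to (H, W).
        energy += np.kron(detail, np.ones((2, 2)))[:h, :w]
    threshold = np.quantile(energy, quantile)
    return (energy >= threshold).astype(np.float32)
```

Such a mask could, for example, weight the per-element diffusion loss so that detail-rich latent regions dominate training.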
arXiv Detail & Related papers (2025-05-31T07:28:32Z)
- DiVE: Efficient Multi-View Driving Scenes Generation Based on Video Diffusion Transformer [56.98400572837792]
DiVE produces high-fidelity, temporally coherent, and cross-view consistent multi-view videos. Its design innovations collectively achieve a 2.62x speedup with minimal quality degradation.
arXiv Detail & Related papers (2025-04-28T09:20:50Z)
- RAGME: Retrieval Augmented Video Generation for Enhanced Motion Realism [73.38167494118746]
We propose a framework to improve the realism of motion in generated videos. We advocate for the incorporation of a retrieval mechanism during the generation phase. Our pipeline is designed to apply to any text-to-video diffusion model.
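The abstract leaves the retriever unspecified; one minimal instantiation, assumed here, ranks a pre-embedded bank of reference clips by cosine similarity to the prompt embedding:

```python
# Illustrative retrieval step in the spirit of RAGME; the embedding model
# and how retrieved clips condition generation are assumptions.
import torch
import torch.nn.functional as F


def retrieve_motion_refs(prompt_emb: torch.Tensor, bank_embs: torch.Tensor, k: int = 4):
    """prompt_emb: (D,); bank_embs: (N, D). Returns indices of the k best-matching clips."""
    sims = F.cosine_similarity(prompt_emb.unsqueeze(0), bank_embs, dim=-1)  # (N,)
    return sims.topk(k).indices
```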
arXiv Detail & Related papers (2025-04-09T08:14:05Z)
- DiffVSR: Revealing an Effective Recipe for Taming Robust Video Super-Resolution Against Complex Degradations [25.756755602342942]
We present DiffVSR, featuring a Progressive Learning Strategy (PLS) that systematically decomposes the learning burden of complex degradations through staged training. Our framework additionally incorporates an Interweaved Latent Transition (ILT) technique that maintains competitive temporal consistency without additional training overhead.
arXiv Detail & Related papers (2025-01-17T10:53:03Z)
- STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution [42.859188375578604]
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. These models struggle to maintain temporal consistency, as they are trained on static images. We introduce a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency.
arXiv Detail & Related papers (2025-01-06T12:36:21Z)
- Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into the U-Net and VAE-Decoder, maintaining consistency within short sequences; globally, a flow-guided recurrent latent propagation module enhances stability across long sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
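As a sketch of what "integrating temporal layers" can mean in practice, the block below runs self-attention along the time axis only, folding space into the batch; its placement and hyperparameters are assumptions, not the paper's released code:

```python
# Generic temporal self-attention layer of the kind such abstracts describe
# inserting into a U-Net / VAE decoder; details here are assumptions.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    def __init__(self, channels: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, C, T, H, W)
        b, c, t, h, w = x.shape
        # Fold space into the batch so attention runs along the time axis only.
        seq = x.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, c)
        q = self.norm(seq)
        seq = seq + self.attn(q, q, q)[0]  # residual temporal attention
        return seq.reshape(b, h, w, t, c).permute(0, 4, 3, 1, 2)
```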
arXiv Detail & Related papers (2023-12-11T18:54:52Z)
- Towards Interpretable Video Super-Resolution via Alternating Optimization [115.85296325037565]
We study a practical space-time video super-resolution (STVSR) problem which aims at generating a high-framerate, high-resolution, sharp video from a low-framerate, low-resolution, blurry video.
We propose an interpretable STVSR framework by leveraging both model-based and learning-based methods.
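A schematic of such an alternating scheme, with the degradation operator, step size, and learned denoiser as illustrative placeholders rather than the paper's actual components:

```python
# Schematic alternating optimization for STVSR: alternate a model-based
# data-fidelity update with a learned refinement prior (all placeholders).
def alternating_stvsr(y, degrade, degrade_adjoint, denoiser, iters: int = 10, step: float = 0.5):
    """y: low-framerate, low-resolution, blurry video tensor."""
    x = degrade_adjoint(y)  # crude initialization by back-projection
    for _ in range(iters):
        # Model-based step: one gradient step on the data term ||degrade(x) - y||^2.
        x = x - step * degrade_adjoint(degrade(x) - y)
        # Learning-based step: refine with a learned natural-video prior.
        x = denoiser(x)
    return x
```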
arXiv Detail & Related papers (2022-07-21T21:34:05Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.