BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
- URL: http://arxiv.org/abs/2511.22973v1
- Date: Fri, 28 Nov 2025 08:25:59 GMT
- Title: BlockVid: Block Diffusion for High-Quality and Consistent Minute-Long Video Generation
- Authors: Zeyu Zhang, Shuning Chang, Yuanyu He, Yizeng Han, Jiasheng Tang, Fan Wang, Bohan Zhuang
- Abstract summary: BlockVid is a novel block diffusion framework equipped with a semantic-aware sparse KV cache. LV-Bench is a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence.
- Score: 44.45173635133032
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Generating minute-long videos is a critical step toward developing world models, providing a foundation for realistic extended scenes and advanced AI simulators. The emerging semi-autoregressive (block diffusion) paradigm integrates the strengths of diffusion and autoregressive models, enabling arbitrary-length video generation and improving inference efficiency through KV caching and parallel sampling. However, it still faces two enduring challenges: (i) KV-cache-induced long-horizon error accumulation, and (ii) the lack of fine-grained long-video benchmarks and coherence-aware metrics. To overcome these limitations, we propose BlockVid, a novel block diffusion framework equipped with a semantic-aware sparse KV cache, an effective training strategy called Block Forcing, and dedicated chunk-wise noise scheduling and shuffling to reduce error propagation and enhance temporal consistency. We further introduce LV-Bench, a fine-grained benchmark for minute-long videos, complete with new metrics evaluating long-range coherence. Extensive experiments on VBench and LV-Bench demonstrate that BlockVid consistently outperforms existing methods in generating high-quality, coherent minute-long videos. In particular, it achieves a 22.2% improvement on VDE Subject and a 19.4% improvement on VDE Clarity in LV-Bench over state-of-the-art approaches. Project website: https://ziplab.co/BlockVid. Inferix (Code): https://github.com/alibaba-damo-academy/Inferix.
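To make the paradigm concrete, here is a minimal sketch of a semi-autoregressive (block diffusion) sampling loop with a sparse KV cache, in the spirit of the abstract. Every name and rule below (`denoise_step`, `prune_kv_cache`, the recency-based pruning, the tensor shapes) is a hypothetical stand-in, not BlockVid's actual design, whose cache selection is semantic-aware rather than recency-based.

```python
import numpy as np

BLOCK_LEN, NUM_BLOCKS, STEPS = 16, 4, 8   # frames per block, blocks to generate, denoising steps

def denoise_step(block, kv_cache, t):
    # Stand-in for one denoiser call conditioned on cached context blocks.
    return block * (1.0 - 1.0 / (t + 2))

def prune_kv_cache(cache, keep=2):
    # Stand-in for the sparse cache: keep only the most recent blocks here;
    # BlockVid's actual selection rule is semantic, not recency-based.
    return cache[-keep:]

rng = np.random.default_rng(0)
kv_cache, video = [], []
for b in range(NUM_BLOCKS):
    block = rng.standard_normal((BLOCK_LEN, 64, 64, 3))   # start each block from noise
    for t in reversed(range(STEPS)):                      # iterative within-block denoising
        block = denoise_step(block, kv_cache, t)
    video.append(block)
    kv_cache = prune_kv_cache(kv_cache + [block])         # cache context, then sparsify
video = np.concatenate(video)                             # arbitrary-length output
print(video.shape)                                        # (64, 64, 64, 3)
```

Because each block conditions only on the (sparsified) cache rather than on all prior frames, generation length is unbounded while per-block cost stays flat; the sparsification step is where cache-induced error accumulation is fought.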
Related papers
- FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion [51.1618564189244]
FlashBlock is a cached block-external attention mechanism that reuses stable attention output, reducing attention computation and KV cache access without modifying the diffusion process. Experiments on diffusion language models and video generation demonstrate up to 1.44× higher token throughput and up to 1.6× reduction in attention time, with negligible impact on generation quality.
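A hedged toy of the caching idea described above: recompute attention over block-external (context) tokens only when the queries have drifted past a threshold, and otherwise reuse the cached output. The drift test, threshold, and shapes are all illustrative assumptions, not FlashBlock's actual mechanism.

```python
import numpy as np

def attention(q, k, v):
    w = np.exp(q @ k.T / np.sqrt(q.shape[-1]))     # scaled dot-product attention
    return (w / w.sum(-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
k_ext = rng.standard_normal((64, 32))   # block-external (context) keys
v_ext = rng.standard_normal((64, 32))   # block-external (context) values
q = rng.standard_normal((8, 32))        # current block's queries

cached_out, last_q, tau, recomputes = None, None, 0.5, 0
for step in range(10):
    q = q + 0.01 * rng.standard_normal(q.shape)    # queries drift slowly across steps
    if last_q is None or np.linalg.norm(q - last_q) > tau:
        cached_out, last_q = attention(q, k_ext, v_ext), q.copy()
        recomputes += 1                            # only then pay for attention + KV access
    out = cached_out                               # stable output is reused as-is
print(f"recomputed {recomputes} of 10 steps")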
arXiv Detail & Related papers (2026-02-05T04:57:21Z)
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks.
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation [70.02065520468726]
We propose an efficient unified video creation model, named VDOT. We employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering.
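As a toy illustration of using an optimal-transport discrepancy as a training signal (not VDOT's actual objective), the empirical Wasserstein-1 distance between two equal-size 1-D sample sets reduces to a sort-and-match computation:

```python
import numpy as np

def w1_distance(a, b):
    # Empirical Wasserstein-1 between equal-size 1-D samples: match sorted order.
    return float(np.abs(np.sort(a) - np.sort(b)).mean())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 2048)   # stand-in for "real" score samples
fake = rng.normal(0.7, 1.3, 2048)   # stand-in for the student's "fake" samples
print(f"OT discrepancy to minimize: {w1_distance(real, fake):.3f}")
```

A distillation loop would minimize such a discrepancy so the student's distribution moves toward the teacher's; in high dimensions this requires more elaborate OT solvers than this 1-D shortcut.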
arXiv Detail & Related papers (2025-12-07T11:31:00Z)
- Uniform Discrete Diffusion with Metric Path for Video Generation [103.86033350602908]
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-duration inconsistency. We present URSA, a uniform discrete diffusion framework with a metric path that bridges the gap with continuous approaches for scalable video generation. URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
arXiv Detail & Related papers (2025-10-28T17:59:57Z)
- Efficient Video Sampling: Pruning Temporally Redundant Tokens for Faster VLM Inference [5.146388234814547]
Long videos often exceed the token budget of modern language models, leading to severe context limitations and latency issues. We introduce Efficient Video Sampling (EVS), a simple, plug-and-play method for reducing token redundancy in videos by identifying and pruning temporally static patches. EVS substantially reduces token count while maintaining semantic fidelity, enabling faster inference and longer input sequences.
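A minimal sketch of the pruning rule this summary describes, under assumed details (16×16 patches, mean intensity as the patch signature, a hand-picked threshold); EVS's actual criterion may differ.

```python
import numpy as np

def keep_mask(frames, patch=16, tau=0.05):
    # Mark which patch tokens to keep: a patch whose content barely changes
    # from the previous frame is considered temporally static and pruned.
    T, H, W = frames.shape
    p = frames.reshape(T, H // patch, patch, W // patch, patch).mean((2, 4))
    diff = np.abs(np.diff(p, axis=0))       # per-patch temporal change
    mask = np.ones_like(p, dtype=bool)      # always keep the first frame's patches
    mask[1:] = diff > tau                   # later frames: keep only changing patches
    return mask

frames = np.random.rand(8, 64, 64)          # toy grayscale clip: 8 frames of 64x64
m = keep_mask(frames)
print(f"kept {m.sum()} of {m.size} patch tokens")
```

Only the surviving patch tokens would be handed to the VLM, which is what shrinks the context and latency.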
arXiv Detail & Related papers (2025-10-16T12:34:38Z)
- InfVSR: Breaking Length Limits of Generic Video Super-Resolution [40.30527504651693]
InfVSR is an autoregressive-one-step-diffusion paradigm for long sequences. We distill the diffusion process into a single step efficiently, with patch-wise pixel supervision and cross-chunk distribution matching. Our method pushes the frontier of long-form VSR, achieves state-of-the-art quality with enhanced semantic consistency, and delivers up to 58× speed-up over existing methods.
arXiv Detail & Related papers (2025-10-01T14:21:45Z)
- LongScape: Advancing Long-Horizon Embodied World Models with Context-Aware MoE [16.561410415129778]
LongScape is a hybrid framework that combines intra-chunk diffusion denoising with inter-chunk autoregressive causal generation. Our core innovation is an action-guided, variable-length chunking mechanism that partitions video based on the semantic context of robotic actions.
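A toy version of the action-guided, variable-length chunking just described, assuming per-frame discrete action ids are available; LongScape's mechanism is learned and context-aware, whereas this simply splits wherever the label changes.

```python
import numpy as np

actions = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2, 0])   # synthetic per-frame action ids
boundaries = np.flatnonzero(np.diff(actions)) + 1     # indices where the action changes
chunks = np.split(np.arange(len(actions)), boundaries)
print([c.tolist() for c in chunks])                   # [[0, 1, 2], [3, 4], [5, 6, 7, 8], [9]]
```

Each resulting chunk covers one semantic action, so chunk lengths vary with behavior instead of being a fixed frame count.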
arXiv Detail & Related papers (2025-09-26T02:47:05Z)
- BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching [6.354675628412448]
Block-Wise Caching (BWCache) is a training-free method to accelerate DiT-based video generation. Experiments on several video diffusion models demonstrate that BWCache achieves up to 2.24× speedup with comparable visual quality.
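A sketch of the block-wise caching idea under stated assumptions: a stub transformer block, slowly drifting inputs across diffusion timesteps, and a hand-set similarity threshold. None of this is BWCache's actual reuse criterion.

```python
import numpy as np

class StubBlock:
    # Stand-in for a DiT block; any deterministic function works for the demo.
    def __init__(self, d, rng):
        self.w = rng.standard_normal((d, d)) / np.sqrt(d)
    def __call__(self, x):
        return np.tanh(x @ self.w)

rng = np.random.default_rng(0)
block, tau = StubBlock(32, rng), 1e-2
latent = rng.standard_normal((8, 32))
cache_in, cache_out, hits = None, None, 0
for t in range(10):
    latent = latent + 1e-4 * rng.standard_normal(latent.shape)  # small per-step change
    if cache_in is not None and np.abs(latent - cache_in).mean() < tau:
        out, hits = cache_out, hits + 1      # feature barely moved: reuse cached output
    else:
        out = block(latent)                  # otherwise recompute and refresh the cache
        cache_in, cache_out = latent.copy(), out
print(f"cache hits: {hits} of 10 timesteps")
```

Because adjacent diffusion timesteps produce highly similar block inputs, most steps hit the cache, which is where the speedup comes from.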
arXiv Detail & Related papers (2025-09-17T07:58:36Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs.
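Schematically, the train-test gap looks like this: teacher forcing conditions each step on ground truth, while a Self-Forcing-style rollout conditions on the model's own outputs and supervises against ground truth. The stub "model" and feature shapes below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.standard_normal((10, 4))         # ground-truth frame features

def model(context):
    # Stub one-step predictor; a real model would be a video diffusion net.
    return context[-1] * 0.9

context, loss = [gt[0]], 0.0
for t in range(1, len(gt)):
    pred = model(context)                          # condition on OWN past outputs...
    loss += float(np.mean((pred - gt[t]) ** 2))    # ...supervise against ground truth
    context.append(pred)                           # teacher forcing would append gt[t] here
print(f"rollout loss: {loss / (len(gt) - 1):.3f}")
```

Training on this rollout loss exposes the model to its own imperfect context, the same distribution it sees at inference time.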
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- Fast-dLLM: Training-free Acceleration of Diffusion LLM by Enabling KV Cache and Parallel Decoding [51.711605076319216]
Diffusion-based large language models (Diffusion LLMs) have shown promise for non-autoregressive text generation with parallel decoding capabilities. We introduce a novel block-wise approximate KV Cache mechanism tailored for bidirectional diffusion models, enabling cache reuse with negligible performance drop. We propose a confidence-aware parallel decoding strategy that selectively decodes tokens exceeding a confidence threshold, mitigating dependency violations and maintaining generation quality.
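A toy of the confidence-aware parallel decoding strategy described above: each round commits all still-masked positions whose top probability clears a threshold, with a fallback to guarantee progress. The random distributions stand in for a diffusion LM's per-position confidences, and the threshold is an arbitrary assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
L, V, tau = 12, 50, 0.9
tokens = np.full(L, -1)                            # -1 marks a still-masked position
rounds = 0
while (tokens == -1).any():
    probs = rng.dirichlet(np.full(V, 0.05), L)     # stand-in per-position distributions
    conf, pred = probs.max(-1), probs.argmax(-1)
    commit = (tokens == -1) & (conf >= tau)        # parallel-decode confident positions
    if not commit.any():                           # fallback: take the single best masked one
        best = int(np.where(tokens == -1, conf, -1.0).argmax())
        commit = np.zeros(L, dtype=bool)
        commit[best] = True
    tokens[commit] = pred[commit]
    rounds += 1
print(f"decoded {L} tokens in {rounds} rounds")
```

Low-confidence positions wait for later rounds, when more of their neighbors are fixed, which is how dependency violations are kept in check while still decoding many tokens per step.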
arXiv Detail & Related papers (2025-05-28T17:39:15Z)