Block Cascading: Training Free Acceleration of Block-Causal Video Models
- URL: http://arxiv.org/abs/2511.20426v1
- Date: Tue, 25 Nov 2025 15:52:58 GMT
- Title: Block Cascading: Training Free Acceleration of Block-Causal Video Models
- Authors: Hmrishav Bandyopadhyay, Nikhil Pinnaparaju, Rahim Entezari, Jim Scott, Yi-Zhe Song, Varun Jampani,
- Abstract summary: Small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation.
- Score: 87.49370566105999
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Block-causal video generation faces a stark speed-quality trade-off: small 1.3B models manage only 16 FPS while large 14B models crawl at 4.5 FPS, forcing users to choose between responsiveness and quality. Block Cascading significantly mitigates this trade-off through training-free parallelization. Our key insight: future video blocks do not need fully denoised current blocks to begin generation. By starting block generation with partially denoised context from predecessors, we transform sequential pipelines into parallel cascades where multiple blocks denoise simultaneously. With 5 GPUs exploiting temporal parallelism, we achieve ~2x acceleration across all model scales: 1.3B models accelerate from 16 to 30 FPS, and 14B models from 4.5 to 12.5 FPS. Beyond inference speed, Block Cascading eliminates the ~200 ms overhead of KV re-caching during context switches in interactive generation. Extensive evaluations across multiple block-causal pipelines demonstrate no significant loss in generation quality when switching from block-causal to Block Cascading inference. Project Page: https://hmrishavbandy.github.io/block_cascading_page/
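The cascading idea can be illustrated with a toy scheduling calculation: a block may begin denoising once its predecessor is a few steps ahead, so blocks overlap instead of running strictly one after another. The block count, step count, and `LAG` parameter below are illustrative assumptions, not the paper's actual configuration.

```python
# Toy schedule for a block cascade: block b starts once its predecessor
# is LAG denoising steps ahead, so multiple blocks denoise at once.

NUM_BLOCKS = 4   # video blocks generated left to right
NUM_STEPS = 8    # denoising steps per block
LAG = 2          # steps the predecessor must stay ahead

def sequential_schedule():
    """Baseline block-causal pipeline: blocks denoise one at a time."""
    return NUM_BLOCKS * NUM_STEPS

def cascaded_schedule():
    """Cascade: block b starts at time b * LAG and overlaps its neighbors."""
    return max(b * LAG + NUM_STEPS for b in range(NUM_BLOCKS))

print(sequential_schedule())  # 32 step-slots end to end
print(cascaded_schedule())    # 14 step-slots with overlap
```

With one worker per in-flight block, wall-clock time tracks the cascaded schedule rather than the sequential one, which is the source of the multi-GPU speedup described above.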
Related papers
- Helios: Real-Time Long Video Generation Model [33.34372252025333]
Helios is a 14B autoregressive diffusion model with a unified input representation that supports T2V, I2V, and V2V tasks. Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
arXiv Detail & Related papers (2026-03-04T18:45:21Z)
- FlashBlock: Attention Caching for Efficient Long-Context Block Diffusion [51.1618564189244]
FlashBlock is a cached block-external attention mechanism that reuses stable attention outputs, reducing attention computation and KV-cache access without modifying the diffusion process. Experiments on diffusion language models and video generation demonstrate up to 1.44$\times$ higher token throughput and up to a 1.6$\times$ reduction in attention time, with negligible impact on generation quality.
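FlashBlock itself caches stable attention outputs; a related, simpler way to see why caching pays off is to count the projection work saved by reusing K/V for frozen context blocks across denoising steps. Everything below (the `kv` stand-in, the `frozen` flag, the step counts) is an illustrative assumption, not FlashBlock's actual mechanism or API.

```python
import numpy as np

D = 8  # head dimension (illustrative)

def kv(x):
    """Stand-in for the key/value projections of an attention layer."""
    return x * 0.5, x * 0.25

class ContextKVCache:
    """Reuse K/V for frozen context blocks instead of recomputing them."""
    def __init__(self):
        self.store = {}
        self.projections = 0  # how many times we actually projected

    def get(self, block_id, x, frozen):
        if frozen and block_id in self.store:
            return self.store[block_id]        # cache hit: no recompute
        self.projections += 1
        self.store[block_id] = kv(x)
        return self.store[block_id]

cache = ContextKVCache()
ctx = np.ones((4, D))
for step in range(10):                         # 10 denoising steps
    cache.get(0, ctx, frozen=True)             # context block: cached
    cache.get(1, np.random.randn(4, D), frozen=False)  # active block
print(cache.projections)                       # 11 projections, not 20
```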
arXiv Detail & Related papers (2026-02-05T04:57:21Z)
- StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming, by contrast, operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
- BWCache: Accelerating Video Diffusion Transformers through Block-Wise Caching [6.354675628412448]
Block-Wise Caching (BWCache) is a training-free method to accelerate DiT-based video generation. Experiments on several video diffusion models demonstrate that BWCache achieves up to a 2.24$\times$ speedup with comparable visual quality.
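A minimal sketch of the caching pattern, assuming a recompute-on-drift rule: a block's cached output is reused while its input stays within a tolerance of the last fully computed input. The tolerance, the drift model, and the `heavy_block` stand-in are illustrative assumptions, not BWCache's actual similarity criterion.

```python
import numpy as np

def heavy_block(x):
    """Stand-in for an expensive DiT transformer block."""
    return np.tanh(x)

class BlockCache:
    """Recompute a block only when its input drifts past a tolerance."""
    def __init__(self, tol=1e-2):
        self.tol = tol
        self.last_input = None
        self.last_output = None
        self.recomputes = 0

    def __call__(self, x):
        if (self.last_input is not None
                and np.linalg.norm(x - self.last_input) < self.tol):
            return self.last_output            # cache hit: skip the block
        self.last_input = x.copy()
        self.last_output = heavy_block(x)
        self.recomputes += 1
        return self.last_output

cache = BlockCache(tol=1e-2)
x = np.zeros(4)
for step in range(10):
    x = x + 1e-4                               # input drifts slowly per step
    y = cache(x)
print(cache.recomputes)                        # 1 recompute over 10 steps
```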
arXiv Detail & Related papers (2025-09-17T07:58:36Z)
- Minute-Long Videos with Dual Parallelisms [57.22737565366549]
Diffusion Transformer (DiT)-based video diffusion models generate high-quality videos at scale but incur prohibitive processing latency and memory costs for long videos. We propose a novel distributed inference strategy, termed DualParal. Instead of generating an entire video on a single GPU, we parallelize both temporal frames and model layers across GPUs.
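A toy routing rule shows what parallelizing along two axes can look like: (frame-block, layer-chunk) work items are assigned to a 2D grid of GPUs. The grid shape and the modulo routing are illustrative assumptions, not DualParal's actual partitioning scheme.

```python
from collections import Counter

TEMPORAL_GPUS = 2   # GPUs along the temporal (frame-block) axis
LAYER_GPUS = 2      # GPUs along the model-layer axis
NUM_BLOCKS = 4      # frame blocks in the video
NUM_LAYERS = 8      # transformer layers in the model

def gpu_for(block, layer):
    """Route one (block, layer) work unit to a 2D GPU coordinate."""
    return (block % TEMPORAL_GPUS, layer * LAYER_GPUS // NUM_LAYERS)

# Count work units per GPU: the 4x8 work grid should divide evenly.
load = Counter(gpu_for(b, l)
               for b in range(NUM_BLOCKS)
               for l in range(NUM_LAYERS))
print(dict(load))   # each of the 4 GPUs handles 8 units
```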
arXiv Detail & Related papers (2025-05-27T11:55:22Z)
- Next Block Prediction: Video Generation via Semi-Autoregressive Modeling [92.60177942930946]
Next-Block Prediction (NBP) is a semi-autoregressive (semi-AR) framework for video generation. NBP employs bidirectional attention within each block, enabling tokens to capture more robust spatial dependencies. Our model achieves FVD scores of 103.3 on UCF101 and 25.5 on K600, outperforming the vanilla NTP model by an average of 4.4.
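The semi-AR attention pattern (bidirectional within a block, causal across blocks) can be written down as a boolean mask; the block size and sequence length below are illustrative.

```python
import numpy as np

def semi_ar_mask(seq_len, block_size):
    """Token i may attend to token j iff j's block is no later than i's:
    fully bidirectional inside a block, causal across blocks."""
    blocks = np.arange(seq_len) // block_size
    return blocks[None, :] <= blocks[:, None]

mask = semi_ar_mask(6, 2)
print(mask.astype(int))
# Token 0 sees token 1 (same block) but not token 2 (later block).
```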
arXiv Detail & Related papers (2025-02-11T17:57:53Z)
- Efficient Motion Modelling with Variable-sized blocks from Hierarchical Cuboidal Partitioning [24.100530697346155]
Motion modelling with a block-based architecture has been widely used in video coding, where a frame is divided into fixed-sized blocks that are motion-compensated independently.
We have investigated the potential of cuboids in motion modelling against the fixed-sized blocks used in scalable video coding.
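A toy version of variable-sized partitioning, assuming a recursive split-on-variance rule: regions with near-uniform motion stay as one block, others are subdivided. The threshold and the four-way split are illustrative stand-ins, not the paper's actual hierarchical cuboidal partitioning algorithm.

```python
import numpy as np

def partition(motion, r0, r1, c0, c1, tol=0.5, out=None):
    """Recursively split a motion-magnitude field into rectangles
    until each rectangle's motion is roughly uniform."""
    if out is None:
        out = []
    region = motion[r0:r1, c0:c1]
    if region.std() <= tol or (r1 - r0) <= 1 or (c1 - c0) <= 1:
        out.append((r0, r1, c0, c1))           # uniform enough: one block
        return out
    rm, cm = (r0 + r1) // 2, (c0 + c1) // 2    # split along both axes
    for a, b in ((r0, rm), (rm, r1)):
        for c, d in ((c0, cm), (cm, c1)):
            partition(motion, a, b, c, d, tol, out)
    return out

motion = np.zeros((4, 4))
motion[2:, 2:] = 5.0                           # moving object in one corner
blocks = partition(motion, 0, 4, 0, 4)
print(len(blocks))                             # 4 variable regions, not 16
```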
arXiv Detail & Related papers (2022-08-28T04:13:58Z)
- 1$\times$N Block Pattern for Network Sparsity [90.43191747596491]
We propose a novel $1\times N$ block sparsity pattern (block pruning) to break this limitation.
Our pattern obtains about a 3.0% improvement over filter pruning in top-1 accuracy on MobileNet-V2.
It also saves 56.04 ms of inference time on a Cortex-A7 CPU relative to weight pruning.
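A sketch of the pruning pattern, assuming blocks of N consecutive weights within each filter row are scored by L1 norm and the weakest blocks are zeroed; N, the sparsity level, and the scoring rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def prune_1xN(weight, n=4, sparsity=0.5):
    """Zero the 1xN weight blocks with the smallest L1 norms."""
    rows, cols = weight.shape
    assert cols % n == 0
    blocks = weight.reshape(rows, cols // n, n)
    scores = np.abs(blocks).sum(axis=-1)       # L1 norm per 1xN block
    k = int(scores.size * sparsity)            # number of blocks to prune
    thresh = np.sort(scores.ravel())[k - 1] if k else -np.inf
    keep = scores > thresh                     # drop blocks at/below thresh
    return (blocks * keep[..., None]).reshape(rows, cols)

w = np.arange(1.0, 17.0).reshape(2, 8)
pruned = prune_1xN(w, n=4, sparsity=0.5)
print((pruned == 0).mean())                    # 0.5 of weights removed
```

Because entire 1xN runs are zeroed (rather than scattered single weights), the surviving weights stay contiguous in memory, which is what makes this pattern hardware-friendly on CPUs.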
arXiv Detail & Related papers (2021-05-31T05:50:33Z)