GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers
- URL: http://arxiv.org/abs/2512.03451v1
- Date: Wed, 03 Dec 2025 05:08:18 GMT
- Title: GalaxyDiT: Efficient Video Generation with Guidance Alignment and Adaptive Proxy in Diffusion Transformers
- Authors: Zhiye Song, Steve Dai, Ben Keller, Brucek Khailany
- Abstract summary: GalaxyDiT is a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
- Score: 5.2424169748898555
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Diffusion models have revolutionized video generation, becoming essential tools in creative content generation and physical simulation. Transformer-based architectures (DiTs) and classifier-free guidance (CFG) are two cornerstones of this success, enabling strong prompt adherence and realistic video quality. Despite their versatility and superior performance, these models require intensive computation. Each video generation requires dozens of iterative steps, and CFG doubles the required compute. This inefficiency hinders broader adoption in downstream applications. We introduce GalaxyDiT, a training-free method to accelerate video generation with guidance alignment and systematic proxy selection for reuse metrics. Through rank-order correlation analysis, our technique identifies the optimal proxy for each video model, across model families and parameter scales, thereby ensuring optimal computational reuse. We achieve $1.87\times$ and $2.37\times$ speedup on Wan2.1-1.3B and Wan2.1-14B with only 0.97% and 0.72% drops on the VBench-2.0 benchmark. At high speedup rates, our approach maintains superior fidelity to the base model, exceeding prior state-of-the-art approaches by 5 to 10 dB in peak signal-to-noise ratio (PSNR).
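A minimal sketch of the proxy-selection idea described above: rank candidate reuse signals by Spearman rank-order correlation against a reference error measured on calibration prompts, and keep the best-correlated one. The candidate names, signals, and calibration data below are illustrative assumptions, not GalaxyDiT's actual metrics or API.

```python
# Hypothetical sketch of proxy selection via rank-order correlation:
# pick the cheap per-step signal that best rank-correlates with the true
# reuse error. All names and data here are illustrative, not GalaxyDiT's.
import numpy as np
from scipy.stats import spearmanr

def select_proxy(candidate_signals: dict[str, np.ndarray],
                 reference_error: np.ndarray) -> str:
    """Return the candidate whose per-step signal best rank-correlates
    with the reference reuse error measured on calibration prompts."""
    best_name, best_rho = None, -np.inf
    for name, signal in candidate_signals.items():
        rho, _ = spearmanr(signal, reference_error)
        if abs(rho) > best_rho:          # sign-agnostic: monotone either way
            best_name, best_rho = name, abs(rho)
    return best_name

# Toy calibration data over 50 denoising steps (illustrative only).
steps = 50
rng = np.random.default_rng(0)
true_err = rng.random(steps)             # stand-in for actual output deviation
candidates = {
    "latent_l2_delta": true_err + 0.1 * rng.standard_normal(steps),
    "attn_output_delta": rng.random(steps),   # uncorrelated proxy
}
print(select_proxy(candidates, true_err))     # -> "latent_l2_delta"
```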
Related papers
- LESA: Learnable Stage-Aware Predictors for Diffusion Model Acceleration [12.183601881545039]
Diffusion models have achieved remarkable success in image and video generation tasks. However, the high computational demands of Diffusion Transformers pose a significant challenge to their practical deployment. We propose a LEarnable Stage-Aware (LESA) predictor framework based on two-stage training.
arXiv Detail & Related papers (2026-02-24T02:53:28Z)
- VDOT: Efficient Unified Video Creation via Optimal Transport Distillation [70.02065520468726]
We propose an efficient unified video creation model, named VDOT. We employ a novel computational optimal transport (OT) technique to optimize the discrepancy between the real and fake score distributions. To support training unified video creation models, we propose a fully automated pipeline for video data annotation and filtering.
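The OT discrepancy mentioned above can be illustrated with a generic entropy-regularized (Sinkhorn) distance between two sample sets; this is a textbook sketch under assumed uniform weights and a squared cost, not VDOT's actual training objective.

```python
# Generic entropic optimal-transport (Sinkhorn) discrepancy between two
# 1-D sample sets; illustrative only, not VDOT's actual objective.
import numpy as np

def sinkhorn_distance(x, y, eps=0.05, iters=200):
    """Entropy-regularized OT cost between empirical distributions x, y."""
    C = (x[:, None] - y[None, :]) ** 2       # squared-distance cost matrix
    C = C / C.max()                          # normalize for numerical stability
    K = np.exp(-C / eps)                     # Gibbs kernel
    a = np.full(len(x), 1.0 / len(x))        # uniform source weights
    b = np.full(len(y), 1.0 / len(y))        # uniform target weights
    u = np.ones_like(a)
    for _ in range(iters):                   # Sinkhorn fixed-point updates
        v = b / (K.T @ u)
        u = a / (K @ v)
    P = u[:, None] * K * v[None, :]          # optimal transport plan
    return float((P * C).sum())

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, 256)             # stand-in "real" scores
fake = rng.normal(0.5, 1.2, 256)             # stand-in "fake" scores
print(sinkhorn_distance(real, fake))
```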
arXiv Detail & Related papers (2025-12-07T11:31:00Z)
- StreamDiffusionV2: A Streaming System for Dynamic and Interactive Video Generation [65.90400162290057]
Generative models are reshaping the live-streaming industry by redefining how content is created, styled, and delivered. Recent advances in video diffusion have markedly improved temporal consistency and sampling efficiency for offline generation. Live online streaming operates under strict service-level objectives (SLOs): time-to-first-frame must be minimal, and every frame must meet a per-frame deadline with low jitter.
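A toy illustration of those streaming SLOs, measuring time-to-first-frame and per-frame deadline misses; the frame rate, deadline, and fake generator are assumptions for demonstration, not StreamDiffusionV2's scheduler.

```python
# Toy SLO check: minimal time-to-first-frame (TTFF) and a per-frame deadline.
# Thresholds and the fake generator are illustrative assumptions.
import time

FRAME_DEADLINE_S = 1.0 / 24              # e.g., 24 fps -> ~41.7 ms per frame

def stream(generate_frame, n_frames=72):
    t0 = time.perf_counter()
    first_frame_latency = None
    misses = 0
    for i in range(n_frames):
        frame_start = time.perf_counter()
        generate_frame(i)                     # one generation step per frame
        elapsed = time.perf_counter() - frame_start
        if first_frame_latency is None:
            first_frame_latency = time.perf_counter() - t0
        if elapsed > FRAME_DEADLINE_S:        # SLO violation
            misses += 1
    print(f"TTFF {first_frame_latency*1e3:.1f} ms, "
          f"{misses}/{n_frames} deadline misses")

stream(lambda i: time.sleep(0.03))            # fake 30 ms-per-frame generator
```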
arXiv Detail & Related papers (2025-11-10T18:51:28Z)
- Large Scale Diffusion Distillation via Score-Regularized Continuous-Time Consistency [60.74505433956616]
The continuous-time consistency model (sCM) is theoretically principled and empirically powerful for accelerating academic-scale diffusion. We first develop a parallelism-compatible FlashAttention-2 JVP kernel, enabling sCM training on models with over 10 billion parameters and high-dimensional video tasks. We propose the score-regularized continuous-time consistency model (rCM), which incorporates score distillation as a long-skip regularizer.
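The "FlashAttention-2 JVP kernel" refers to computing a Jacobian-vector product through attention, which continuous-time consistency training requires. As a hedged illustration, the following computes that quantity with PyTorch's generic forward-mode autodiff on naive softmax attention; it does not reproduce the paper's fused kernel.

```python
# Forward-mode JVP through naive softmax attention via torch.func.jvp.
# This is the quantity sCM-style training needs, computed here with generic
# autodiff rather than the paper's fused FlashAttention-2 kernel.
import torch
from torch.func import jvp

def attention(q, k, v):
    scores = q @ k.T / q.shape[-1] ** 0.5        # scaled dot-product
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(16, 64) for _ in range(3))
tangents = tuple(torch.ones(16, 64) for _ in range(3))

out, dout = jvp(attention, (q, k, v), tangents)  # value + directional deriv
print(out.shape, dout.shape)                     # torch.Size([16, 64]) x2
```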
arXiv Detail & Related papers (2025-10-09T16:45:30Z)
- BLADE: Block-Sparse Attention Meets Step Distillation for Efficient Video Generation [27.57431718095974]
We present BLADE, a data-free joint training framework for video inference. We show remarkable efficiency gains across different scales. On models such as CogVideoX-5B with short video sequence lengths, our framework delivers a robust 8.89x speedup.
arXiv Detail & Related papers (2025-08-14T15:58:59Z)
- Seedance 1.0: Exploring the Boundaries of Video Generation Models [71.26796999246068]
Seedance 1.0 is a high-performance and inference-efficient video foundation generation model. It integrates multi-source curated data augmented with precise and meaningful video captioning. Seedance 1.0 can generate a 5-second video at 1080p resolution in only 41.4 seconds (on an NVIDIA L20).
arXiv Detail & Related papers (2025-06-10T17:56:11Z)
- Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers [29.130090574300635]
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their compute demands pose a major challenge for practical deployment. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target.
arXiv Detail & Related papers (2025-06-05T14:41:38Z)
- Vchitect-2.0: Parallel Transformer for Scaling Up Video Diffusion Models [89.79067761383855]
Vchitect-2.0 is a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames. To overcome memory and computational bottlenecks, we propose a Memory-efficient Training framework.
arXiv Detail & Related papers (2025-01-14T21:53:11Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
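The bidirectional-to-autoregressive adaptation amounts to restricting attention so each frame's tokens see only earlier frames. A minimal sketch of such a block-causal mask, with illustrative shapes and mask convention:

```python
# Sketch of the attention-mask change behind adapting a bidirectional video
# DiT to autoregressive generation: tokens in frame i may only attend to
# tokens in frames <= i (block-causal), instead of all frames.
import numpy as np

def block_causal_mask(n_frames: int, tokens_per_frame: int) -> np.ndarray:
    """True where attention is allowed."""
    frame_id = np.repeat(np.arange(n_frames), tokens_per_frame)
    return frame_id[:, None] >= frame_id[None, :]   # query frame >= key frame

mask = block_causal_mask(n_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]      frame 0 sees only frame 0,
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]      frame 1 sees frames 0-1,
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]      frame 2 sees frames 0-2
#  [1 1 1 1 1 1]]     (vs. the all-ones bidirectional mask).
```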
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.