Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10$\times$
- URL: http://arxiv.org/abs/2512.13492v1
- Date: Mon, 15 Dec 2025 16:25:39 GMT
- Title: Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10$\times$
- Authors: Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo, Weijian Cao, Zhenye Gan, Xiaobin Hu, Zhucun Xue, Chengjie Wang
- Abstract summary: Native 4K video generation remains a critical challenge due to the quadratic computational explosion of full-attention as resolution increases. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3-Video}$ that significantly reduces compute requirements by optimizing the forward logic of pretrained full-attention models. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches.
- Score: 91.61519033897424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video
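For intuition, here is a minimal NumPy sketch of the window-attention pattern the abstract gestures at: full attention runs only inside non-overlapping local windows, and one set of pretrained projection weights is reused at several window scales. The window sizes, the averaging over scales, and all shapes are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, wq, wk, wv, window):
    """Full attention inside non-overlapping windows of `window` tokens.

    tokens: (L, d); wq/wk/wv: the same (d, d) projections are reused at
    every scale, so no new parameters are introduced per window size.
    """
    L, d = tokens.shape
    out = np.empty_like(tokens)
    for start in range(0, L, window):
        x = tokens[start:start + window]          # one local window
        q, k, v = x @ wq, x @ wk, x @ wv
        attn = softmax(q @ k.T / np.sqrt(d))      # O(window^2), not O(L^2)
        out[start:start + window] = attn @ v
    return out

rng = np.random.default_rng(0)
L, d = 64, 16
x = rng.standard_normal((L, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
# Hypothetical multi-scale step: the same weights applied at several window
# scales, with outputs averaged across scales.
y = np.mean([window_attention(x, wq, wk, wv, w) for w in (8, 16, 32)], axis=0)
print(y.shape)  # (64, 16)
```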
Related papers
- CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video [86.80231588752957]
We introduce a novel cube-temporal autoregressive diffusion model that generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned order. Experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality.
arXiv Detail & Related papers (2026-03-04T17:06:56Z)
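A toy sketch of the autoregressive cubemap idea: the 360° frame splits into six named faces generated one at a time, each conditioned on the faces produced so far. The face names and the generation order below are hypothetical, not CubeComposer's actual schedule.

```python
# Face names and generation order are assumptions for illustration.
CUBE_FACES = ["front", "right", "back", "left", "top", "bottom"]

def autoregressive_faces(generate_face):
    """Synthesize the six cubemap faces one by one; each call sees the
    faces generated so far as conditioning context."""
    context = {}
    for face in CUBE_FACES:
        context[face] = generate_face(face, dict(context))
    return context

# Toy generator: each "face" just records which faces it was conditioned on.
result = autoregressive_faces(lambda face, ctx: sorted(ctx))
print(result["bottom"])  # ['back', 'front', 'left', 'right', 'top']
```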
- H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
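A minimal sketch of the prune-then-recover pattern such a tokenizer relies on. The uniform-stride pruning and copy-forward recovery here are placeholder assumptions; the paper's actual selection and recovery rules differ.

```python
import numpy as np

def prune(tokens, keep_every=4):
    """Keep one frame token out of every `keep_every` for the middle layers."""
    idx = np.arange(0, len(tokens), keep_every)
    return tokens[idx], idx

def recover(kept, idx, length):
    """Restore the full temporal length by copying each kept token forward."""
    full_idx = np.searchsorted(idx, np.arange(length), side="right") - 1
    return kept[full_idx]

x = np.random.default_rng(1).standard_normal((64, 8))  # 64 frames, dim 8
kept, idx = prune(x)            # only 16 tokens pass through the deep layers
y = recover(kept, idx, len(x))  # back to 64 tokens for per-frame outputs
print(kept.shape, y.shape)      # (16, 8) (64, 8)
```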
- PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation [18.2095668161519]
Pusa is a groundbreaking paradigm that enables fine-grained temporal control within a unified video diffusion framework. We set a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32%. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis.
arXiv Detail & Related papers (2025-07-22T00:09:37Z)
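The "vectorized timestep" idea can be sketched in a few lines: each frame carries its own diffusion timestep rather than sharing one scalar, so a conditioning frame can stay clean while the rest are noised, which is what enables I2V-style control. The linear interpolation noise schedule below is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def noise_frames(frames, timesteps, rng):
    """frames: (F, ...) clean video; timesteps: (F,) in [0, 1], one per frame."""
    t = timesteps.reshape(-1, *([1] * (frames.ndim - 1)))
    eps = rng.standard_normal(frames.shape)
    return (1.0 - t) * frames + t * eps  # t=0: clean, t=1: pure noise

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 4, 3))  # 8 frames of a toy 4x4 RGB video
t = np.array([0.0] + [0.8] * 7)            # condition on frame 0, noise the rest
noisy = noise_frames(video, t, rng)
print(np.allclose(noisy[0], video[0]))     # True: frame 0 is untouched
```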
- Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers [29.130090574300635]
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their compute demands pose a major challenge for practical deployment. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target.
arXiv Detail & Related papers (2025-06-05T14:41:38Z)
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [59.80405282381126]
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability. We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
arXiv Detail & Related papers (2025-02-03T19:29:16Z)
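A rough sketch of the two sparsity patterns such training-free methods exploit in 3D full attention: a "spatial" mask keeping attention within each frame, and a "temporal" mask keeping attention on the same spatial location across frames. The frame-major token layout is a simplifying assumption.

```python
import numpy as np

def spatial_mask(frames, hw):
    """Each token attends only to tokens in its own frame."""
    frame_id = np.repeat(np.arange(frames), hw)
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(frames, hw):
    """Each token attends only to its own spatial location in every frame."""
    pos_id = np.tile(np.arange(hw), frames)
    return pos_id[:, None] == pos_id[None, :]

m_s, m_t = spatial_mask(4, 6), temporal_mask(4, 6)  # 4 frames, 6 tokens each
print(m_s.mean(), m_t.mean())  # densities 0.25 and ~0.167, vs 1.0 for dense
```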
- Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D attention in the context of video prediction. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
arXiv Detail & Related papers (2024-10-07T03:52:06Z)
- FlashVideo: A Framework for Swift Inference in Text-to-Video Generation [9.665089218030086]
This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation.
FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed.
Our comprehensive experiments demonstrate that FlashVideo achieves a $9.17\times$ improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.
arXiv Detail & Related papers (2023-12-30T00:06:28Z)
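One standard route from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ is kernelized linear attention, sketched below as a generic stand-in (not FlashVideo's actual mechanism): reordering the matrix products keeps a small $(d, d)$ state instead of materializing the $(L, L)$ score matrix. The ReLU feature map is an assumption, and softmax normalization is omitted to keep the contrast minimal.

```python
import numpy as np

def quadratic_mix(q, k, v):
    return (q @ k.T) @ v              # materializes an (L, L) score matrix

def linear_mix(q, k, v):
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map
    return phi(q) @ (phi(k).T @ v)    # only a (d, d) state; cost grows as O(L)

rng = np.random.default_rng(0)
L, d = 1024, 32
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
print(quadratic_mix(q, k, v).shape, linear_mix(q, k, v).shape)  # both (L, d)
```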
- PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports high-resolution image synthesis up to 1024px with low training cost.
Tests demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra-long sequences by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive or better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
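A sketch of the compression step under simplifying assumptions: score tokens, keep the top-k "important" ones exactly, and pool the remainder into a few coarse slots, so each layer runs on a much shorter sequence. Scoring by L2 norm is a placeholder; Vcc's actual importance criterion is different.

```python
import numpy as np

def compress(tokens, k_vip=32, n_slots=32):
    """Shrink (L, d) tokens to (k_vip + n_slots, d) for one layer."""
    scores = np.linalg.norm(tokens, axis=1)          # placeholder importance
    vip = np.argsort(scores)[-k_vip:]                # keep important tokens
    rest = np.delete(tokens, vip, axis=0)
    pooled = np.stack([s.mean(axis=0) for s in np.array_split(rest, n_slots)])
    return np.concatenate([tokens[vip], pooled])

x = np.random.default_rng(0).standard_normal((16384, 64))  # 16K tokens
print(compress(x).shape)                                   # (64, 64)
```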
- Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z)
- VidTr: Video Transformer Without Convolutions [32.710988574799735]
We introduce Video Transformer (VidTr) with separable attention for spatio-temporal video classification.
VidTr is able to aggregate spatio-temporal information via stacked attentions and provides better performance with higher efficiency.
arXiv Detail & Related papers (2021-04-23T17:59:01Z)
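A compact sketch of the separable-attention idea: attention runs over space within each frame, then over time at each spatial location, rather than one joint pass over all $T \times H \times W$ tokens. The single-head form and shapes are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    """Self-attention along the second-to-last axis of x."""
    a = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1]))
    return a @ x

def separable_attention(video):
    """video: (T, S, d) with S = H*W spatial tokens per frame."""
    spatial = attend(video)                       # mixes within each frame
    temporal = attend(spatial.swapaxes(0, 1))     # mixes across frames
    return temporal.swapaxes(0, 1)                # back to (T, S, d)

x = np.random.default_rng(0).standard_normal((8, 49, 16))
print(separable_attention(x).shape)  # (8, 49, 16); cost T*S^2 + S*T^2, not (T*S)^2
```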
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above (including all listed papers) and is not responsible for any consequences of its use.