Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10$\times$
- URL: http://arxiv.org/abs/2512.13492v1
- Date: Mon, 15 Dec 2025 16:25:39 GMT
- Title: Transform Trained Transformer: Accelerating Native 4K Video Generation Over 10$\times$
- Authors: Jiangning Zhang, Junwei Zhu, Teng Hu, Yabiao Wang, Donghao Luo, Weijian Cao, Zhenye Gan, Xiaobin Hu, Zhucun Xue, Chengjie Wang
- Abstract summary: Native 4K video generation remains a critical challenge due to the quadratic computational explosion of full-attention as resolution increases. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3-Video}$ that significantly reduces compute requirements by optimizing the forward logic of pretrained full-attention models. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches.
- Score: 91.61519033897424
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Native 4K (2160$\times$3840) video generation remains a critical challenge due to the quadratic computational explosion of full-attention as spatiotemporal resolution increases, making it difficult for models to strike a balance between efficiency and quality. This paper proposes a novel Transformer retrofit strategy termed $\textbf{T3}$ ($\textbf{T}$ransform $\textbf{T}$rained $\textbf{T}$ransformer) that, without altering the core architecture of full-attention pretrained models, significantly reduces compute requirements by optimizing their forward logic. Specifically, $\textbf{T3-Video}$ introduces a multi-scale weight-sharing window attention mechanism and, via hierarchical blocking together with an axis-preserving full-attention design, can effect an "attention pattern" transformation of a pretrained model using only modest compute and data. Results on 4K-VBench show that $\textbf{T3-Video}$ substantially outperforms existing approaches: while delivering performance improvements (+4.29$\uparrow$ VQA and +0.08$\uparrow$ VTC), it accelerates native 4K video generation by more than 10$\times$. Project page at https://zhangzjn.github.io/projects/T3-Video
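For intuition, here is a minimal NumPy sketch of the window-attention pattern the abstract gestures at: full attention runs only inside non-overlapping local windows, and one set of pretrained projection weights is reused at several window scales. The window sizes, the averaging over scales, and all shapes are illustrative assumptions, not the paper's actual design.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def window_attention(tokens, wq, wk, wv, window):
    """Full attention inside non-overlapping windows of `window` tokens.

    tokens: (L, d); wq/wk/wv: the same (d, d) projections are reused at
    every scale, so no new parameters are introduced per window size.
    """
    L, d = tokens.shape
    out = np.empty_like(tokens)
    for start in range(0, L, window):
        x = tokens[start:start + window]          # one local window
        q, k, v = x @ wq, x @ wk, x @ wv
        attn = softmax(q @ k.T / np.sqrt(d))      # O(window^2), not O(L^2)
        out[start:start + window] = attn @ v
    return out

rng = np.random.default_rng(0)
L, d = 64, 16
x = rng.standard_normal((L, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
# Hypothetical multi-scale step: the same weights applied at several window
# scales, with outputs averaged across scales.
y = np.mean([window_attention(x, wq, wk, wv, w) for w in (8, 16, 32)], axis=0)
print(y.shape)  # (64, 16)
```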
Related papers
- CubeComposer: Spatio-Temporal Autoregressive 4K 360° Video Generation from Perspective Video [86.80231588752957]
We introduce a novel cube-temporal autoregressive diffusion model that generates 4K-resolution 360° videos. By decomposing videos into cubemap representations with six faces, CubeComposer autoregressively synthesizes content in a well-planned order. Experiments on benchmark datasets demonstrate that CubeComposer outperforms state-of-the-art methods in native resolution and visual quality.
arXiv Detail & Related papers (2026-03-04T17:06:56Z)
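A toy sketch of the autoregressive cubemap idea: the 360° frame splits into six named faces generated one at a time, each conditioned on the faces produced so far. The face names and the generation order below are hypothetical, not CubeComposer's actual schedule.

```python
# Face names and generation order are assumptions for illustration.
CUBE_FACES = ["front", "right", "back", "left", "top", "bottom"]

def autoregressive_faces(generate_face):
    """Synthesize the six cubemap faces one by one; each call sees the
    faces generated so far as conditioning context."""
    context = {}
    for face in CUBE_FACES:
        context[face] = generate_face(face, dict(context))
    return context

# Toy generator: each "face" just records which faces it was conditioned on.
result = autoregressive_faces(lambda face, ctx: sorted(ctx))
print(result["bottom"])  # ['back', 'front', 'left', 'right', 'top']
```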
- H$_{2}$OT: Hierarchical Hourglass Tokenizer for Efficient Video Pose Transformers [124.11648300910444]
We present a hierarchical plug-and-play pruning-and-recovering framework, called Hierarchical Hourglass Tokenizer (H$_{2}$OT). Our method is general-purpose: it can be easily incorporated into common VPT models on both seq2seq and seq2frame pipelines.
arXiv Detail & Related papers (2025-09-08T17:59:59Z)
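A minimal sketch of the prune-then-recover pattern such a tokenizer relies on. The uniform-stride pruning and copy-forward recovery here are placeholder assumptions; the paper's actual selection and recovery rules differ.

```python
import numpy as np

def prune(tokens, keep_every=4):
    """Keep one frame token out of every `keep_every` for the middle layers."""
    idx = np.arange(0, len(tokens), keep_every)
    return tokens[idx], idx

def recover(kept, idx, length):
    """Restore the full temporal length by copying each kept token forward."""
    full_idx = np.searchsorted(idx, np.arange(length), side="right") - 1
    return kept[full_idx]

x = np.random.default_rng(1).standard_normal((64, 8))  # 64 frames, dim 8
kept, idx = prune(x)            # only 16 tokens pass through the deep layers
y = recover(kept, idx, len(x))  # back to 64 tokens for per-frame outputs
print(kept.shape, y.shape)      # (16, 8) (64, 8)
```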
- PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation [18.2095668161519]
Pusa is a groundbreaking paradigm that enables fine-grained temporal control within a unified video diffusion framework. We set a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32%. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis.
arXiv Detail & Related papers (2025-07-22T00:09:37Z)
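The "vectorized timestep" idea can be sketched in a few lines: each frame carries its own diffusion timestep rather than sharing one scalar, so a conditioning frame can stay clean while the rest are noised, which is what enables I2V-style control. The linear interpolation noise schedule below is an assumption for illustration, not the paper's exact formulation.

```python
import numpy as np

def noise_frames(frames, timesteps, rng):
    """frames: (F, ...) clean video; timesteps: (F,) in [0, 1], one per frame."""
    t = timesteps.reshape(-1, *([1] * (frames.ndim - 1)))
    eps = rng.standard_normal(frames.shape)
    return (1.0 - t) * frames + t * eps  # t=0: clean, t=1: pure noise

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 4, 4, 3))  # 8 frames of a toy 4x4 RGB video
t = np.array([0.0] + [0.8] * 7)            # condition on frame 0, noise the rest
noisy = noise_frames(video, t, rng)
print(np.allclose(noisy[0], video[0]))     # True: frame 0 is untouched
```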
- Astraea: A Token-wise Acceleration Framework for Video Diffusion Transformers [29.130090574300635]
Video diffusion transformers (vDiTs) have made tremendous progress in text-to-video generation, but their compute demands pose a major challenge for practical deployment. We introduce Astraea, a framework that searches for near-optimal configurations for vDiT-based video generation under a performance target.
arXiv Detail & Related papers (2025-06-05T14:41:38Z)
- Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity [59.80405282381126]
Diffusion Transformers (DiTs) dominate video generation, but their high computational cost severely limits real-world applicability. We propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality.
arXiv Detail & Related papers (2025-02-03T19:29:16Z)
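A rough sketch of the two sparsity patterns such training-free methods exploit in 3D full attention: a "spatial" mask keeping attention within each frame, and a "temporal" mask keeping attention on the same spatial location across frames. The frame-major token layout is a simplifying assumption.

```python
import numpy as np

def spatial_mask(frames, hw):
    """Each token attends only to tokens in its own frame."""
    frame_id = np.repeat(np.arange(frames), hw)
    return frame_id[:, None] == frame_id[None, :]

def temporal_mask(frames, hw):
    """Each token attends only to its own spatial location in every frame."""
    pos_id = np.tile(np.arange(hw), frames)
    return pos_id[:, None] == pos_id[None, :]

m_s, m_t = spatial_mask(4, 6), temporal_mask(4, 6)  # 4 frames, 6 tokens each
print(m_s.mean(), m_t.mean())  # densities 0.25 and ~0.167, vs 1.0 for dense
```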
- Video Prediction Transformers without Recurrence or Convolution [65.93130697098658]
We propose PredFormer, a framework entirely based on Gated Transformers. We provide a comprehensive analysis of 3D attention in the context of video prediction. The significant improvements in both accuracy and efficiency highlight the potential of PredFormer.
arXiv Detail & Related papers (2024-10-07T03:52:06Z)
- FlashVideo: A Framework for Swift Inference in Text-to-Video Generation [9.665089218030086]
This paper introduces FlashVideo, a novel framework tailored for swift Text-to-Video generation.
FlashVideo reduces the time complexity of inference from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ for a sequence of length $L$, significantly accelerating inference speed.
Our comprehensive experiments demonstrate that FlashVideo achieves a $9.17\times$ improvement over a traditional autoregressive-based transformer model, and its inference speed is of the same order of magnitude as that of BERT-based transformer models.
arXiv Detail & Related papers (2023-12-30T00:06:28Z)
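One standard route from $\mathcal{O}(L^2)$ to $\mathcal{O}(L)$ is kernelized linear attention, sketched below as a generic stand-in (not FlashVideo's actual mechanism): reordering the matrix products keeps a small $(d, d)$ state instead of materializing the $(L, L)$ score matrix. The ReLU feature map is an assumption, and softmax normalization is omitted to keep the contrast minimal.

```python
import numpy as np

def quadratic_mix(q, k, v):
    return (q @ k.T) @ v              # materializes an (L, L) score matrix

def linear_mix(q, k, v):
    phi = lambda x: np.maximum(x, 0.0) + 1e-6   # simple positive feature map
    return phi(q) @ (phi(k).T @ v)    # only a (d, d) state; cost grows as O(L)

rng = np.random.default_rng(0)
L, d = 1024, 32
q, k, v = (rng.standard_normal((L, d)) for _ in range(3))
print(quadratic_mix(q, k, v).shape, linear_mix(q, k, v).shape)  # both (L, d)
```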
- PixArt-$\alpha$: Fast Training of Diffusion Transformer for Photorealistic Text-to-Image Synthesis [108.83343447275206]
This paper introduces PIXART-$\alpha$, a Transformer-based T2I diffusion model whose image generation quality is competitive with state-of-the-art image generators.
It supports high-resolution image synthesis up to 1024px with low training cost.
Tests demonstrate that PIXART-$\alpha$ excels in image quality, artistry, and semantic control.
arXiv Detail & Related papers (2023-09-30T16:18:00Z)
- Vcc: Scaling Transformers to 128K Tokens or More by Prioritizing Important Tokens [65.4435926060951]
We propose to significantly improve the efficiency of Transformers for ultra-long sequences by compressing the sequence into a much smaller representation at each layer.
Our algorithm is not only efficient (achieving more than $3\times$ efficiency gain compared to baselines on 4K and 16K lengths) but also offers competitive or better performance on a large number of tasks.
arXiv Detail & Related papers (2023-05-07T10:32:18Z)
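A sketch of the compression step under simplifying assumptions: score tokens, keep the top-k "important" ones exactly, and pool the remainder into a few coarse slots, so each layer runs on a much shorter sequence. Scoring by L2 norm is a placeholder; Vcc's actual importance criterion is different.

```python
import numpy as np

def compress(tokens, k_vip=32, n_slots=32):
    """Shrink (L, d) tokens to (k_vip + n_slots, d) for one layer."""
    scores = np.linalg.norm(tokens, axis=1)          # placeholder importance
    vip = np.argsort(scores)[-k_vip:]                # keep important tokens
    rest = np.delete(tokens, vip, axis=0)
    pooled = np.stack([s.mean(axis=0) for s in np.array_split(rest, n_slots)])
    return np.concatenate([tokens[vip], pooled])

x = np.random.default_rng(0).standard_normal((16384, 64))  # 16K tokens
print(compress(x).shape)                                   # (64, 64)
```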
- Primer: Searching for Efficient Transformers for Language Modeling [79.2677566332444]
Training and inference costs of large Transformer models have grown rapidly and become expensive.
Here we aim to reduce the costs of Transformers by searching for a more efficient variant.
We identify an architecture, named Primer, that has a smaller training cost than the original Transformer.
arXiv Detail & Related papers (2021-09-17T17:50:39Z)
- VidTr: Video Transformer Without Convolutions [32.710988574799735]
We introduce Video Transformer (VidTr) with separable attention for spatio-temporal video classification.
VidTr is able to aggregate spatio-temporal information via stacked attentions and provides better performance with higher efficiency.
arXiv Detail & Related papers (2021-04-23T17:59:01Z)
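A compact sketch of the separable-attention idea: attention runs over space within each frame, then over time at each spatial location, rather than one joint pass over all $T \times H \times W$ tokens. The single-head form and shapes are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attend(x):
    """Self-attention along the second-to-last axis of x."""
    a = softmax(x @ x.swapaxes(-1, -2) / np.sqrt(x.shape[-1]))
    return a @ x

def separable_attention(video):
    """video: (T, S, d) with S = H*W spatial tokens per frame."""
    spatial = attend(video)                       # mixes within each frame
    temporal = attend(spatial.swapaxes(0, 1))     # mixes across frames
    return temporal.swapaxes(0, 1)                # back to (T, S, d)

x = np.random.default_rng(0).standard_normal((8, 49, 16))
print(separable_attention(x).shape)  # (8, 49, 16); cost T*S^2 + S*T^2, not (T*S)^2
```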
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information above (including all listed papers) and is not responsible for any consequences of its use.