Helios: Real Real-Time Long Video Generation Model
- URL: http://arxiv.org/abs/2603.04379v1
- Date: Wed, 04 Mar 2026 18:45:21 GMT
- Title: Helios: Real Real-Time Long Video Generation Model
- Authors: Shenghai Yuan, Yuanyang Yin, Zongjian Li, Xinwei Huang, Xiao Yang, Li Yuan,
- Abstract summary: Helios is a 14B autoregressive diffusion model with a unified input representation that supports T2V, I2V, and V2V tasks. Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
- Score: 33.34372252025333
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We introduce Helios, the first 14B video generation model that runs at 19.5 FPS on a single NVIDIA H100 GPU and supports minute-scale generation while matching the quality of a strong baseline. We make breakthroughs along three key dimensions: (1) robustness to long-video drifting without commonly used anti-drifting heuristics such as self-forcing, error-banks, or keyframe sampling; (2) real-time generation without standard acceleration techniques such as KV-cache, sparse/linear attention, or quantization; and (3) training without parallelism or sharding frameworks, enabling image-diffusion-scale batch sizes while fitting up to four 14B models within 80 GB of GPU memory. Specifically, Helios is a 14B autoregressive diffusion model with a unified input representation that natively supports T2V, I2V, and V2V tasks. To mitigate drifting in long-video generation, we characterize typical failure modes and propose simple yet effective training strategies that explicitly simulate drifting during training, while eliminating repetitive motion at its source. For efficiency, we heavily compress the historical and noisy context and reduce the number of sampling steps, yielding computational costs comparable to -- or lower than -- those of 1.3B video generative models. Moreover, we introduce infrastructure-level optimizations that accelerate both inference and training while reducing memory consumption. Extensive experiments demonstrate that Helios consistently outperforms prior methods on both short- and long-video generation. We plan to release the code, base model, and distilled model to support further development by the community.
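The abstract describes the anti-drifting training strategy only at a high level. Below is a minimal PyTorch sketch of one plausible reading of "explicitly simulate drifting during training": clean history latents are corrupted before conditioning, so training-time context resembles the imperfect context the model produces at inference. Every name here (`corrupt_history`, `denoiser`, the flow-matching loss) is a hypothetical illustration, not the paper's actual code.
```python
import torch
import torch.nn.functional as F

def corrupt_history(history: torch.Tensor, max_noise: float = 0.3) -> torch.Tensor:
    """Blend clean past-chunk latents with noise so that training-time
    conditioning resembles the drifted context seen at inference."""
    s = torch.rand(history.shape[0], 1, 1, 1, 1, device=history.device) * max_noise
    return (1.0 - s) * history + s * torch.randn_like(history)

def training_step(denoiser, latents: torch.Tensor, chunk: int = 4):
    """latents: (batch, frames, channels, h, w); assumes frames >= 2 * chunk."""
    b, f = latents.shape[:2]
    split = int(torch.randint(chunk, f - chunk + 1, (1,)))
    history = corrupt_history(latents[:, :split])          # simulated drift
    target = latents[:, split:split + chunk]               # next chunk to denoise
    t = torch.rand(b, 1, 1, 1, 1, device=latents.device)   # per-sample time
    noise = torch.randn_like(target)
    noisy = (1.0 - t) * target + t * noise                 # flow-matching mix
    pred = denoiser(noisy, history, t)                     # velocity prediction
    return F.mse_loss(pred, noise - target)
```
Under this reading, the key design point is that the corruption strength is randomized per sample, so the model learns to denoise the next chunk under a range of context qualities rather than only under clean ground truth.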
Related papers
- DC-VideoGen: Efficient Video Generation with Deep Compression Video Autoencoder [55.26098043655325]
DC-VideoGen can be applied to any pre-trained video diffusion model. It can be adapted to a deep compression latent space with lightweight fine-tuning.
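The summary does not say what "lightweight fine-tuning" into a deep-compression latent space entails; a common recipe, sketched below as an assumption rather than DC-VideoGen's actual method, is to swap only the latent-facing projections of the pretrained backbone and freeze the rest.
```python
import torch.nn as nn

def adapt_to_deep_latent(backbone: nn.Module, hidden_dim: int, new_latent_ch: int):
    """Hypothetical adaptation: replace the projections that touch the latent
    space to match a deeper-compression VAE, then freeze everything else."""
    backbone.patch_embed = nn.Conv3d(new_latent_ch, hidden_dim, kernel_size=1)
    backbone.out_proj = nn.Conv3d(hidden_dim, new_latent_ch, kernel_size=1)
    for p in backbone.parameters():
        p.requires_grad = False                     # freeze the backbone
    for module in (backbone.patch_embed, backbone.out_proj):
        for p in module.parameters():
            p.requires_grad = True                  # train only the new layers
    return backbone
```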
arXiv Detail & Related papers (2025-09-29T17:59:31Z)
- SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer [116.17385614259574]
We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. Two core designs ensure efficient, effective, and long video generation. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models.
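The title's "block linear diffusion transformer" builds on linear attention, which drops the softmax so attention cost scales linearly with sequence length. A generic linear-attention kernel is sketched below; SANA-Video's actual block design may differ.
```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (batch, heads, seq, dim). A positive feature map replaces
    the softmax, so a single (dim x dim) summary covers the whole sequence."""
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("bhnd,bhne->bhde", k, v)           # (dim x dim) summary
    z = 1.0 / (torch.einsum("bhnd,bhd->bhn", q, k.sum(dim=2)) + eps)
    return torch.einsum("bhnd,bhde,bhn->bhne", q, kv, z)
```
For minute-length video the token count is large, which is where linear (rather than quadratic) scaling pays off.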
arXiv Detail & Related papers (2025-09-29T12:28:09Z)
- PUSA V1.0: Surpassing Wan-I2V with $500 Training Cost by Vectorized Timestep Adaptation [18.2095668161519]
Pusa is a groundbreaking paradigm that enables fine-grained temporal control within a unified video diffusion framework. We set a new standard for image-to-video (I2V) generation, achieving a VBench-I2V total score of 87.32%. This work establishes a scalable, efficient, and versatile paradigm for next-generation video synthesis.
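"Vectorized timestep adaptation" suggests each frame carries its own diffusion timestep instead of the whole clip sharing one scalar. The sketch below shows that idea in its simplest form; the shapes and mixing rule are assumptions, not Pusa's exact formulation.
```python
import torch

def noise_with_frame_timesteps(latents: torch.Tensor):
    """latents: (batch, frames, channels, h, w)."""
    b, f = latents.shape[:2]
    t = torch.rand(b, f, device=latents.device)       # one timestep per frame
    t = t.view(b, f, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy = (1.0 - t) * latents + t * noise           # frame-wise noising
    return noisy, noise, t
```
In principle, per-frame timesteps let clean conditioning frames (timestep near zero) coexist with noisy frames in one clip, which is how such a scheme can fold I2V into a single framework.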
arXiv Detail & Related papers (2025-07-22T00:09:37Z)
- VideoMAR: Autoregressive Video Generation with Continuous Tokens [33.906543515428424]
Masked autoregressive models have demonstrated promising image generation capability in continuous space. We propose VideoMAR, a decoder-only autoregressive image-to-video model with continuous tokens. VideoMAR surpasses the previous state-of-the-art (Cosmos I2V) while requiring significantly fewer parameters.
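A masked autoregressive step over continuous tokens can be illustrated as below: mask a random subset of tokens and regress them from the visible ones. This is a generic MAR-style sketch with a plain MSE loss; VideoMAR's actual objective and temporal ordering are more involved.
```python
import torch

def mar_training_step(model, tokens: torch.Tensor, mask_token: torch.Tensor,
                      mask_ratio: float = 0.7):
    """tokens: (batch, seq, dim) continuous latents; mask_token: (dim,)."""
    b, n, d = tokens.shape
    mask = torch.rand(b, n, device=tokens.device) < mask_ratio
    inputs = torch.where(mask.unsqueeze(-1), mask_token.view(1, 1, d), tokens)
    pred = model(inputs)                              # (batch, seq, dim)
    return ((pred - tokens) ** 2)[mask].mean()        # loss on masked tokens only
```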
arXiv Detail & Related papers (2025-06-17T04:08:18Z)
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must, at inference, generate sequences conditioned on their own imperfect outputs.
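The core idea is that the context for each chunk comes from the model's own rollout rather than ground truth. The simplified sketch below detaches the rollout for brevity; `model.denoising_loss` and `sampler` are hypothetical interfaces, not the paper's API.
```python
import torch

def self_forcing_step(model, sampler, clean_chunks):
    """clean_chunks: list of (batch, frames, c, h, w) ground-truth chunks."""
    context, loss = [], 0.0
    for target in clean_chunks:
        ctx = torch.cat(context, dim=1) if context else None
        loss = loss + model.denoising_loss(target, context=ctx)
        with torch.no_grad():                  # roll out the model's own chunk
            context.append(sampler(model, context=ctx))
    return loss / len(clean_chunks)
```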
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
- From Slow Bidirectional to Fast Autoregressive Video Diffusion Models [48.35054927704544]
Current video diffusion models achieve impressive generation quality but struggle in interactive applications due to bidirectional attention dependencies. We address this limitation by adapting a pretrained bidirectional diffusion transformer to an autoregressive transformer that generates frames on-the-fly. Our model achieves a total score of 84.27 on the VBench-Long benchmark, surpassing all previous video generation models.
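Adapting a bidirectional transformer for on-the-fly generation amounts to restricting attention so each frame's tokens see only the same or earlier frames. A generic block-causal mask is sketched below; the paper's full adaptation recipe is not covered here.
```python
import torch
import torch.nn.functional as F

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """Boolean (seq, seq) mask; True means attention is allowed."""
    frame_id = torch.arange(num_frames).repeat_interleave(tokens_per_frame)
    return frame_id.unsqueeze(1) >= frame_id.unsqueeze(0)

# Usage with PyTorch's built-in attention (q, k, v: batch, heads, seq, dim):
# mask = block_causal_mask(num_frames=16, tokens_per_frame=256)
# out = F.scaled_dot_product_attention(q, k, v, attn_mask=mask)
```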
arXiv Detail & Related papers (2024-12-10T18:59:50Z)
- REDUCIO! Generating 1K Video within 16 Seconds using Extremely Compressed Motion Latents [110.41795676048835]
One crucial obstacle for large-scale applications is the expensive training and inference cost. We argue that videos contain significantly more redundant information than images, allowing them to be encoded with very few motion latents. We design an image-conditioned VAE that projects videos into an extremely compressed latent space and decodes them based on content images.
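The summary's idea, that appearance can come from a content image so the video branch only needs a tiny motion latent, can be sketched as a toy module. The layout below is an assumption for illustration; a real decoder would upsample back to pixels, which the 1x1 placeholder here does not.
```python
import torch
import torch.nn as nn

class ImageConditionedVAE(nn.Module):
    """Toy layout: a content image carries appearance, so the video branch
    stores only a heavily compressed motion latent."""
    def __init__(self, motion_ch: int = 4):
        super().__init__()
        self.motion_enc = nn.Conv3d(3, motion_ch, kernel_size=(4, 8, 8),
                                    stride=(4, 8, 8))       # extreme compression
        self.content_enc = nn.Conv2d(3, 64, kernel_size=8, stride=8)
        self.decode = nn.Conv3d(motion_ch + 64, 3, kernel_size=1)  # placeholder decoder

    def forward(self, video, image):
        m = self.motion_enc(video)              # (b, motion_ch, T/4, H/8, W/8)
        c = self.content_enc(image)             # (b, 64, H/8, W/8)
        c = c.unsqueeze(2).expand(-1, -1, m.shape[2], -1, -1)
        return self.decode(torch.cat([m, c], dim=1))
```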
arXiv Detail & Related papers (2024-11-20T18:59:52Z)
- SimDA: Simple Diffusion Adapter for Efficient Video Generation [102.90154301044095]
We propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model, adapting it to video generation in a parameter-efficient way.
In addition to T2V generation in the wild, SimDA can also be used for one-shot video editing with only 2 minutes of tuning.
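Fine-tuning 24M out of 1.1B parameters is the classic adapter pattern: freeze the backbone and train small bottleneck modules initialized to the identity. The sketch below shows that pattern generically; SimDA's actual adapter modules (e.g., its temporal design) differ.
```python
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Bottleneck adapter; zero-init on the up-projection makes it start
    as an identity mapping, so pretrained behavior is preserved."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(F.gelu(self.down(x)))

def train_adapters_only(model: nn.Module):
    """Freeze everything except parameters living under adapter modules."""
    for name, p in model.named_parameters():
        p.requires_grad = "adapter" in name
```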
arXiv Detail & Related papers (2023-08-18T17:58:44Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.