LongLive: Real-time Interactive Long Video Generation
- URL: http://arxiv.org/abs/2509.22622v1
- Date: Fri, 26 Sep 2025 17:48:24 GMT
- Title: LongLive: Real-time Interactive Long Video Generation
- Authors: Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, Song Han, Yukang Chen
- Abstract summary: LongLive is a frame-level autoregressive framework for real-time and interactive long video generation. It sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench on both short and long videos.
- Score: 68.45945318075432
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present LongLive, a frame-level autoregressive (AR) framework for real-time and interactive long video generation. Long video generation presents challenges in both efficiency and quality. Diffusion and Diffusion-Forcing models can produce high-quality videos but suffer from low efficiency due to bidirectional attention. Causal-attention AR models support KV caching for faster inference, but often degrade in quality on long videos due to memory challenges during long-video training. In addition, beyond static prompt-based generation, interactive capabilities such as streaming prompt inputs are critical for dynamic content creation, enabling users to guide narratives in real time. This interactive requirement significantly increases complexity, especially in ensuring visual consistency and semantic coherence across prompt transitions. To address these challenges, LongLive adopts a causal, frame-level AR design that integrates a KV-recache mechanism, which refreshes cached states with new prompts for smooth, prompt-adherent switches; streaming long tuning, which enables long-video training and aligns training with inference (train-long-test-long); and short-window attention paired with a frame-level attention sink (frame sink for short), which preserves long-range consistency while enabling faster generation. With these key designs, LongLive fine-tunes a 1.3B-parameter short-clip model to minute-long generation in just 32 GPU-days. At inference, LongLive sustains 20.7 FPS on a single NVIDIA H100 and achieves strong performance on VBench on both short and long videos. LongLive supports up to 240-second videos on a single H100 GPU, and further supports INT8-quantized inference with only marginal quality loss.
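To make the cache-side mechanisms concrete, below is a minimal, illustrative sketch of short-window attention with a frame sink and of KV-recache at a prompt switch. It is not the authors' released implementation; the names (FrameKVCache, attend, recache, encode_kv) and the per-frame cache granularity are assumptions made for illustration.

```python
# Minimal sketch (not the released LongLive code) of two mechanisms named in
# the abstract: a frame sink + short-window KV cache, and KV-recache at a
# prompt switch. All class/function names here are hypothetical.
import torch


class FrameKVCache:
    """Keeps KV entries for the first `sink_frames` frames (the frame sink)
    plus a sliding window over the most recent `window_frames` frames."""

    def __init__(self, sink_frames: int = 1, window_frames: int = 8):
        self.sink_frames = sink_frames
        self.window_frames = window_frames
        self.keys, self.values = [], []  # one (tokens, dim) tensor per frame

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        self.keys.append(k)
        self.values.append(v)

    def view(self):
        """KV actually attended to: sink frames + recent window, so attention
        cost stays constant while the earliest frames remain visible."""
        if len(self.keys) <= self.sink_frames + self.window_frames:
            ks, vs = self.keys, self.values
        else:
            ks = self.keys[: self.sink_frames] + self.keys[-self.window_frames:]
            vs = self.values[: self.sink_frames] + self.values[-self.window_frames:]
        return torch.cat(ks), torch.cat(vs)


def attend(q: torch.Tensor, cache: FrameKVCache) -> torch.Tensor:
    """Causal attention of the current frame's queries over the cached view."""
    k, v = cache.view()
    scores = q @ k.T / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v


def recache(cache: FrameKVCache, frames, encode_kv) -> FrameKVCache:
    """KV-recache: at a prompt switch, re-encode the retained frames' KV
    under the new prompt (encode_kv closes over it), so generation follows
    the new prompt without discarding visual history."""
    new = FrameKVCache(cache.sink_frames, cache.window_frames)
    for f in frames:
        new.append(*encode_kv(f))  # encode_kv conditions on the *new* prompt
    return new
```

The sink keeps the earliest frame(s) permanently attendable, which is how a short attention window can still preserve long-range consistency at constant per-frame cost.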
Related papers
- LongCat-Video Technical Report [40.35352541782164]
LongCat-Video is a foundational video generation model with 13.6B parameters. It supports Text-to-Video, Image-to-Video, and Video-Continuation tasks with a single model. LongCat-Video generates 720p, 30fps videos within minutes by employing a coarse-to-fine generation strategy.
arXiv Detail & Related papers (2025-10-25T07:41:02Z)
- FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion [24.48220892418698]
FreeLong is a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows. FreeLong++ extends FreeLong into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale.
arXiv Detail & Related papers (2025-06-30T18:11:21Z)
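As a rough illustration of the spectral blending this entry describes (a sketch, not the FreeLong/FreeLong++ code; the feature shapes and the cutoff fraction are assumptions):

```python
# Rough sketch of SpectralBlend-style fusion along the temporal axis
# (not the FreeLong/FreeLong++ implementation); cutoff is an assumption.
import numpy as np


def spectral_blend(global_feat: np.ndarray, local_feat: np.ndarray,
                   cutoff: float = 0.25) -> np.ndarray:
    """global_feat, local_feat: (T, C) real-valued features over T frames.
    Takes the lowest `cutoff` fraction of temporal frequencies from the
    global branch (holistic semantics) and the rest from the local branch
    (short-window detail)."""
    T = global_feat.shape[0]
    g_spec = np.fft.rfft(global_feat, axis=0)  # temporal spectrum per channel
    l_spec = np.fft.rfft(local_feat, axis=0)
    k = max(1, int(cutoff * g_spec.shape[0]))
    fused = l_spec.copy()
    fused[:k] = g_spec[:k]  # low band <- global, high band <- local
    return np.fft.irfft(fused, n=T, axis=0)
```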
- Flash-VStream: Efficient Real-Time Understanding for Long Video Streams [64.25549822010372]
Flash-VStream is a video language model capable of processing extremely long videos and responding to user queries in real time. Compared to existing models, Flash-VStream achieves significant reductions in inference latency.
arXiv Detail & Related papers (2025-06-30T13:17:49Z)
- LiveVLM: Efficient Online Video Understanding via Streaming-Oriented KV Cache and Retrieval [13.891391928767195]
LiveVLM is a training-free framework specifically designed for streaming, online video understanding and real-time interaction. LiveVLM constructs a streaming-oriented KV cache to process video streams in real time, retain long-term video details, and eliminate redundant KVs. When a new question is posed, LiveVLM incorporates an online question-answering process that efficiently fetches both short-term and long-term visual information.
arXiv Detail & Related papers (2025-05-21T08:47:15Z)
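A toy sketch of what a streaming-oriented KV cache with redundant-KV elimination could look like (not LiveVLM's actual design; the pooled per-frame keys, similarity threshold, and eviction rule are all assumptions):

```python
# Toy sketch, not LiveVLM's implementation: a bounded streaming KV cache that
# skips near-duplicate frames; threshold and eviction policy are assumptions.
import torch


class StreamingKVCache:
    def __init__(self, max_frames: int = 256, sim_threshold: float = 0.95):
        self.max_frames = max_frames
        self.sim_threshold = sim_threshold
        self.keys, self.values = [], []  # one pooled (dim,) tensor per frame

    def push(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # Eliminate redundant KVs: drop frames whose pooled key is nearly
        # identical to the last retained one (e.g. a static shot).
        if self.keys and torch.cosine_similarity(k, self.keys[-1], dim=0) > self.sim_threshold:
            return
        self.keys.append(k)
        self.values.append(v)
        # Bound memory for arbitrarily long streams; keep frame 0 as a
        # long-term anchor and evict the oldest windowed frame after it.
        if len(self.keys) > self.max_frames:
            self.keys.pop(1)
            self.values.pop(1)
```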
- $\infty$-Video: A Training-Free Approach to Long Video Understanding via Continuous-Time Memory Consolidation [19.616624959353697]
$\infty$-Video can process arbitrarily long videos through a continuous-time long-term memory (LTM) consolidation mechanism. Our framework augments video Q-formers by allowing them to process video contexts efficiently and without requiring additional training.
arXiv Detail & Related papers (2025-01-31T12:45:46Z)
- VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling [43.485687038460895]
Long-context video modeling is critical for multimodal large language models (MLLMs). This paper addresses the issue from the aspects of model architecture, training data, training strategy, and evaluation benchmark. We build a powerful video MLLM named VideoChat-Flash, which shows leading performance on both mainstream long and short video benchmarks.
arXiv Detail & Related papers (2024-12-31T18:01:23Z)
- RAIN: Real-time Animation of Infinite Video Stream [52.97171098038888]
RAIN is a pipeline solution capable of animating infinite video streams in real time with low latency. RAIN generates video frames with much lower latency and at higher speed, while maintaining long-range attention over extended video streams. RAIN can animate characters in real time with much better quality, accuracy, and consistency than competitors.
arXiv Detail & Related papers (2024-12-27T07:13:15Z)
- FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z)
- StreamingT2V: Consistent, Dynamic, and Extendable Long Video Generation from Text [58.49820807662246]
We introduce StreamingT2V, an autoregressive approach for long video generation of 80, 240, 600, 1200, or more frames with smooth transitions. Our code will be available at: https://github.com/Picsart-AI-Research/StreamingT2V.
arXiv Detail & Related papers (2024-03-21T18:27:29Z)
- CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding [70.7882058229772]
This paper tackles the emerging and challenging problem of long video temporal grounding (VTG). Compared with short videos, long videos are in high demand but far less explored.
We propose CONE, an efficient COarse-to-fiNE alignment framework.
arXiv Detail & Related papers (2022-09-22T10:58:42Z)