STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows
- URL: http://arxiv.org/abs/2511.20462v2
- Date: Wed, 26 Nov 2025 03:06:18 GMT
- Title: STARFlow-V: End-to-End Video Generative Modeling with Normalizing Flows
- Authors: Jiatao Gu, Ying Shen, Tianrong Chen, Laurent Dinh, Yuyang Wang, Miguel Angel Bautista, David Berthelot, Josh Susskind, Shuangfei Zhai
- Abstract summary: STARFlow-V is a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation.
- Score: 35.05757953878183
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Normalizing flows (NFs) are end-to-end likelihood-based generative models for continuous data, and have recently regained attention with encouraging progress on image generation. Yet in the video generation domain, where spatiotemporal complexity and computational cost are substantially higher, state-of-the-art systems almost exclusively rely on diffusion-based models. In this work, we revisit this design space by presenting STARFlow-V, a normalizing flow-based video generator with substantial benefits such as end-to-end learning, robust causal prediction, and native likelihood estimation. Building upon the recently proposed STARFlow, STARFlow-V operates in a spatiotemporal latent space with a global-local architecture that restricts causal dependencies to a global latent space while preserving rich local within-frame interactions. This eases error accumulation over time, a common pitfall of standard autoregressive generation with diffusion models. Additionally, we propose flow-score matching, which equips the model with a lightweight causal denoiser to improve video generation consistency in an autoregressive fashion. To improve sampling efficiency, STARFlow-V employs a video-aware Jacobi iteration scheme that recasts inner updates as parallelizable iterations without breaking causality. Thanks to its invertible structure, the same model natively supports text-to-video, image-to-video, and video-to-video generation. Empirically, STARFlow-V achieves strong visual fidelity and temporal consistency with practical sampling throughput relative to diffusion-based baselines. These results present the first evidence, to our knowledge, that NFs are capable of high-quality autoregressive video generation, establishing them as a promising research direction for building world models. Code and generated samples are available at https://github.com/apple/ml-starflow.
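The Jacobi-iteration idea admits a compact illustration. Below is a minimal sketch, assuming a causal affine autoregressive flow of the form x_t = mu(x_{<t}) + sigma(x_{<t}) * z_t; the names `jacobi_sample` and `toy_mu_sigma` and the toy parameterization are ours, not the paper's code, and the video-aware scheduling is not reproduced. Because position t depends only on earlier positions, updating all positions in parallel is exact after at most T sweeps, and typically converges far sooner.

```python
import torch

def jacobi_sample(mu_sigma, z, n_iters=None, tol=1e-5):
    """Invert x_t = mu(x_{<t}) + sigma(x_{<t}) * z_t with Jacobi fixed-point
    sweeps: all T positions are updated in parallel each iteration, instead
    of decoding them one at a time. Causality guarantees exactness after T
    sweeps."""
    T = z.size(0)
    n_iters = n_iters or T
    x = z.clone()                       # any initialization is valid
    for _ in range(n_iters):
        mu, sigma = mu_sigma(x)         # causal statistics for all positions at once
        x_new = mu + sigma * z          # one parallel sweep of the fixed-point map
        if torch.allclose(x_new, x, atol=tol):
            break                       # iterate stabilized early
        x = x_new
    return x

# Toy causal parameterization: position t conditions on the mean of x_{<t}.
def toy_mu_sigma(x):
    T = x.size(0)
    prev_sum = torch.cumsum(x, dim=0).roll(1, dims=0)
    prev_sum[0] = 0.0                   # exclusive prefix sum keeps the map strictly causal
    count = torch.arange(T, dtype=x.dtype).clamp(min=1).unsqueeze(1)
    return 0.5 * prev_sum / count, torch.ones_like(x)

x = jacobi_sample(toy_mu_sigma, torch.randn(32, 8))  # matches sequential decoding
```

The same loop recovers strictly sequential decoding as a special case (one position stabilizes per sweep in the worst case), which is why parallel iteration cannot break causality.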
Related papers
- Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named Local Diffusion Forcing for Video Frame Interpolation (LDF-VFI). Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence. LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks.
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
- Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation [41.993197533574126]
Inferix is an inference engine that enables immersive world synthesis through optimized semi-autoregressive decoding. It also provides interactive video streaming and profiling, enabling real-time interaction and realistic simulation.
arXiv Detail & Related papers (2025-11-25T01:45:04Z)
- Uniform Discrete Diffusion with Metric Path for Video Generation [103.86033350602908]
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-duration inconsistency. We present URSA, a uniform discrete diffusion framework with a metric path that bridges the gap to continuous approaches for scalable video generation. URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
arXiv Detail & Related papers (2025-10-28T17:59:57Z)
- ARSS: Taming Decoder-only Autoregressive Visual Generation for View Synthesis From Single View [11.346049532150127]
ARSS is a framework that generates novel views from a single image conditioned on a camera trajectory. Our method performs comparably to, or better than, state-of-the-art view synthesis approaches based on diffusion models.
arXiv Detail & Related papers (2025-09-27T00:03:09Z)
- Taming generative video models for zero-shot optical flow extraction [28.176290134216995]
Self-supervised video models trained only for future frame prediction can be prompted, without fine-tuning, to output optical flow. Inspired by the Counterfactual World Model (CWM) paradigm, we extend this idea to generative video models. KL-tracing is a novel test-time procedure that injects a localized perturbation into the first frame, rolls out the model one step, and computes the Kullback-Leibler divergence between the perturbed and unperturbed predictive distributions (a minimal sketch follows this entry).
arXiv Detail & Related papers (2025-07-11T23:59:38Z)
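The KL-tracing procedure is concrete enough to sketch. The following is a hypothetical illustration; the `model(frames)` interface, the `kl_trace` name, and the discretized per-pixel predictive distribution are our assumptions, not the paper's API:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def kl_trace(model, video, y, x, patch=2, eps=0.1):
    """Hypothetical KL-tracing sketch. `model(frames)` is assumed to return
    next-frame predictive logits of shape (H, W, K) over K discretized
    intensity bins. video: (T, H, W); (y, x): query pixel in frame 0."""
    logits_clean = model(video)                  # roll out one step, unperturbed

    perturbed = video.clone()
    perturbed[0, y:y+patch, x:x+patch] += eps    # localized perturbation in frame 0
    logits_pert = model(perturbed)               # roll out one step, perturbed

    log_p = F.log_softmax(logits_pert, dim=-1)
    log_q = F.log_softmax(logits_clean, dim=-1)
    kl_map = (log_p.exp() * (log_p - log_q)).sum(-1)   # per-pixel KL(perturbed || clean)

    ty, tx = divmod(kl_map.argmax().item(), kl_map.size(1))
    return ty - y, tx - x                        # displacement = flow estimate at (y, x)
```

The flow readout is simply the location where the perturbation re-emerges in the model's next-frame belief, which is why no fine-tuning is needed.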
- Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion [67.94300151774085]
We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must, at test time, generate sequences conditioned on their own imperfect outputs (a minimal sketch of the underlying idea follows this entry).
arXiv Detail & Related papers (2025-06-09T17:59:55Z)
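To make the exposure-bias fix concrete, here is a sketch of the generic idea only: condition training-time predictions on the model's own rollouts so the context distribution matches test time. All names are hypothetical, and the paper's actual method (few-step video diffusion trained with a distribution-level objective) is not reproduced here.

```python
import torch

def self_forcing_step(model, loss_fn, video, ctx_len=4):
    """Sketch of the exposure-bias fix: each frame is predicted from the
    model's OWN previous generations rather than ground-truth frames.
    `model(context) -> next frame` is an assumed interface."""
    context = list(video[:ctx_len])               # seed with a few real frames
    loss = video.new_zeros(())
    for t in range(ctx_len, video.size(0)):
        pred = model(torch.stack(context))        # condition on self-generated history
        loss = loss + loss_fn(pred, video[t])     # supervise against the real frame
        context = context[1:] + [pred.detach()]   # slide window with model output;
                                                  # a full method may backprop through this
    return loss
```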
- STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis [44.2114053357308]
We present a scalable generative model based on normalizing flows that achieves strong performance in high-resolution image synthesis. The core of STARFlow is the Transformer Autoregressive Flow (TARFlow), which combines the expressive power of normalizing flows with the structured modeling capabilities of autoregressive Transformers (a single-block sketch follows this entry).
arXiv Detail & Related papers (2025-06-06T17:58:39Z)
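Since STARFlow-V builds on TARFlow, a minimal single-block sketch helps fix ideas: a causally masked Transformer predicts per-position affine parameters, yielding an invertible map with an exact log-determinant. This is our illustrative reconstruction of the generic Transformer-autoregressive-flow pattern, not the STARFlow codebase.

```python
import torch
import torch.nn as nn

class CausalAffineFlow(nn.Module):
    """One TARFlow-style block: a causal Transformer predicts per-position
    affine parameters (mu, log_sigma), giving an invertible map with an exact
    log-determinant. STARFlow stacks many such blocks in a latent space."""
    def __init__(self, dim, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, 2 * dim)     # predicts (mu, log_sigma) per position

    def forward(self, x):                       # x: (B, T, D) -> (z, log|det J|)
        T = x.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(T, device=x.device)
        h = self.backbone(x, mask=mask)
        h = torch.cat([torch.zeros_like(h[:, :1]), h[:, :-1]], dim=1)  # shift: t sees x_{<t}
        mu, log_sigma = self.head(h).chunk(2, dim=-1)
        z = (x - mu) * torch.exp(-log_sigma)    # whiten each position given its causal context
        logdet = -log_sigma.sum(dim=(1, 2))     # exact change-of-variables term
        return z, logdet
```

Training maximizes the exact likelihood log p(x) = Σ_t log N(z_t; 0, I) + log|det J|; sampling inverts the map, either sequentially or with parallel Jacobi sweeps like those sketched earlier.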
- Temporal Regularization Makes Your Video Generator Stronger [34.33572297364156]
Temporal quality is a critical aspect of video generation, as it ensures consistent motion and realistic dynamics across frames. We introduce temporal augmentation to video generation for the first time and propose FluxFlow as an initial investigation. Experiments on the UCF-101 and VBench benchmarks demonstrate that FluxFlow significantly improves temporal coherence and diversity across various video generation models.
arXiv Detail & Related papers (2025-03-19T16:59:32Z)
- Improving Video Generation with Human Feedback [105.81833319891537]
We develop a systematic pipeline that harnesses human feedback to mitigate video generation problems. We introduce VideoReward, a multi-dimensional video reward model, and examine how annotations and various design choices impact its rewarding efficacy.
arXiv Detail & Related papers (2025-01-23T18:55:41Z)
- Jet: A Modern Transformer-Based Normalizing Flow [62.2573739835562]
We revisit the design of coupling-based normalizing flow models by carefully ablating prior design choices. We achieve state-of-the-art quantitative and qualitative performance with a much simpler architecture (a minimal coupling-layer sketch follows this entry).
arXiv Detail & Related papers (2024-12-19T18:09:42Z)
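The coupling layer Jet revisits has a standard minimal form, sketched below. This is a generic RealNVP-style block, our illustration rather than Jet's code; Jet parameterizes the coupling with Transformer blocks, and a small MLP stands in here:

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """Generic affine coupling: half the dimensions condition an affine map
    of the other half, so the Jacobian determinant and the inverse both come
    in a single forward pass."""
    def __init__(self, dim):                    # dim assumed even
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim // 2, 256), nn.GELU(),
                                 nn.Linear(256, dim))   # -> (shift, log_scale)

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=-1)
        shift, log_scale = self.net(x1).chunk(2, dim=-1)
        y2 = x2 * torch.exp(log_scale) + shift  # transform x2 conditioned on x1
        return torch.cat([x1, y2], dim=-1), log_scale.sum(-1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=-1)
        shift, log_scale = self.net(y1).chunk(2, dim=-1)
        x2 = (y2 - shift) * torch.exp(-log_scale)   # exact inverse, no iteration needed
        return torch.cat([y1, x2], dim=-1)
```

Stacking such blocks with permutations between them yields the full flow; Jet's finding is that careful ablation of exactly these design choices, paired with a Transformer backbone, suffices for state-of-the-art results.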
- Autoregressive Video Generation without Vector Quantization [90.87907377618747]
We reformulate video generation as non-quantized autoregressive modeling with temporal frame-by-frame prediction. With this approach, we train a novel video autoregressive model without vector quantization, termed NOVA. Our results demonstrate that NOVA surpasses prior autoregressive video models in data efficiency, inference speed, visual fidelity, and video fluency, even with a much smaller model capacity.
arXiv Detail & Related papers (2024-12-18T18:59:53Z)