Mode Seeking meets Mean Seeking for Fast Long Video Generation
- URL: http://arxiv.org/abs/2602.24289v1
- Date: Fri, 27 Feb 2026 18:59:02 GMT
- Title: Mode Seeking meets Mean Seeking for Fast Long Video Generation
- Authors: Shengqu Cai, Weili Nie, Chao Liu, Julius Berner, Lvmin Zhang, Nanye Ma, Hansheng Chen, Maneesh Agrawala, Leonidas Guibas, Gordon Wetzstein, Arash Vahdat
- Abstract summary: Scaling video generation from seconds to minutes faces a critical bottleneck. We propose a training paradigm where Mode Seeking meets Mean Seeking. Our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency.
- Score: 79.62764340469
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Scaling video generation from seconds to minutes faces a critical bottleneck: while short-video data is abundant and high-fidelity, coherent long-form data is scarce and limited to narrow domains. To address this, we propose a training paradigm where Mode Seeking meets Mean Seeking, decoupling local fidelity from long-term coherence over a unified representation via a Decoupled Diffusion Transformer. Our approach uses a global Flow Matching head trained via supervised learning on long videos to capture narrative structure, while simultaneously employing a local Distribution Matching head that aligns sliding windows to a frozen short-video teacher via a mode-seeking reverse-KL divergence. This strategy enables the synthesis of minute-scale videos: the model learns long-range coherence and motion from limited long videos via supervised flow matching, while inheriting local realism by aligning every sliding-window segment of the student to the frozen short-video teacher, yielding a fast few-step long-video generator. Evaluations show that our method effectively closes the fidelity-horizon gap by jointly improving local sharpness, motion, and long-range consistency. Project website: https://primecai.github.io/mmm/.
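A minimal sketch of this decoupled objective in PyTorch-style code may help; the head interfaces, the [B, T, C, H, W] tensor layout, the window size, the DMD-style surrogate for the reverse KL, and the weighting `lambda_dm` are all assumptions here, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(global_head, x_long, t, noise):
    """Mean-seeking term: supervised flow matching on scarce long videos.
    Linear path x_t = (1 - t) * x + t * noise; target velocity is noise - x."""
    x_t = (1 - t) * x_long + t * noise
    return F.mse_loss(global_head(x_t, t), noise - x_long)

def distribution_matching_loss(student_video, teacher_score, fake_score, window=16):
    """Mode-seeking term: align every sliding window of the student's output
    to a frozen short-video teacher, DMD-style (a surrogate for reverse KL)."""
    B, T = student_video.shape[:2]
    losses = []
    for s in range(0, T - window + 1, window // 2):   # overlapping windows
        seg = student_video[:, s:s + window]
        t = torch.rand(B, 1, 1, 1, 1, device=seg.device)
        noisy = (1 - t) * seg + t * torch.randn_like(seg)
        with torch.no_grad():
            # reverse-KL gradient direction: fake (student) score minus
            # real (teacher) score, evaluated on the noised window
            grad = fake_score(noisy, t) - teacher_score(noisy, t)
        losses.append(0.5 * F.mse_loss(seg, (seg - grad).detach()))
    return torch.stack(losses).mean()

# total objective (lambda_dm is a hypothetical weighting hyperparameter):
# loss = flow_matching_loss(...) + lambda_dm * distribution_matching_loss(...)
```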
Related papers
- MotionStream: Real-Time Video Generation with Interactive Motion Controls [60.403597895657505]
We present MotionStream, enabling sub-second latency with up to 29 FPS streaming generation on a single GPU. Our approach begins by augmenting a text-to-video model with motion control, which generates high-quality videos that adhere to the global text prompt and local motion guidance, but does not run inference on the fly. Our models achieve state-of-the-art results in motion following and video quality while being two orders of magnitude faster, uniquely enabling infinite-length streaming.
arXiv Detail & Related papers (2025-11-03T06:37:53Z) - FreeLong++: Training-Free Long Video Generation via Multi-band SpectralFusion [24.48220892418698]
FreeLong is a training-free framework designed to balance the frequency distribution of long video features during the denoising process. FreeLong achieves this by blending global low-frequency features, which capture holistic semantics across the full video, with local high-frequency features extracted from short temporal windows. FreeLong++ extends FreeLong into a multi-branch architecture with multiple attention branches, each operating at a distinct temporal scale.
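As a rough sketch of the frequency-blending idea (a single branch, as in the original FreeLong; FreeLong++ repeats this across multiple temporal scales), one can take low temporal frequencies from globally attended features and high frequencies from short-window features. The [B, T, C] layout and the cutoff ratio below are assumptions.

```python
import torch

def spectral_blend(global_feat, local_feat, cutoff_ratio=0.25):
    """Blend along the temporal axis: low frequencies from the global
    (full-video) branch, high frequencies from the local (short-window)
    branch. Tensors: [B, T, C]; the cutoff ratio is an assumption."""
    B, T, C = global_feat.shape
    g = torch.fft.rfft(global_feat, dim=1)        # temporal FFT
    l = torch.fft.rfft(local_feat, dim=1)
    k = g.shape[1]
    cut = max(1, int(k * cutoff_ratio))           # low/high frequency split
    mask = torch.zeros(k, device=g.device)
    mask[:cut] = 1.0                              # 1 = keep global low freqs
    mask = mask.view(1, k, 1)
    blended = g * mask + l * (1 - mask)
    return torch.fft.irfft(blended, n=T, dim=1)
```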
arXiv Detail & Related papers (2025-06-30T18:11:21Z) - LongDWM: Cross-Granularity Distillation for Building a Long-Term Driving World Model [22.92353994818742]
Driving world models simulate futures via video generation conditioned on the current state and actions. Recent studies utilize the Diffusion Transformer (DiT) as the backbone of driving world models to improve learning flexibility. We propose several solutions to build a simple yet effective long-term driving world model.
arXiv Detail & Related papers (2025-06-02T11:19:23Z) - Multi-Scale Contrastive Learning for Video Temporal Grounding [42.180296672043404]
Temporal grounding, which localizes video moments related to a natural language query, is a core problem of vision-language learning and video understanding. We propose a contrastive learning framework to capture salient semantics among video moments.
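The summary gives few specifics; as a generic illustration of contrastive moment-query alignment, an InfoNCE-style loss over paired moment and query embeddings might look as follows (the in-batch pairing scheme and temperature are assumptions, not the paper's exact formulation).

```python
import torch
import torch.nn.functional as F

def moment_query_infonce(moment_emb, query_emb, temperature=0.07):
    """Generic InfoNCE: the i-th query's positive is the i-th moment;
    all other moments in the batch serve as negatives."""
    m = F.normalize(moment_emb, dim=-1)           # [N, D]
    q = F.normalize(query_emb, dim=-1)            # [N, D]
    logits = q @ m.t() / temperature              # [N, N] similarity matrix
    targets = torch.arange(q.shape[0], device=q.device)
    return F.cross_entropy(logits, targets)
```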
arXiv Detail & Related papers (2024-12-10T03:34:56Z) - SlowFast-VGen: Slow-Fast Learning for Action-Driven Long Video Generation [153.46240555355408]
SlowFast-VGen is a novel dual-speed learning system for action-driven long video generation.
Our approach incorporates a conditional video diffusion model for the slow learning of world dynamics.
We propose a slow-fast learning loop algorithm that seamlessly integrates the inner fast learning loop into the outer slow learning loop.
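The summary does not spell out the update rules; the sketch below illustrates one plausible reading of such a nested loop, with a Reptile-style consolidation standing in for the paper's actual slow update and `diffusion_loss` as an assumed model API.

```python
import copy
import torch

def slow_fast_training(model, episodes, slow_opt, fast_steps=5, fast_lr=1e-4):
    """Hypothetical slow-fast loop: an inner (fast) loop adapts a temporary
    copy of the model to one episode; the outer (slow) loop consolidates the
    adapted weights back into the base model (Reptile-style stand-in)."""
    for episode in episodes:                      # outer, slow loop
        fast_model = copy.deepcopy(model)         # episodic fast learner
        fast_opt = torch.optim.SGD(fast_model.parameters(), lr=fast_lr)
        for chunk in episode:                     # inner, fast loop
            for _ in range(fast_steps):
                loss = fast_model.diffusion_loss(chunk)   # assumed model API
                fast_opt.zero_grad()
                loss.backward()
                fast_opt.step()
        slow_opt.zero_grad()
        consolidation = sum(                      # pull slow weights toward
            ((p - q.detach()) ** 2).mean()        # the fast-adapted weights
            for p, q in zip(model.parameters(), fast_model.parameters())
        )
        consolidation.backward()
        slow_opt.step()
```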
arXiv Detail & Related papers (2024-10-30T17:55:52Z) - FreeLong: Training-Free Long Video Generation with SpectralBlend Temporal Attention [57.651429116402554]
This paper investigates a straightforward and training-free approach to extend an existing short video diffusion model for consistent long video generation.
We find that directly applying the short video diffusion model to generate long videos can lead to severe video quality degradation.
Motivated by this, we propose a novel solution named FreeLong to balance the frequency distribution of long video features during the denoising process.
arXiv Detail & Related papers (2024-07-29T11:52:07Z) - Anchored Diffusion for Video Face Reenactment [17.343307538702238]
We introduce Anchored Diffusion, a novel method for synthesizing relatively long and seamless videos.
We train our model on video sequences with random non-uniform temporal spacing and incorporate temporal information via external guidance.
During inference, we leverage the transformer architecture to modify the diffusion process, generating a batch of non-uniform sequences anchored to a common frame.
arXiv Detail & Related papers (2024-07-21T13:14:17Z) - Revisiting Kernel Temporal Segmentation as an Adaptive Tokenizer for Long-form Video Understanding [57.917616284917756]
Real-world videos are often several minutes long with semantically consistent segments of variable length.
A common approach to process long videos is applying a short-form video model over uniformly sampled clips of fixed temporal length.
This approach neglects the underlying nature of long videos since fixed-length clips are often redundant or uninformative.
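As a contrast to fixed-length clips, a simplified linear-kernel variant of kernel temporal segmentation (KTS) can pick adaptive segment boundaries by minimizing within-segment feature scatter via dynamic programming; the original method generalizes this to arbitrary kernels, so the code below is a sketch rather than the paper's implementation.

```python
import numpy as np

def kts_segments(features, n_segments):
    """Simplified KTS with a linear kernel: choose change points that
    minimize within-segment scatter via dynamic programming.
    features: [T, D] frame embeddings; returns (start, end) per segment."""
    T = features.shape[0]
    csum = np.vstack([np.zeros(features.shape[1]), np.cumsum(features, 0)])
    csq = np.concatenate([[0.0], np.cumsum((features ** 2).sum(1))])

    def cost(i, j):                       # scatter of frames [i, j)
        s = csum[j] - csum[i]
        return csq[j] - csq[i] - (s @ s) / (j - i)

    dp = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    dp[0, 0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, T + 1):
            for i in range(k - 1, j):
                c = dp[k - 1, i] + cost(i, j)
                if c < dp[k, j]:
                    dp[k, j], back[k, j] = c, i
    bounds, j = [], T                     # recover change points
    for k in range(n_segments, 0, -1):
        i = back[k, j]
        bounds.append((i, j))
        j = i
    return bounds[::-1]
```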
arXiv Detail & Related papers (2023-09-20T18:13:32Z) - Long Short-Term Relation Networks for Video Action Detection [155.13392337831166]
Long Short-Term Relation Networks (LSTR) are presented in this paper.
LSTR aggregates and propagates relations to augment features for video action detection.
Extensive experiments are conducted on four benchmark datasets.
arXiv Detail & Related papers (2020-03-31T10:02:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.