Related papers: MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation

MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation

URL: http://arxiv.org/abs/2502.11234v2
Date: Wed, 12 Mar 2025 16:27:37 GMT
Title: MaskFlow: Discrete Flows For Flexible and Efficient Long Video Generation
Authors: Michael Fuest, Vincent Tao Hu, Björn Ommer,
Abstract summary: We introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching.<n>By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences.<n>We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Frechet Video Distance (FVD) competitive with state-of-the-art approaches.
Score: 25.721829124345106
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Generating long, high-quality videos remains a challenge due to the complex interplay of spatial and temporal dynamics and hardware limitations. In this work, we introduce MaskFlow, a unified video generation framework that combines discrete representations with flow-matching to enable efficient generation of high-quality long videos. By leveraging a frame-level masking strategy during training, MaskFlow conditions on previously generated unmasked frames to generate videos with lengths ten times beyond that of the training sequences. MaskFlow does so very efficiently by enabling the use of fast Masked Generative Model (MGM)-style sampling and can be deployed in both fully autoregressive as well as full-sequence generation modes. We validate the quality of our method on the FaceForensics (FFS) and Deepmind Lab (DMLab) datasets and report Frechet Video Distance (FVD) competitive with state-of-the-art approaches. We also provide a detailed analysis on the sampling efficiency of our method and demonstrate that MaskFlow can be applied to both timestep-dependent and timestep-independent models in a training-free manner.

Related papers

Towards Holistic Modeling for Video Frame Interpolation with Auto-regressive Diffusion Transformers [95.68243351895107]
We propose a holistic, video-centric paradigm named textbfLocal textbfDiffusion textbfForcing for textbfVideo textbfFrame textbfInterpolation (LDF-VFI)<n>Our framework is built upon an auto-regressive diffusion transformer that models the entire video sequence to ensure long-range temporal coherence.<n>LDF-VFI achieves state-of-the-art performance on challenging long-sequence benchmarks, demonstrating superior per
arXiv Detail & Related papers (2026-01-21T12:58:52Z)
Sparse-LaViDa: Sparse Multimodal Discrete Diffusion Language Models [63.50827603618498]
We propose Sparse-LaViDa, a modeling framework that truncates unnecessary masked tokens at each inference step to accelerate MDM sampling.<n>Built upon the state-of-the-art unified MDM LaViDa-O, Sparse-LaViDa achieves up to a 2x speedup across diverse tasks.
arXiv Detail & Related papers (2025-12-16T02:06:06Z)
Uniform Discrete Diffusion with Metric Path for Video Generation [103.86033350602908]
Continuous-space video generation has advanced rapidly, while discrete approaches lag behind due to error accumulation and long-duration inconsistency.<n>We present Uniform generative modeling and present Uniform pAth (URSA), a powerful framework that bridges the gap with continuous approaches for scalable video generation.<n>URSA consistently outperforms existing discrete methods and achieves performance comparable to state-of-the-art continuous diffusion methods.
arXiv Detail & Related papers (2025-10-28T17:59:57Z)
Resource-Efficient Motion Control for Video Generation via Dynamic Mask Guidance [2.5941932242768457]
Mask-guided video generation can control video generation through mask motion sequences. Our model enhances existing architectures by incorporating foreground masks for precise text-position matching and motion trajectory control. This approach excels in various video generation tasks, such as video editing and generating artistic videos, outperforming previous methods in terms of consistency and quality.
arXiv Detail & Related papers (2025-03-24T06:53:08Z)
Token-Efficient Long Video Understanding for Multimodal LLMs [101.70681093383365]
STORM is a novel architecture incorporating a dedicated temporal encoder between the image encoder and the Video-LLMs. We show that STORM achieves state-of-the-art results across various long video understanding benchmarks.
arXiv Detail & Related papers (2025-03-06T06:17:38Z)
Taming Teacher Forcing for Masked Autoregressive Video Generation [63.477471494341955]
We introduce MAGI, a hybrid video generation framework that combines masked modeling for intra-frame generation with causal modeling for next-frame generation. Our key innovation, Complete Teacher Forcing (CTF), conditions masked frames on complete observation frames rather than masked ones. CTF significantly outperforms MTF, achieving a +23% improvement in FVD scores on first-frame conditioned video prediction. Experiments show that MAGI can generate long, coherent video sequences exceeding 100 frames, even when trained on as few as 16 frames, highlighting its potential for scalable, high-quality video generation.
arXiv Detail & Related papers (2025-01-21T18:59:31Z)
Optical-Flow Guided Prompt Optimization for Coherent Video Generation [51.430833518070145]
We propose a framework called MotionPrompt that guides the video generation process via optical flow. We optimize learnable token embeddings during reverse sampling steps by using gradients from a trained discriminator applied to random frame pairs. This approach allows our method to generate visually coherent video sequences that closely reflect natural motion dynamics, without compromising the fidelity of the generated content.
arXiv Detail & Related papers (2024-11-23T12:26:52Z)
AU-vMAE: Knowledge-Guide Action Units Detection via Video Masked Autoencoder [38.04963261966939]
We propose a video-level pre-training scheme for facial action units (FAU) detection. At the heart of our design is a pre-trained video feature extractor based on the video-masked autoencoder. Our approach demonstrates substantial enhancement in performance compared to the existing state-of-the-art methods used in BP4D and DISFA FAUs datasets.
arXiv Detail & Related papers (2024-07-16T08:07:47Z)
Mask Propagation for Efficient Video Semantic Segmentation [63.09523058489429]
Video Semantic baseline degradation (VSS) involves assigning a semantic label to each pixel in a video sequence. We propose an efficient mask propagation framework for VSS, called SSSS. Our framework reduces up to 4x FLOPs compared to the per-frame Mask2Former with only up to 2% mIoU on the Cityscapes validation set.
arXiv Detail & Related papers (2023-10-29T09:55:28Z)
MGMAE: Motion Guided Masking for Video Masked Autoencoding [34.80832206608387]
Temporal redundancy has led to a high masking ratio and customized masking strategy in VideoMAE. Our motion guided masking explicitly incorporates motion information to build temporal consistent masking volume. We perform experiments on the datasets of Something-Something V2 and Kinetics-400, demonstrating the superior performance of our MGMAE to the original VideoMAE.
arXiv Detail & Related papers (2023-08-21T15:39:41Z)
Unmasked Teacher: Towards Training-Efficient Video Foundation Models [50.19560876891811]
Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. Our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding.
arXiv Detail & Related papers (2023-03-28T15:39:28Z)
Masked Contrastive Pre-Training for Efficient Video-Text Retrieval [37.05164804180039]
We present a simple yet effective end-to-end Video-language Pre-training (VidLP) framework, Masked Contrastive Video-language Pretraining (MAC) Our MAC aims to reduce video representation's spatial and temporal redundancy in the VidLP model. Coupling these designs enables efficient end-to-end pre-training: reduce FLOPs (60% off), accelerate pre-training (by 3x), and improve performance.
arXiv Detail & Related papers (2022-12-02T05:44:23Z)
Video Mask Transfiner for High-Quality Video Instance Segmentation [102.50936366583106]
Video Mask Transfiner (VMT) is capable of leveraging fine-grained high-resolution features thanks to a highly efficient video transformer structure. Based on our VMT architecture, we design an automated annotation refinement approach by iterative training and self-correction. We compare VMT with the most recent state-of-the-art methods on the HQ-YTVIS, as well as the Youtube-VIS, OVIS and BDD100K MOTS.
arXiv Detail & Related papers (2022-07-28T11:13:37Z)
PGT: A Progressive Method for Training Models on Long Videos [45.935259079953255]
Main-stream method is to split a raw video into clips, leading to incomplete temporal information flow. Inspired by natural language processing techniques dealing with long sentences, we propose to treat videos as serial fragments satisfying Markov property. We empirically demonstrate that it yields significant performance improvements on different models and datasets.
arXiv Detail & Related papers (2021-03-21T06:15:20Z)
Generating Masks from Boxes by Mining Spatio-Temporal Consistencies in Videos [159.02703673838639]
We introduce a method for generating segmentation masks from per-frame bounding box annotations in videos. We use our resulting accurate masks for weakly supervised training of video object segmentation (VOS) networks. The additional data provides substantially better generalization performance leading to state-of-the-art results in both the VOS and more challenging tracking domain.
arXiv Detail & Related papers (2021-01-06T18:56:24Z)

This list is automatically generated from the titles and abstracts of the papers in this site.