Flowception: Temporally Expansive Flow Matching for Video Generation
- URL: http://arxiv.org/abs/2512.11438v1
- Date: Fri, 12 Dec 2025 10:23:47 GMT
- Title: Flowception: Temporally Expansive Flow Matching for Video Generation
- Authors: Tariq Berrada Ifriqi, John Nguyen, Karteek Alahari, Jakob Verbeek, Ricky T. Q. Chen
- Abstract summary: Flowception is a non-autoregressive and variable-length video generation framework. It learns a probability path that interleaves discrete frame insertions with continuous frame denoising. By learning to insert and denoise frames in a sequence, Flowception seamlessly integrates different tasks such as image-to-video generation and video interpolation.
- Score: 35.14803469800522
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Flowception, a novel non-autoregressive and variable-length video generation framework. Flowception learns a probability path that interleaves discrete frame insertions with continuous frame denoising. Compared to autoregressive methods, Flowception alleviates error accumulation and drift, since frame insertion during sampling acts as an efficient compression mechanism for long-term context. Compared to full-sequence flows, our method reduces training FLOPs three-fold, is more amenable to local attention variants, and allows the length of videos to be learned jointly with their content. Quantitative experiments show improved FVD and VBench metrics over autoregressive and full-sequence baselines, which is further validated by qualitative results. Finally, by learning to insert and denoise frames in a sequence, Flowception seamlessly integrates tasks such as image-to-video generation and video interpolation.
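As a rough illustration of the interleaved probability path described in the abstract, the toy sketch below alternates discrete frame insertions with continuous denoising steps, so sequence length grows during sampling. All names and the linear "denoiser" are hypothetical stand-ins, not the paper's actual model:

```python
import random

def toy_denoise(frames, t, dt):
    # Toy stand-in for the learned flow field: linearly shrink each
    # frame toward a clean value of 0.0 as time t runs from 1 to 0.
    return [f * (1.0 - dt / t) for f in frames]

def flowception_sample(n_steps=10, max_len=8, insert_prob=0.5, seed=0):
    """Hypothetical sketch of the interleaved sampler: each step may
    insert a fresh noisy frame (discrete event), then denoises all
    current frames (continuous flow step); video length is thus
    decided jointly with content during sampling."""
    rng = random.Random(seed)
    t, dt = 1.0, 1.0 / n_steps
    frames = [rng.gauss(0.0, 1.0)]  # start from a single noisy frame
    for _ in range(n_steps):
        if len(frames) < max_len and rng.random() < insert_prob:
            # discrete frame insertion, noise scaled to the current time
            pos = rng.randrange(len(frames) + 1)
            frames.insert(pos, rng.gauss(0.0, 1.0) * t)
        frames = toy_denoise(frames, t, dt)  # continuous denoising step
        t -= dt
    return frames
```

With this linear toy flow every inserted frame telescopes to (approximately) its clean value of zero by the end of sampling, regardless of when it was inserted.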
Related papers
- DiTVR: Zero-Shot Diffusion Transformer for Video Restoration [48.97196894658511]
DiTVR is a zero-shot video restoration framework that couples a diffusion transformer with trajectory-aware attention and a flow-consistent sampler. Our attention mechanism aligns tokens along optical flow trajectories, with particular emphasis on vital layers that exhibit the highest sensitivity to temporal dynamics. The flow-guided sampler injects data consistency only into low-frequency bands, preserving high-frequency priors while accelerating sampling.
arXiv Detail & Related papers (2025-08-11T09:54:45Z)
- DIFFVSGG: Diffusion-Driven Online Video Scene Graph Generation [61.59996525424585]
DIFFVSGG is an online VSGG solution that frames this task as an iterative scene graph update problem. It unifies the decoding of three tasks, object classification, bounding box regression, and graph generation, using one shared feature embedding. DIFFVSGG further facilitates continuous temporal reasoning, where predictions for subsequent frames leverage results of past frames as the conditional inputs of LDMs.
arXiv Detail & Related papers (2025-03-18T06:49:51Z)
- Ouroboros-Diffusion: Exploring Consistent Content Generation in Tuning-free Long Video Diffusion [116.40704026922671]
First-in-first-out (FIFO) video diffusion, built on a pre-trained text-to-video model, has recently emerged as an effective approach for tuning-free long video generation. We propose Ouroboros-Diffusion, a novel video denoising framework designed to enhance structural and content (subject) consistency.
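The FIFO mechanism mentioned above can be caricatured as a queue of frames held at increasing noise levels: each step denoises every frame by one level, emits the now-clean head, and enqueues fresh noise at the tail. The sketch below is a toy model of those assumed mechanics, not the actual diffusion pipeline:

```python
from collections import deque
import random

def fifo_sample(n_output=5, queue_len=4, seed=0):
    """Toy sketch of assumed FIFO-diffusion mechanics: the queue holds
    (value, noise_level) pairs with the tail noisiest; one denoising
    pass per iteration advances every frame one level toward clean."""
    rng = random.Random(seed)
    queue = deque(
        (rng.gauss(0.0, 1.0), lvl) for lvl in range(1, queue_len + 1)
    )
    out = []
    while len(out) < n_output:
        # one diagonal denoising step: shrink each value in proportion
        # to its remaining noise level (toy stand-in for the model)
        queue = deque((v * (lvl - 1) / lvl, lvl - 1) for v, lvl in queue)
        if queue[0][1] == 0:                  # head is fully denoised
            out.append(queue.popleft()[0])    # emit the clean frame
            queue.append((rng.gauss(0.0, 1.0), queue_len))  # fresh noise
    return out
```

Because the queue never grows, this yields arbitrarily long outputs at constant memory, which is the appeal of tuning-free FIFO-style generation.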
arXiv Detail & Related papers (2025-01-15T18:59:15Z)
- Coherent Video Inpainting Using Optical Flow-Guided Efficient Diffusion [15.188335671278024]
We propose a new video inpainting framework using optical Flow-guided Efficient Diffusion (FloED) for higher video coherence. FloED employs a dual-branch architecture, where the time-agnostic flow branch restores corrupted flow first, and the multi-scale flow adapters provide motion guidance to the main inpainting branch. Experiments on background restoration and object removal tasks show that FloED outperforms state-of-the-art diffusion-based methods in both quality and efficiency.
arXiv Detail & Related papers (2024-12-01T15:45:26Z)
- Motion-Aware Video Frame Interpolation [49.49668436390514]
We introduce a Motion-Aware Video Frame Interpolation (MA-VFI) network, which directly estimates intermediate optical flow from consecutive frames.
It not only extracts global semantic relationships and spatial details from input frames with different receptive fields, but also effectively reduces the required computational cost and complexity.
arXiv Detail & Related papers (2024-02-05T11:00:14Z)
- StreamFlow: Streamlined Multi-Frame Optical Flow Estimation for Video Sequences [31.210626775505407]
Occlusions between consecutive frames have long posed a significant challenge in optical flow estimation.
We present a Streamlined In-batch Multi-frame (SIM) pipeline tailored to video input, attaining a similar level of time efficiency to two-frame networks.
StreamFlow excels on the challenging KITTI and Sintel datasets, with particular improvement in occluded areas.
arXiv Detail & Related papers (2023-11-28T07:53:51Z)
- AccFlow: Backward Accumulation for Long-Range Optical Flow [70.4251045372285]
This paper proposes a novel recurrent framework called AccFlow for long-range optical flow estimation.
We demonstrate the superiority of backward accumulation over conventional forward accumulation.
Experiments validate the effectiveness of AccFlow in handling long-range optical flow estimation.
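Long-range flow accumulation, which AccFlow builds on, chains consecutive frame-to-frame flows: the long-range displacement at a pixel is the first hop plus the next flow sampled at the warped position. The 1-D toy below shows conventional composition only; AccFlow's contribution, backward accumulation with learned error correction, is not modeled here:

```python
def compose(f01, f12):
    # Chain two per-pixel displacement maps: the accumulated flow at x
    # is the first hop plus the second hop sampled at the warped position.
    return [d0 + f12[x + d0] for x, d0 in enumerate(f01)]

def accumulate(flows):
    """Toy long-range flow from consecutive flows (1-D, integer
    displacements, warps assumed in-bounds). Repeated composition like
    this accumulates error, which motivates AccFlow's learned
    correction at each step."""
    acc = flows[0]
    for f in flows[1:]:
        acc = compose(acc, f)
    return acc
```

Usage: `accumulate([f01, f12, f23])` yields the toy flow from frame 0 directly to frame 3.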
arXiv Detail & Related papers (2023-08-25T01:51:26Z)
- Flow Guidance Deformable Compensation Network for Video Frame Interpolation [33.106776459443275]
We propose a flow guidance deformable compensation network (FGDCN) to overcome the drawbacks of existing motion-based methods.
FGDCN decomposes the frame sampling process into two steps: a flow step and a deformation step.
Experimental results show that the proposed algorithm achieves excellent performance on various datasets with fewer parameters.
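The two-step decomposition attributed to FGDCN, a flow step followed by a deformation step, can be sketched as warping a frame by estimated flow and then blending nearby warped samples at learned offsets. The 1-D toy below is an illustrative assumption; the offsets, weights, and function names are hypothetical, not the network's actual components:

```python
def flow_step(frame, flow):
    # Warp a 1-D frame by integer per-pixel displacements (toy stand-in
    # for the flow step, which applies estimated intermediate flow).
    return [frame[x + d] for x, d in enumerate(flow)]

def deform_step(warped, offsets, weights):
    # Toy deformable compensation: each output pixel is a weighted
    # blend of warped samples at (hypothetically learned) offsets,
    # clamped at the borders.
    n = len(warped)
    return [
        sum(w * warped[min(max(x + o, 0), n - 1)]
            for o, w in zip(offsets, weights))
        for x in range(n)
    ]

def interpolate(frame, flow, offsets=(-1, 0, 1),
                weights=(0.25, 0.5, 0.25)):
    # Flow step first, then deformable compensation on the warped frame.
    return deform_step(flow_step(frame, flow), offsets, weights)
```

The deformation step lets the model correct residual misalignment that a single flow warp cannot capture, which is the usual motivation for deformable compensation.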
arXiv Detail & Related papers (2022-11-22T09:35:14Z)
- Error Compensation Framework for Flow-Guided Video Inpainting [36.626793485786095]
We propose an Error Compensation Framework for Flow-guided Video Inpainting (ECFVI).
Our approach greatly improves the temporal consistency and the visual quality of the completed videos.
arXiv Detail & Related papers (2022-07-21T10:02:57Z)
- FLAVR: Flow-Agnostic Video Representations for Fast Frame Interpolation [97.99012124785177]
FLAVR is a flexible and efficient architecture that uses 3D space-time convolutions to enable end-to-end learning and inference for video frame interpolation.
We demonstrate that FLAVR can serve as a useful self-supervised pretext task for action recognition, optical flow estimation, and motion magnification.
arXiv Detail & Related papers (2020-12-15T18:59:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.