PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing
- URL: http://arxiv.org/abs/2512.24026v1
- Date: Tue, 30 Dec 2025 06:54:57 GMT
- Title: PipeFlow: Pipelined Processing and Motion-Aware Frame Selection for Long-Form Video Editing
- Authors: Mustafa Munir, Md Mostafijur Rahman, Kartikeya Bhardwaj, Paul Whatmough, Radu Marculescu
- Abstract summary: We propose PipeFlow, a scalable, pipelined video editing method. Based on a motion analysis, we identify frames with low motion and propose to skip editing them. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length.
- Score: 29.552187111796403
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Long-form video editing poses unique challenges due to the exponential increase in the computational cost from joint editing and Denoising Diffusion Implicit Models (DDIM) inversion across extended sequences. To address these limitations, we propose PipeFlow, a scalable, pipelined video editing method that introduces three key innovations. First, based on a motion analysis using the Structural Similarity Index Measure (SSIM) and optical flow, we identify frames with low motion and propose to skip editing them. Second, we propose a pipelined task scheduling algorithm that splits a video into multiple segments and performs DDIM inversion and joint editing in parallel based on available GPU memory. Lastly, we leverage a neural network-based interpolation technique to smooth out the border frames between segments and interpolate the previously skipped frames. Our method uniquely scales to longer videos by dividing them into smaller segments, allowing PipeFlow's editing time to increase linearly with video length. In principle, this enables editing of infinitely long videos without the growing per-frame computational overhead encountered by other methods. PipeFlow achieves up to a 9.6X speedup compared to TokenFlow and a 31.7X speedup over Diffusion Motion Transfer (DMT).
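As a rough illustration of the motion-analysis step described above (not the authors' implementation), the sketch below scores consecutive frame pairs with SSIM and Farneback optical flow and flags low-motion frames whose editing could be skipped. The thresholds and the helper names are assumptions made for the example.

```python
# Minimal sketch of SSIM + optical-flow motion scoring for frame skipping.
# Thresholds, function names, and the skip policy are illustrative assumptions,
# not PipeFlow's actual implementation.
import cv2
import numpy as np
from skimage.metrics import structural_similarity as ssim

def motion_scores(frames):
    """Return per-frame (ssim, mean flow magnitude) w.r.t. the previous frame."""
    scores = []
    prev_gray = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        s = ssim(prev_gray, gray)                      # high SSIM -> little change
        flow = cv2.calcOpticalFlowFarneback(
            prev_gray, gray, None,
            pyr_scale=0.5, levels=3, winsize=15,
            iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
        mag = np.linalg.norm(flow, axis=2).mean()      # mean flow magnitude in pixels
        scores.append((s, mag))
        prev_gray = gray
    return scores

def select_frames_to_edit(frames, ssim_thr=0.97, flow_thr=0.5):
    """Keep frame 0 and every frame whose motion exceeds the thresholds;
    the remaining frames are left to be filled in later by interpolation."""
    keep = [0]
    for i, (s, mag) in enumerate(motion_scores(frames), start=1):
        if s < ssim_thr or mag > flow_thr:
            keep.append(i)
    return keep
```

In PipeFlow the skipped frames are later reconstructed by a neural interpolation model; in this sketch they would simply be handed to whatever frame-interpolation network is available.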
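The segment-level pipelining can likewise be sketched with a standard process pool. In the sketch below, `ddim_invert` and `joint_edit` are hypothetical stand-ins for the two per-segment stages, and the segment length is a fixed parameter rather than being derived from available GPU memory as in the paper.

```python
# Illustrative sketch of segment-level parallel processing (not PipeFlow's scheduler).
from concurrent.futures import ProcessPoolExecutor

def ddim_invert(segment):
    # Hypothetical stand-in for DDIM inversion of one segment.
    return segment

def joint_edit(latents):
    # Hypothetical stand-in for joint editing across a segment's frames.
    return latents

def split_into_segments(frames, segment_len):
    return [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]

def process_segment(segment):
    return joint_edit(ddim_invert(segment))

def edit_video(frames, segment_len=32, workers=4):
    segments = split_into_segments(frames, segment_len)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        edited = list(pool.map(process_segment, segments))
    # Border frames between segments would then be smoothed and the skipped
    # low-motion frames filled in by the interpolation stage.
    return [f for seg in edited for f in seg]
```

Because each segment is a bounded amount of work, total editing time grows with the number of segments, which is the linear scaling behavior the abstract describes.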
Related papers
- RFDM: Residual Flow Diffusion Model for Efficient Causal Video Editing [15.876564621196684]
We introduce a causal, efficient video editing model that edits variable-length videos frame by frame. For efficiency, we start from a 2D image-to-image (I2I) diffusion model and adapt it to video-to-video (V2V) editing. We call this the Residual Flow Diffusion Model (RFDM), which focuses the denoising process on changes between consecutive frames.
arXiv Detail & Related papers (2026-02-06T16:56:30Z)
- PipeDiT: Accelerating Diffusion Transformers in Video Generation with Task Pipelining and Model Decoupling [18.079843329153412]
Diffusion transformer (DiT) based models have demonstrated remarkable capabilities. However, their practical deployment is hindered by slow inference speeds and high memory consumption. We propose a novel pipelining framework named PipeDiT to accelerate video generation.
arXiv Detail & Related papers (2025-11-15T06:46:40Z)
- An LMM for Efficient Video Understanding via Reinforced Compression of Video Cubes [85.00111442236499]
This paper presents Quicksviewer, an LMM with a new perceiving paradigm that partitions a video of nonuniform density into varying cubes using Gumbel Softmax. We train the model from a language backbone through three progressive stages, each incorporating lengthy videos (420s at 1 fps on average) thanks to the perceiving efficiency. With only 0.8M total video-text samples for training, our model outperforms the direct baseline employing a fixed partitioning strategy by up to 8.72 in accuracy.
arXiv Detail & Related papers (2025-04-21T17:57:21Z)
- Generating, Fast and Slow: Scalable Parallel Video Generation with Video Interface Networks [21.710127132217526]
We introduce a new paradigm called Video Interface Networks (VINs), which augment DiTs with an abstraction module to enable parallel inference of video chunks. VINs encode global semantics from the noisy input of local chunks, and the encoded representations, in turn, guide DiTs in denoising chunks in parallel. Our approach attains state-of-the-art motion smoothness while using 25-40% fewer FLOPs than full generation.
arXiv Detail & Related papers (2025-03-21T21:13:02Z)
- Look Every Frame All at Once: Video-Ma$^2$mba for Efficient Long-form Video Understanding with Multi-Axis Gradient Checkpointing [52.050036778325094]
Video-Ma$^2$mba is a novel architecture that incorporates State Space Models (SSMs) within the Mamba-2 framework. Our approach significantly reduces the memory footprint compared to standard gradient checkpointing. By maintaining a detailed capture of temporal dynamics, our model improves the accuracy and relevance of responses in long video understanding tasks.
arXiv Detail & Related papers (2024-11-29T04:12:13Z)
- ViBiDSampler: Enhancing Video Interpolation Using Bidirectional Diffusion Sampler [53.98558445900626]
Current image-to-video diffusion models, while powerful in generating videos from a single frame, need adaptation for two-frame conditioned generation. We introduce a novel, bidirectional sampling strategy to address the resulting off-manifold issues without requiring extensive re-noising or fine-tuning. Our method employs sequential sampling along both forward and backward paths, conditioned on the start and end frames, respectively, ensuring more coherent and on-manifold generation of intermediate frames.
arXiv Detail & Related papers (2024-10-08T03:01:54Z)
- A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames [57.758863967770594]
We build on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. We expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed.
arXiv Detail & Related papers (2023-12-12T16:10:19Z)
- LOVECon: Text-driven Training-Free Long Video Editing with ControlNet [9.762680144118061]
This paper aims to bridge the gap, establishing a simple and effective baseline for training-free diffusion model-based long video editing.
We build the pipeline upon ControlNet, which excels at various image editing tasks based on text prompts.
Our method manages to edit videos comprising hundreds of frames according to user requirements.
arXiv Detail & Related papers (2023-10-15T02:39:25Z)
- Neighbor Correspondence Matching for Flow-based Video Frame Synthesis [90.14161060260012]
We introduce a neighbor correspondence matching (NCM) algorithm for flow-based frame synthesis.
NCM is performed in a current-frame-agnostic fashion to establish multi-scale correspondences in the spatial-temporal neighborhoods of each pixel.
The coarse-scale module is designed to leverage neighbor correspondences to capture large motion, while the fine-scale module is more efficient and speeds up the estimation process.
arXiv Detail & Related papers (2022-07-14T09:17:00Z)
- Efficient Video Segmentation Models with Per-frame Inference [117.97423110566963]
We focus on improving the temporal consistency without introducing overhead in inference.
We propose several techniques to learn from the video sequence, including a temporal consistency loss and online/offline knowledge distillation methods.
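A temporal consistency loss of the kind mentioned above is commonly implemented by warping the previous frame's prediction with optical flow and penalizing its disagreement with the current prediction. The PyTorch sketch below illustrates that idea; the backward-warping convention and the L1 penalty are assumptions for the example, not the cited paper's exact formulation.

```python
# Sketch of a flow-warped temporal consistency loss (a common formulation;
# the exact loss used in the cited paper may differ).
import torch
import torch.nn.functional as F

def warp(x, flow):
    """Backward-warp x (N, C, H, W) using flow (N, 2, H, W) given in pixels,
    where flow maps each pixel in the target frame to its source location."""
    n, _, h, w = x.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=x.device),
                            torch.arange(w, device=x.device), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)  # (1, 2, H, W)
    coords = base + flow                                       # sampling coordinates
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0              # normalize to [-1, 1]
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                       # (N, H, W, 2)
    return F.grid_sample(x, grid, align_corners=True)

def temporal_consistency_loss(pred_t, pred_prev, flow_t_to_prev):
    """L1 distance between the current prediction and the previous
    prediction warped into the current frame."""
    return F.l1_loss(pred_t, warp(pred_prev, flow_t_to_prev))
```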
arXiv Detail & Related papers (2022-02-24T23:51:36Z)
- FastRIFE: Optimization of Real-Time Intermediate Flow Estimation for Video Frame Interpolation [0.0]
This paper proposes the FastRIFE algorithm, a speed-optimized modification of the RIFE (Real-Time Intermediate Flow Estimation) model.
All source codes are available at https://gitlab.com/malwinq/interpolation-of-images-for-slow-motion-videos.
arXiv Detail & Related papers (2021-05-27T22:31:40Z)