PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- URL: http://arxiv.org/abs/2602.20583v1
- Date: Tue, 24 Feb 2026 06:11:08 GMT
- Title: PropFly: Learning to Propagate via On-the-Fly Supervision from Pre-trained Video Diffusion Models
- Authors: Wonyong Seo, Jaeho Moon, Jaehyup Lee, Soo Ye Kim, Munchurl Kim
- Abstract summary: PropFly is a training pipeline for propagation-based video editing. PropFly relies on pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Our pipeline enables an adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss.
- Score: 35.59605874012795
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Propagation-based video editing enables precise user control by propagating a single edited frame to the following frames while maintaining the original context such as motion and structure. However, training such models requires large-scale, paired (source and edited) video datasets, which are costly and complex to acquire. Hence, we propose PropFly, a training pipeline for Propagation-based video editing that relies on on-the-Fly supervision from pre-trained video diffusion models (VDMs) instead of requiring off-the-shelf or precomputed paired video editing datasets. Specifically, PropFly leverages one-step clean latent estimations from intermediate noised latents with varying Classifier-Free Guidance (CFG) scales to synthesize diverse pairs of 'source' (low-CFG) and 'edited' (high-CFG) latents on the fly. The source latent provides the structural information of the video, while the edited latent provides the target transformation for learning propagation. Our pipeline enables an additional adapter attached to the pre-trained VDM to learn to propagate edits via a Guidance-Modulated Flow Matching (GMFM) loss, which guides the model to replicate the target transformation. Our on-the-fly supervision ensures that the model learns temporally consistent and dynamic transformations. Extensive experiments demonstrate that PropFly significantly outperforms state-of-the-art methods on various video editing tasks, producing high-quality editing results.
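The pair-synthesis mechanism in the abstract is concrete enough to sketch in code. Below is a minimal PyTorch sketch, assuming a rectified-flow VDM where x_t = (1 - t) * x0 + t * noise and the model predicts the velocity v = noise - x0, so a one-step clean estimate is x0 ≈ x_t - t * v. Every name here (`vdm`, `adapter`, the (B, C, T, H, W) latent layout) is an illustrative assumption, and the plain conditional flow-matching loss is only a stand-in for the paper's GMFM loss, whose exact guidance modulation is not reproduced.

```python
import torch
import torch.nn.functional as F

def cfg_velocity(vdm, x_t, t, cond, scale):
    # Classifier-free guidance: extrapolate from the unconditional
    # prediction toward the conditional one.
    v_c = vdm(x_t, t, cond)
    v_u = vdm(x_t, t, None)  # null conditioning
    return v_u + scale * (v_c - v_u)

def one_step_clean(vdm, x_t, t, cond, scale):
    # Under rectified flow, x_t = (1 - t) * x0 + t * eps and the model
    # predicts v = eps - x0, so a one-step x0 estimate is x_t - t * v.
    # t must be a scalar or broadcastable to x_t's shape.
    return x_t - t * cfg_velocity(vdm, x_t, t, cond, scale)

def make_pair(vdm, x_t, t, cond, cfg_low=1.0, cfg_high=7.5):
    # On-the-fly pair: the low-CFG estimate keeps the source structure,
    # the high-CFG estimate carries the stronger target transformation.
    with torch.no_grad():
        source = one_step_clean(vdm, x_t, t, cond, cfg_low)
        edited = one_step_clean(vdm, x_t, t, cond, cfg_high)
    return source, edited

def propagation_step(vdm, adapter, opt, x_t, t, cond):
    # Train the adapter to propagate the edit: given the source latent
    # and the edited first frame, match the flow toward the edited video.
    source, edited = make_pair(vdm, x_t, t, cond)
    first_edit = edited[:, :, :1]                     # (B, C, 1, H, W)
    eps = torch.randn_like(edited)
    s = torch.rand(edited.size(0), device=edited.device).view(-1, 1, 1, 1, 1)
    z_s = (1 - s) * edited + s * eps                  # re-noised target
    v_pred = adapter(z_s, s, source, first_edit)
    loss = F.mse_loss(v_pred, eps - edited)           # flow-matching target
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

The uniform MSE above is the simplest flow-matching baseline; per the abstract, the actual GMFM loss additionally modulates the objective by the guidance signal.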
Related papers
- VIVA: VLM-Guided Instruction-Based Video Editing with Reward Optimization [31.89256250882701]
VIVA is a scalable framework for instruction-based video editing. It uses VLM-guided encoding and reward optimization. We show that VIVA achieves superior instruction following, generalization, and editing quality over state-of-the-art methods.
arXiv Detail & Related papers (2025-12-18T18:58:42Z)
- CtrlVDiff: Controllable Video Generation via Unified Multimodal Video Diffusion [62.04833878126661]
We tackle the dual challenges of video understanding and controllable video generation within a unified diffusion framework. We propose CtrlVDiff, a unified diffusion model trained with a Hybrid Modality Control Strategy (HMCS) that routes and fuses features from depth, normals, segmentation, edges, and graphics-based intrinsics (albedo, roughness, metallic). Across understanding and generation benchmarks, CtrlVDiff delivers superior controllability and fidelity, enabling layer-wise edits (relighting, material adjustment, object insertion) and surpassing state-of-the-art baselines while remaining robust when some modalities are unavailable.
arXiv Detail & Related papers (2025-11-26T07:27:11Z)
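The one-line description of HMCS above suggests a fusion scheme that is easy to sketch. The following is a hypothetical reading, not CtrlVDiff's actual architecture: per-modality encoders fused by summation, with random modality dropout during training so inference tolerates missing modalities.

```python
import random
import torch
import torch.nn as nn

class HybridModalityControl(nn.Module):
    def __init__(self, modalities, in_ch=3, feat_ch=64):
        super().__init__()
        # One lightweight encoder per control modality (illustrative only).
        self.encoders = nn.ModuleDict({
            m: nn.Conv2d(in_ch, feat_ch, 3, padding=1) for m in modalities
        })

    def forward(self, controls, p_drop=0.3):
        # controls: dict modality -> tensor (B, C, H, W); any subset may be
        # absent. Missing or dropped modalities simply contribute nothing.
        fused = None
        for name, x in controls.items():
            if self.training and random.random() < p_drop:
                continue  # simulate a missing modality during training
            f = self.encoders[name](x)
            fused = f if fused is None else fused + f
        return fused  # None if every modality was dropped or missing

# Usage: fuse whatever control maps are available for this batch.
hmcs = HybridModalityControl(["depth", "normals", "segmentation", "edges", "albedo"])
ctrl = hmcs({"depth": torch.randn(1, 3, 64, 64), "edges": torch.randn(1, 3, 64, 64)})
```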
- VALA: Learning Latent Anchors for Training-Free and Temporally Consistent Video Editing [29.516179213427694]
We propose VALA, a variational alignment module that adaptively selects key frames and compresses their latent features into semantic anchors for consistent video editing. Our method can be fully integrated into training-free text-to-image based video editing models.
arXiv Detail & Related papers (2025-10-27T03:44:11Z)
- Self-Attention Decomposition For Training Free Diffusion Editing [18.8152476816527]
A key step toward controllability is to identify interpretable directions in the model's latent representations. We propose an analytical method that derives semantic editing directions directly from the pretrained parameters of diffusion models.
arXiv Detail & Related papers (2025-10-26T12:22:56Z)
- Learning an Image Editing Model without Image Editing Pairs [83.03646586929638]
Recent image editing models have achieved impressive results while following natural language editing instructions. However, they rely on supervised fine-tuning with large datasets of input-target pairs. Current workarounds use synthetic training pairs that leverage the zero-shot capabilities of existing models. We present a new training paradigm that eliminates the need for paired data entirely.
arXiv Detail & Related papers (2025-10-16T17:59:57Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- Boost Video Frame Interpolation via Motion Adaptation [73.42573856943923]
Video frame interpolation (VFI) is a challenging task that aims to generate intermediate frames between two consecutive frames in a video.
Existing learning-based VFI methods have achieved great success, but they still suffer from limited generalization ability.
We propose a novel optimization-based VFI method that can adapt to unseen motions at test time.
arXiv Detail & Related papers (2023-06-24T10:44:02Z)
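Test-time motion adaptation of this kind is commonly implemented with a self-supervised objective on the input clip itself: re-predict a known frame from its neighbors, take a few gradient steps, then interpolate. The sketch below shows that generic pattern under our own assumptions (`vfi_net` as any two-frame interpolator), not the paper's exact procedure.

```python
import copy
import torch
import torch.nn.functional as F

def adapt_and_interpolate(vfi_net, f0, f1, f2, steps=5, lr=1e-5):
    net = copy.deepcopy(vfi_net)            # keep the pretrained weights intact
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    for _ in range(steps):                  # adapt to this clip's motion
        pred_mid = net(f0, f2)              # re-predict the known middle frame
        loss = F.l1_loss(pred_mid, f1)
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():                   # now interpolate the target pair
        return net(f1, f2)
```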
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Transform-Equivariant Consistency Learning for Temporal Sentence Grounding [66.10949751429781]
We introduce a novel Equivariant Consistency Regulation Learning framework to learn more discriminative representations for each video.
Our motivation comes from the observation that the temporal boundary of the query-guided activity should be consistently predicted.
In particular, we devise a self-supervised consistency loss module to enhance the completeness and smoothness of the augmented video.
arXiv Detail & Related papers (2023-05-06T19:29:28Z)
- Structure and Content-Guided Video Synthesis with Diffusion Models [13.464501385061032]
We present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output.
Our model is trained jointly on images and videos, and it exposes explicit control of temporal consistency through a novel guidance method.
arXiv Detail & Related papers (2023-02-06T18:50:23Z)