Consistent Video Editing as Flow-Driven Image-to-Video Generation
- URL: http://arxiv.org/abs/2506.07713v2
- Date: Fri, 13 Jun 2025 09:10:58 GMT
- Title: Consistent Video Editing as Flow-Driven Image-to-Video Generation
- Authors: Ge Wang, Songlin Fan, Hangxu Liu, Quanjian Song, Hewei Wang, Jinfeng Xu,
- Abstract summary: FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence that aligns with the deformed shape. Experimental results on DAVIS-EDIT, with improvements of 13.67% and 50.66% on DOVER and warping error respectively, illustrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art methods.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the prosperity of video diffusion models, downstream applications such as video editing have advanced significantly without incurring much computational cost. One particular challenge in this task lies in the motion transfer process from the source video to the edited one, which requires accounting for shape deformation in between while maintaining temporal consistency in the generated video sequence. However, existing methods fail to model complicated motion patterns for video editing, and are fundamentally limited to object replacement, so that tasks with non-rigid object motions such as multi-object and portrait editing are largely neglected. In this paper, we observe that optical flows offer a promising alternative in complex motion modeling, and present FlowV2V to re-investigate video editing as a task of flow-driven Image-to-Video (I2V) generation. Specifically, FlowV2V decomposes the entire pipeline into first-frame editing and conditional I2V generation, and simulates a pseudo flow sequence that aligns with the deformed shape, thus ensuring consistency during editing. Experimental results on DAVIS-EDIT, with improvements of 13.67% and 50.66% on DOVER and warping error respectively, illustrate the superior temporal consistency and sample quality of FlowV2V compared to existing state-of-the-art methods. Furthermore, we conduct comprehensive ablation studies to analyze the internal functionalities of the first-frame paradigm and flow alignment in the proposed method.
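The two-stage decomposition described in the abstract can be sketched as follows. All function names here are illustrative placeholders rather than the authors' released API, and simple backward warping stands in for the conditional I2V diffusion model:

```python
# Hypothetical sketch of a FlowV2V-style pipeline: edit the first frame,
# simulate a pseudo flow sequence, then propagate the edit along the flows.
import numpy as np

def edit_first_frame(frame, edit_fn):
    """Stage 1: apply any off-the-shelf image-editing tool to the first frame."""
    return edit_fn(frame)

def simulate_pseudo_flow(source_flows, shape_scale=1.0):
    """Stage 2a: derive a pseudo flow sequence from the source flows.
    Rescaling motion magnitude is a toy stand-in for aligning the flow
    with the deformed shape; the actual procedure is more involved."""
    return [f * shape_scale for f in source_flows]

def warp(frame, flow):
    """Backward-warp a frame by a dense flow field (nearest neighbour).
    flow[..., 0] is the horizontal offset, flow[..., 1] the vertical one."""
    h, w = frame.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    src_x = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return frame[src_y, src_x]

def flow_driven_i2v(first_frame, source_flows, edit_fn):
    """Stage 2b: propagate the edited first frame along the pseudo flows.
    A real system would condition an I2V diffusion model on these flows;
    here the warp alone stands in for that generator."""
    edited = edit_first_frame(first_frame, edit_fn)
    frames = [edited]
    for flow in simulate_pseudo_flow(source_flows):
        frames.append(warp(frames[-1], flow))
    return frames
```

The sketch only illustrates the control flow of the decomposition; in the paper, the pseudo flow conditions a learned I2V generator rather than a deterministic warp.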
Related papers
- DFVEdit: Conditional Delta Flow Vector for Zero-shot Video Editing [18.86599058385878]
We present DFVEdit, an efficient zero-shot video editing method tailored for Video DiTs. DFVEdit eliminates the need for both attention modification and fine-tuning by operating directly on clean latents via flow transformation. Building upon this foundation, we propose the Conditional Delta Flow Vector (CDFV) and integrate Implicit Cross Attention (ICA) guidance.
arXiv Detail & Related papers (2025-06-26T03:10:13Z) - VideoDirector: Precise Video Editing via Text-to-Video Models [45.53826541639349]
Current video editing methods rely on text-to-video (T2V) models, which inherently lack temporally coherent generative ability. We propose a spatial-temporal decoupled guidance (STDG) and multi-frame null-text optimization strategy to provide pivotal temporal cues for more precise pivotal inversion. Experimental results demonstrate that our method effectively harnesses the powerful temporal generation capabilities of T2V models.
arXiv Detail & Related papers (2024-11-26T16:56:53Z) - Taming Rectified Flow for Inversion and Editing [57.3742655030493]
Rectified-flow-based diffusion transformers like FLUX and OpenSora have demonstrated outstanding performance in the field of image and video generation. Despite their robust generative capabilities, these models often struggle with inaccurate inversion. We propose RF-Solver, a training-free sampler that effectively enhances inversion precision by mitigating the errors in the inversion process of rectified flow.
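As a toy illustration of why naive rectified-flow inversion accumulates error (the problem this line of work addresses), consider Euler integration of the flow ODE dx/dt = v(x, t): reversing the forward steps is only exact when the velocity does not depend on the state. The velocity fields used below are simple stand-ins, not learned models:

```python
# Toy demonstration of Euler sampling vs. naive Euler inversion for a flow ODE.
def euler_sample(x, v, n_steps=50):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 (generation)."""
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * v(x, k * dt)
    return x

def euler_invert(x, v, n_steps=50):
    """Naively reverse the Euler steps from t = 1 back to t = 0.
    Each step evaluates v at the *current* state, whereas an exact inverse
    would need v at the (unknown) previous state; for state-dependent v
    this mismatch accumulates into inversion error."""
    dt = 1.0 / n_steps
    for k in reversed(range(n_steps)):
        x = x - dt * v(x, k * dt)
    return x
```

With a state-independent velocity, `euler_invert(euler_sample(x, v), v)` recovers `x` exactly; with a state-dependent one it does not, which is the kind of discrepancy higher-order, training-free samplers are designed to shrink.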
arXiv Detail & Related papers (2024-11-07T14:29:02Z) - HOI-Swap: Swapping Objects in Videos with Hand-Object Interaction Awareness [57.18183962641015]
We present HOI-Swap, a video editing framework trained in a self-supervised manner.
The first stage focuses on object swapping in a single frame with HOI awareness.
The second stage extends the single-frame edit across the entire sequence.
arXiv Detail & Related papers (2024-06-11T22:31:29Z) - Zero-Shot Video Editing through Adaptive Sliding Score Distillation [51.57440923362033]
This study proposes a novel paradigm of video-based score distillation, facilitating direct manipulation of original video content.
We propose an Adaptive Sliding Score Distillation strategy, which incorporates both global and local video guidance to reduce the impact of editing errors.
arXiv Detail & Related papers (2024-06-07T12:33:59Z) - I2VEdit: First-Frame-Guided Video Editing via Image-to-Video Diffusion Models [18.36472998650704]
We introduce a novel and generic solution that extends the applicability of image editing tools to videos by propagating edits from a single frame to the entire video using a pre-trained image-to-video model.
Our method, dubbed I2VEdit, adaptively preserves the visual and motion integrity of the source video depending on the extent of the edits.
arXiv Detail & Related papers (2024-05-26T11:47:40Z) - AnyV2V: A Tuning-Free Framework For Any Video-to-Video Editing Tasks [41.640692114423544]
We introduce AnyV2V, a novel tuning-free paradigm designed to simplify video editing.
AnyV2V can leverage any existing image editing tools to support an extensive array of video editing tasks.
Our evaluation shows that AnyV2V achieved CLIP-scores comparable to other baseline methods.
arXiv Detail & Related papers (2024-03-21T15:15:00Z) - FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis [66.2611385251157]
Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos.
This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video.
arXiv Detail & Related papers (2023-12-29T16:57:12Z) - Edit Temporal-Consistent Videos with Image Diffusion Model [49.88186997567138]
Large-scale text-to-image (T2I) diffusion models have been extended for text-guided video editing.
The proposed method achieves state-of-the-art performance in both video temporal consistency and video editing capability.
arXiv Detail & Related papers (2023-08-17T16:40:55Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained TTI model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video; (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z)
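The 2D-to-3D inflation step mentioned in the Edit-A-Video summary is commonly initialized so that the video model initially reproduces the image model frame by frame. Below is a minimal sketch of one such inflation scheme, assuming a conv-weight layout of (C_out, C_in, kH, kW); this is a generic illustration of the technique, not that paper's exact procedure:

```python
# Inflate a pretrained 2D conv kernel into a pseudo-3D (temporal) kernel.
import numpy as np

def inflate_kernel_2d_to_3d(kernel_2d, kt=3):
    """Inflate a 2D conv kernel of shape (C_out, C_in, kH, kW) into a 3D
    kernel of shape (C_out, C_in, kt, kH, kW). The pretrained weights are
    placed in the central temporal slice and all other slices are zero,
    so at initialization the inflated layer processes each frame exactly
    as the original 2D layer did."""
    cout, cin, kh, kw = kernel_2d.shape
    kernel_3d = np.zeros((cout, cin, kt, kh, kw), dtype=kernel_2d.dtype)
    kernel_3d[:, :, kt // 2] = kernel_2d
    return kernel_3d
```

The zero-initialized temporal slices are then free to learn cross-frame dependencies during tuning without disturbing the pretrained spatial behavior.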
This list is automatically generated from the titles and abstracts of the papers on this site.