FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
- URL: http://arxiv.org/abs/2506.05046v1
- Date: Thu, 05 Jun 2025 13:54:40 GMT
- Title: FlowDirector: Training-Free Flow Steering for Precise Text-to-Video Editing
- Authors: Guangzhao Li, Yanming Yang, Chenxi Song, Chi Zhang
- Abstract summary: FlowDirector is a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism.
- Score: 2.7248421583285265
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Text-driven video editing aims to modify video content according to natural language instructions. While recent training-free approaches have made progress by leveraging pre-trained diffusion models, they typically rely on inversion-based techniques that map input videos into the latent space, which often leads to temporal inconsistencies and degraded structural fidelity. To address this, we propose FlowDirector, a novel inversion-free video editing framework. Our framework models the editing process as a direct evolution in data space, guiding the video via an Ordinary Differential Equation (ODE) to smoothly transition along its inherent spatiotemporal manifold, thereby preserving temporal coherence and structural details. To achieve localized and controllable edits, we introduce an attention-guided masking mechanism that modulates the ODE velocity field, preserving non-target regions both spatially and temporally. Furthermore, to address incomplete edits and enhance semantic alignment with editing instructions, we present a guidance-enhanced editing strategy inspired by Classifier-Free Guidance, which leverages differential signals between multiple candidate flows to steer the editing trajectory toward stronger semantic alignment without compromising structural consistency. Extensive experiments across benchmarks demonstrate that FlowDirector achieves state-of-the-art performance in instruction adherence, temporal consistency, and background preservation, establishing a new paradigm for efficient and coherent video editing without inversion.
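The abstract outlines three mechanisms: editing as an ODE evolved directly in data space, an attention-guided mask that modulates the ODE velocity field, and a Classifier-Free-Guidance-style term built from the difference between candidate flows. The paper's exact formulation is not reproduced here; the following is a minimal illustrative sketch under assumed names (`velocity_model`, `edit_mask`, the Euler integrator, and the specific two-flow guidance form are assumptions, not the authors' implementation).

```python
import torch

def flow_edit(video, velocity_model, src_prompt, edit_prompt, edit_mask,
              num_steps=50, guidance_scale=5.0):
    """Sketch of inversion-free, ODE-based video editing (illustrative only).

    Assumed interfaces (not from the paper):
      - velocity_model(x, t, prompt): text-conditioned flow model returning a
        velocity field with the same shape as x.
      - edit_mask: attention-derived soft mask in [0, 1], shaped like video,
        with 1 marking regions to edit and 0 marking regions to preserve.
    """
    x = video.clone()                      # start from the source video; no inversion step
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.tensor(i / num_steps)

        # Two candidate flows: one conditioned on the source description,
        # one on the editing instruction.
        v_src = velocity_model(x, t, src_prompt)
        v_edit = velocity_model(x, t, edit_prompt)

        # CFG-inspired guidance: amplify the differential signal between the
        # candidate flows to push the trajectory toward the edit semantics.
        v_guided = v_edit + guidance_scale * (v_edit - v_src)

        # Attention-guided masking: apply the edit velocity only in the target
        # region so non-target content stays fixed in space and time.
        v = edit_mask * v_guided

        # One explicit Euler step of the editing ODE in data space.
        x = x + dt * v
    return x
```

The plain Euler integrator and the single `v_edit - v_src` differential are simplifications; the paper may use a different solver and combine more than two candidate flows.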
Related papers
- Edit as You See: Image-guided Video Editing via Masked Motion Modeling [18.89936405508778]
We propose a novel Image-guided Video Editing Diffusion model, termed IVEDiff. IVEDiff is built on top of image editing models and is equipped with learnable motion modules to maintain the temporal consistency of edited videos. Our method generates temporally smooth edited videos while robustly handling various editing objects with high quality.
arXiv Detail & Related papers (2025-01-08T07:52:12Z)
- Re-Attentional Controllable Video Diffusion Editing [48.052781838711994]
We propose a Re-Attentional Controllable Video Diffusion Editing (ReAtCo) method. To align the spatial placement of the target objects with the edited text prompt in a training-free manner, we propose Re-Attentional Diffusion (RAD). RAD refocuses the cross-attention activation responses between the edited text prompt and the target video during the denoising stage, resulting in a spatially location-aligned and semantically high-fidelity manipulated video.
arXiv Detail & Related papers (2024-12-16T12:32:21Z)
- VIA: Unified Spatiotemporal Video Adaptation Framework for Global and Local Video Editing [91.60658973688996]
We introduce VIA, a unified spatiotemporal Video Adaptation framework for global and local video editing, pushing the limits of consistently editing minute-long videos. To ensure local consistency within individual frames, we design test-time editing adaptation to adapt a pre-trained image editing model. We show that VIA can achieve consistent long-video editing in minutes, unlocking the potential for advanced video editing tasks over long video sequences.
arXiv Detail & Related papers (2024-06-18T17:51:37Z)
- COVE: Unleashing the Diffusion Feature Correspondence for Consistent Video Editing [57.76170824395532]
Video editing is an emerging task, in which most current methods adopt the pre-trained text-to-image (T2I) diffusion model to edit the source video. We propose COrrespondence-guided Video Editing (COVE) to achieve high-quality and consistent video editing. COVE can be seamlessly integrated into the pre-trained T2I diffusion model without the need for extra training or optimization.
arXiv Detail & Related papers (2024-06-13T06:27:13Z)
- FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces patches on the same flow path across different frames to attend to each other in the attention module (see the illustrative sketch after this list).
Results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance.
arXiv Detail & Related papers (2023-10-09T17:59:53Z)
- Ground-A-Video: Zero-shot Grounded Video Editing using Text-to-image Diffusion Models [65.268245109828]
Ground-A-Video is a video-to-video translation framework for multi-attribute video editing.
It attains temporally consistent editing of input videos in a training-free manner.
Experiments and applications demonstrate that Ground-A-Video's zero-shot capacity outperforms other baseline methods in terms of edit-accuracy and frame consistency.
arXiv Detail & Related papers (2023-10-02T11:28:37Z)
- InstructVid2Vid: Controllable Video Editing with Natural Language Instructions [97.17047888215284]
InstructVid2Vid is an end-to-end diffusion-based methodology for video editing guided by human language instructions.
Our approach empowers video manipulation guided by natural language directives, eliminating the need for per-example fine-tuning or inversion.
arXiv Detail & Related papers (2023-05-21T03:28:13Z)
- Structure and Content-Guided Video Synthesis with Diffusion Models [13.464501385061032]
We present a structure and content-guided video diffusion model that edits videos based on visual or textual descriptions of the desired output.
Our model is trained jointly on images and videos, which also exposes explicit control over temporal consistency through a novel guidance method.
arXiv Detail & Related papers (2023-02-06T18:50:23Z)
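The FLATTEN entry above describes restricting attention so that patches lying on the same optical-flow path attend to one another across frames. Below is a minimal sketch of that idea, assuming precomputed integer trajectory IDs per patch and plain scaled dot-product attention; both the interface and the masking scheme are assumptions, not FLATTEN's actual implementation.

```python
import torch

def flow_path_attention(tokens, path_ids):
    """Attention restricted to patches on the same optical-flow path (illustrative).

    Assumed inputs (not FLATTEN's actual interface):
      - tokens: (frames, patches, dim) patch features from all frames.
      - path_ids: (frames, patches) integer ID of the flow trajectory each
        patch lies on; patches on the same trajectory share an ID.
    """
    f, p, d = tokens.shape
    q = k = v = tokens.reshape(f * p, d)           # flatten frames and patches
    ids = path_ids.reshape(f * p)

    # Boolean mask: True where the query and key patches share a flow path.
    same_path = ids[:, None] == ids[None, :]

    scores = (q @ k.T) / d ** 0.5                  # scaled dot-product attention scores
    scores = scores.masked_fill(~same_path, float("-inf"))
    weights = scores.softmax(dim=-1)               # each patch attends only along its own path
    return (weights @ v).reshape(f, p, d)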