DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
- URL: http://arxiv.org/abs/2510.02253v2
- Date: Thu, 23 Oct 2025 17:58:02 GMT
- Title: DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing
- Authors: Zihan Zhou, Shilin Lu, Shuli Leng, Shaocong Zhang, Zhuming Lian, Xinlei Yu, Adams Wai-Kin Kong
- Abstract summary: This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow.
To overcome the limitations of point-wise motion supervision on DiT features, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision.
Experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines.
- Score: 19.031261008813644
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Drag-based image editing has long suffered from distortions in the target region, largely because the priors of earlier base models, Stable Diffusion, are insufficient to project optimized latents back onto the natural image manifold. With the shift from UNet-based DDPMs to more scalable DiT with flow matching (e.g., SD3.5, FLUX), generative priors have become significantly stronger, enabling advances across diverse editing tasks. However, drag-based editing has yet to benefit from these stronger priors. This work proposes the first framework to effectively harness FLUX's rich prior for drag-based editing, dubbed DragFlow, achieving substantial gains over baselines. We first show that directly applying point-based drag editing to DiTs performs poorly: unlike the highly compressed features of UNets, DiT features are insufficiently structured to provide reliable guidance for point-wise motion supervision. To overcome this limitation, DragFlow introduces a region-based editing paradigm, where affine transformations enable richer and more consistent feature supervision. Additionally, we integrate pretrained open-domain personalization adapters (e.g., IP-Adapter) to enhance subject consistency, while preserving background fidelity through gradient mask-based hard constraints. Multimodal large language models (MLLMs) are further employed to resolve task ambiguities. For evaluation, we curate a novel Region-based Dragging benchmark (ReD Bench) featuring region-level dragging instructions. Extensive experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines, setting a new state-of-the-art in drag-based image editing. Code and datasets will be publicly available upon publication.
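The abstract's core idea, supervising an entire region's features under an affine transform rather than matching individual points, can be illustrated with a minimal sketch. This is not the paper's implementation; the function names, the NumPy feature map, the nearest-neighbor sampling, and the L1 loss are all assumptions chosen for brevity:

```python
import numpy as np

def affine_warp_coords(coords, A, t):
    """Apply the affine map x' = A x + t to an (N, 2) array of (x, y) coordinates."""
    return coords @ A.T + t

def region_supervision_loss(feat, region_coords, A, t):
    """Compare features of a source region against features sampled at the
    affinely transformed target locations (region-based motion supervision).

    feat:          (H, W, C) feature map, e.g. from a denoiser's intermediate layer
    region_coords: (N, 2) integer (x, y) pixel coordinates covering the source region
    A, t:          affine matrix (2, 2) and translation (2,) defining the drag
    """
    H, W, _ = feat.shape
    src = feat[region_coords[:, 1], region_coords[:, 0]]          # (N, C) source features
    tgt_coords = np.round(affine_warp_coords(region_coords, A, t)).astype(int)
    tgt_coords[:, 0] = np.clip(tgt_coords[:, 0], 0, W - 1)        # stay inside the map
    tgt_coords[:, 1] = np.clip(tgt_coords[:, 1], 0, H - 1)
    tgt = feat[tgt_coords[:, 1], tgt_coords[:, 0]]                # (N, C) target features
    return np.mean(np.abs(tgt - src))                             # L1 matching over the region
```

In an actual optimization loop the source features would be detached (held fixed) and the loss backpropagated into the latents, and differentiable bilinear sampling (e.g. `grid_sample` in a PyTorch version) would replace the nearest-neighbor lookup; the sketch only shows how a whole region, rather than isolated points, supplies the supervision signal.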
Related papers
- ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention [81.12932992203885]
We introduce ContextDrag, a new paradigm for drag-based editing.
By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details.
arXiv Detail & Related papers (2025-12-09T10:51:45Z) - TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation [51.72432192816058]
We propose a unified diffusion-based framework for joint drag-text image editing.
Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, dynamically balancing the influence of drag and text conditions during denoising.
arXiv Detail & Related papers (2025-09-26T05:39:03Z) - LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence [31.686266704795273]
We introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers.
LazyDrag directly eliminates the reliance on implicit point matching.
It unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach.
arXiv Detail & Related papers (2025-09-15T17:59:47Z) - FlowDrag: 3D-aware Drag-based Image Editing with Mesh-guided Deformation Vector Flow Fields [20.793887576117527]
We propose FlowDrag, which leverages geometric information for more accurate and coherent transformations.
Our approach constructs a 3D mesh from the image, using an energy function to guide mesh deformation based on user-defined drag points.
The resulting mesh displacements are projected into 2D and incorporated into a UNet denoising process, enabling precise handle-to-target point alignment.
arXiv Detail & Related papers (2025-07-11T03:18:52Z) - DragNeXt: Rethinking Drag-Based Image Editing [81.9430401732008]
Drag-Based Image Editing (DBIE) allows users to manipulate images by directly dragging objects within them.
Among its key challenges, point-based drag is often highly ambiguous and difficult to align with users' intentions.
We propose a simple-yet-effective editing framework, dubbed DragNeXt.
arXiv Detail & Related papers (2025-06-09T10:24:29Z) - DragLoRA: Online Optimization of LoRA Adapters for Drag-based Image Editing in Diffusion Model [14.144755955903634]
DragLoRA is a novel framework that integrates LoRA adapters into the drag-based editing pipeline.<n>We show that DragLoRA significantly enhances the control precision and computational efficiency for drag-based image editing.
arXiv Detail & Related papers (2025-05-18T13:52:19Z) - Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting [55.14822004410817]
We introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting.
It enables precise control over the extent of editing through the input of 3D masks and pairs of control points.
DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results.
arXiv Detail & Related papers (2025-01-30T18:51:54Z) - Task-Oriented Diffusion Inversion for High-Fidelity Text-based Editing [60.730661748555214]
We introduce Task-Oriented Diffusion Inversion (TODInv), a novel framework that inverts and edits real images tailored to specific editing tasks.
TODInv seamlessly integrates inversion and editing through reciprocal optimization, ensuring both high fidelity and precise editability.
arXiv Detail & Related papers (2024-08-23T22:16:34Z) - LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos [101.59710862476041]
We present LightningDrag, a rapid approach enabling high quality drag-based image editing in 1 second.
Unlike most previous methods, we redefine drag-based editing as a conditional generation task.
Our approach can significantly outperform previous methods in terms of accuracy and consistency.
arXiv Detail & Related papers (2024-05-22T15:14:00Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z)