DragNeXt: Rethinking Drag-Based Image Editing
- URL: http://arxiv.org/abs/2506.07611v1
- Date: Mon, 09 Jun 2025 10:24:29 GMT
- Title: DragNeXt: Rethinking Drag-Based Image Editing
- Authors: Yuan Zhou, Junbao Zhou, Qingshan Xu, Kesen Zhao, Yuxuan Wang, Hao Fei, Richang Hong, Hanwang Zhang
- Abstract summary: Drag-Based Image Editing (DBIE) allows users to manipulate images by directly dragging objects within them. It faces two key challenges: (i) point-based drag is often highly ambiguous and difficult to align with users' intentions. We propose a simple-yet-effective editing framework, dubbed DragNeXt.
- Score: 81.9430401732008
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Drag-Based Image Editing (DBIE), which allows users to manipulate images by directly dragging objects within them, has recently attracted much attention from the community. However, it faces two key challenges: (i) point-based drag is often highly ambiguous and difficult to align with users' intentions; (ii) current DBIE methods primarily rely on alternating between motion supervision and point tracking, which is not only cumbersome but also fails to produce high-quality results. These limitations motivate us to explore DBIE from a new perspective -- redefining it as deformation, rotation, and translation of user-specified handle regions. Thereby, by requiring users to explicitly specify both drag areas and types, we can effectively address the ambiguity issue. Furthermore, we propose a simple-yet-effective editing framework, dubbed DragNeXt. It unifies DBIE as a Latent Region Optimization (LRO) problem and solves it through Progressive Backward Self-Intervention (PBSI), simplifying the overall procedure of DBIE while further enhancing quality by fully leveraging region-level structure information and progressive guidance from intermediate drag states. We validate DragNeXt on our NextBench, and extensive experiments demonstrate that our proposed method can significantly outperform existing approaches. Code will be released on GitHub.
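To make the region-level reformulation more concrete, below is a minimal, hypothetical Python sketch of a handle-region drag instruction and a single latent-region optimization step. The names (DragType, RegionDragInstruction, lro_step) and the simple region-mean loss are illustrative assumptions, not the authors' implementation of LRO or PBSI.

```python
# Hypothetical sketch (not the authors' code): a region-level drag instruction
# and one illustrative latent-region optimization step.
from dataclasses import dataclass
from enum import Enum

import torch
import torch.nn.functional as F


class DragType(Enum):
    TRANSLATION = "translation"
    ROTATION = "rotation"
    DEFORMATION = "deformation"


@dataclass
class RegionDragInstruction:
    handle_mask: torch.Tensor   # (H, W) binary mask of the user-specified handle region
    target_mask: torch.Tensor   # (H, W) binary mask of where the region should end up
    drag_type: DragType         # explicit drag type removes point-level ambiguity


def lro_step(latent: torch.Tensor, instr: RegionDragInstruction,
             lr: float = 0.01) -> torch.Tensor:
    """One illustrative step: pull features inside the target region toward the
    (detached) features of the handle region. latent has shape (1, C, h, w)."""
    latent = latent.detach().clone().requires_grad_(True)
    h = F.interpolate(instr.handle_mask[None, None].float(), latent.shape[-2:])
    t = F.interpolate(instr.target_mask[None, None].float(), latent.shape[-2:])
    # Match region-averaged features; real methods exploit richer region-level structure.
    handle_feat = (latent * h).sum(dim=(-2, -1)) / h.sum().clamp(min=1)
    target_feat = (latent * t).sum(dim=(-2, -1)) / t.sum().clamp(min=1)
    loss = F.mse_loss(target_feat, handle_feat.detach())
    loss.backward()
    with torch.no_grad():
        latent = latent - lr * latent.grad
    return latent.detach()
```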
Related papers
- ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention [81.12932992203885]
We introduce ContextDrag, a new paradigm for drag-based editing. By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details.
arXiv Detail & Related papers (2025-12-09T10:51:45Z) - Streaming Drag-Oriented Interactive Video Manipulation: Drag Anything, Anytime! [88.12304235156591]
We propose stReaming drag-oriEnted interactiVe vidEo manipuLation (REVEL), a new task that enables users to modify generated videos anytime on anything via fine-grained, interactive drag. Our method can be seamlessly integrated into existing autoregressive video diffusion models.
arXiv Detail & Related papers (2025-10-03T22:38:35Z) - DragFlow: Unleashing DiT Priors with Region Based Supervision for Drag Editing [19.031261008813644]
This work proposes DragFlow, the first framework to effectively harness FLUX's rich prior for drag-based editing. DragFlow introduces a region-based editing paradigm in which affine transformations enable richer and more consistent feature supervision. Experiments on DragBench-DR and ReD Bench show that DragFlow surpasses both point-based and region-based baselines.
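As a rough illustration of region-level supervision via affine transformations (assumed names, not DragFlow's code), one can warp a feature map with a user-specified affine matrix to obtain a region-level target:

```python
# Hedged sketch: warp features by an affine transform to serve as a
# region-level supervision target. Names and exact usage are assumptions.
import torch
import torch.nn.functional as F


def warp_region_features(features: torch.Tensor, theta: torch.Tensor) -> torch.Tensor:
    """features: (1, C, H, W); theta: (1, 2, 3) affine matrix in normalized
    coordinates that maps output locations back to input locations."""
    grid = F.affine_grid(theta, list(features.shape), align_corners=False)
    return F.grid_sample(features, grid, align_corners=False)


# Example: the identity transform leaves the features unchanged.
feats = torch.randn(1, 4, 64, 64)
theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
target = warp_region_features(feats, theta)
```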
arXiv Detail & Related papers (2025-10-02T17:39:13Z) - TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation [51.72432192816058]
We propose a unified diffusion-based framework for joint drag-text image editing. Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, dynamically balancing the influence of drag and text conditions during denoising.
arXiv Detail & Related papers (2025-09-26T05:39:03Z) - LazyDrag: Enabling Stable Drag-Based Editing on Multi-Modal Diffusion Transformers via Explicit Correspondence [31.686266704795273]
We introduce LazyDrag, the first drag-based image editing method for Multi-Modal Diffusion Transformers. LazyDrag directly eliminates the reliance on implicit point matching. It unifies precise geometric control with text guidance, enabling complex edits that were previously out of reach.
arXiv Detail & Related papers (2025-09-15T17:59:47Z) - CLIPDrag: Combining Text-based and Drag-based Instructions for Image Editing [9.398831289389749]
We propose CLIPDrag, a novel image editing method that combines text and drag signals for precise and ambiguity-free manipulations. CLIPDrag outperforms existing drag-only and text-only methods.
arXiv Detail & Related papers (2024-10-04T02:46:09Z) - RegionDrag: Fast Region-Based Image Editing with Diffusion Models [14.65208340413507]
RegionDrag is a copy-and-paste dragging method that allows users to express their editing instructions in the form of handle and target regions.
RegionDrag completes an edit on a 512x512 image in under 2 seconds, more than 100x faster than DragDiffusion.
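The copy-and-paste idea can be pictured as transplanting latent features from the handle region to the target region in a single pass, rather than running iterative motion supervision, which is where the speed comes from. The following sketch uses assumed names and is not RegionDrag's actual API:

```python
# Illustrative sketch (assumed names, not RegionDrag's code): copy latent
# features from the handle region to the target region in a single pass.
import torch


def copy_paste_drag(latent: torch.Tensor,
                    handle_mask: torch.Tensor,
                    target_mask: torch.Tensor) -> torch.Tensor:
    """latent: (C, H, W); masks: (H, W) boolean with equal pixel counts.
    Features under the handle mask are pasted under the target mask."""
    assert handle_mask.sum() == target_mask.sum(), "regions must match in size"
    edited = latent.clone()
    # Gather handle features and scatter them into the target locations.
    edited[:, target_mask] = latent[:, handle_mask]
    return edited
```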
arXiv Detail & Related papers (2024-07-25T17:59:13Z) - LightningDrag: Lightning Fast and Accurate Drag-based Image Editing Emerging from Videos [101.59710862476041]
We present LightningDrag, a rapid approach enabling high-quality drag-based image editing in 1 second.
Unlike most previous methods, we redefine drag-based editing as a conditional generation task.
Our approach can significantly outperform previous methods in terms of accuracy and consistency.
arXiv Detail & Related papers (2024-05-22T15:14:00Z) - ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model.
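A Region-IoU selection can be pictured as scoring each candidate segmentation mask by its overlap with the edit region and keeping the best match; the helper names below are assumptions for illustration, not ZONE's implementation:

```python
# Hypothetical illustration of Region-IoU layer selection.
import numpy as np


def region_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU between two boolean masks of identical shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / float(union) if union > 0 else 0.0


def select_layer(edit_region: np.ndarray, candidate_masks: list[np.ndarray]) -> int:
    """Return the index of the candidate segmentation mask with the highest IoU."""
    scores = [region_iou(edit_region, m) for m in candidate_masks]
    return int(np.argmax(scores))
```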
arXiv Detail & Related papers (2023-12-28T02:54:34Z) - RIGID: Recurrent GAN Inversion and Editing of Real Face Videos [73.97520691413006]
GAN inversion is indispensable for applying the powerful editability of GANs to real images.
Existing methods invert video frames individually, often leading to undesired inconsistent results over time.
We propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID).
Our framework learns the inherent coherence between input frames in an end-to-end manner.
arXiv Detail & Related papers (2023-08-11T12:17:24Z) - FreeDrag: Feature Dragging for Reliable Point-based Image Editing [16.833998026980087]
We propose FreeDrag, a feature dragging methodology designed to free the burden on point tracking.
FreeDrag incorporates two key designs: template features with adaptive updating and line search with backtracking.
Our approach significantly outperforms existing methods, offering reliable point-based editing even in various complex scenarios.
arXiv Detail & Related papers (2023-07-10T16:37:46Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z) - Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold [79.94300820221996]
DragGAN is a new way of controlling generative adversarial networks (GANs).
DragGAN allows anyone to deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc.
Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking.
arXiv Detail & Related papers (2023-05-18T13:41:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.