InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing
- URL: http://arxiv.org/abs/2510.08181v1
- Date: Thu, 09 Oct 2025 13:06:49 GMT
- Title: InstructUDrag: Joint Text Instructions and Object Dragging for Interactive Image Editing
- Authors: Haoran Yu, Yi Shi
- Abstract summary: InstructUDrag is a diffusion-based framework that combines text instructions with object dragging. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
- Score: 6.95116998047811
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Text-to-image diffusion models have shown great potential for image editing, with techniques such as text-based and object-dragging methods emerging as key approaches. However, each of these methods has inherent limitations: text-based methods struggle with precise object positioning, while object dragging methods are confined to static relocation. To address these issues, we propose InstructUDrag, a diffusion-based framework that combines text instructions with object dragging, enabling simultaneous object dragging and text-based image editing. Our framework treats object dragging as an image reconstruction process, divided into two synergistic branches. The moving-reconstruction branch utilizes energy-based gradient guidance to move objects accurately, refining cross-attention maps to enhance relocation precision. The text-driven editing branch shares gradient signals with the reconstruction branch, ensuring consistent transformations and allowing fine-grained control over object attributes. We also employ DDPM inversion and inject prior information into noise maps to preserve the structure of moved objects. Extensive experiments demonstrate that InstructUDrag facilitates flexible, high-fidelity image editing, offering both precision in object relocation and semantic control over image content.
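As a rough illustration of the moving-reconstruction branch described above, the following is a minimal, hypothetical PyTorch sketch of energy-based gradient guidance: an energy term measures how well the target region's features match the source object's, and its gradient nudges the latent at each denoising step. The `energy` and `guide_step` functions, the toy feature extractor, the masks, and the step size are all assumptions made for illustration, not the authors' code.

```python
# Minimal sketch of energy-based gradient guidance for object dragging.
# All names below (energy, guide_step, the toy feature extractor) are
# illustrative assumptions, not InstructUDrag's implementation.
import torch
import torch.nn.functional as F

def energy(latent, src_mask, tgt_mask, features):
    """Toy energy: feature mismatch between the object's source region
    and its target region; low energy means the object has 'moved'."""
    feat = features(latent)                          # (B, C, H, W)
    src = (feat * src_mask).sum(dim=(2, 3)) / src_mask.sum()
    tgt = (feat * tgt_mask).sum(dim=(2, 3)) / tgt_mask.sum()
    return F.mse_loss(tgt, src.detach())

def guide_step(latent, src_mask, tgt_mask, features, step_size=0.1):
    """One guidance step: push the latent down the energy gradient so
    the target region's features match the source object's features."""
    latent = latent.detach().requires_grad_(True)
    e = energy(latent, src_mask, tgt_mask, features)
    grad, = torch.autograd.grad(e, latent)
    return (latent - step_size * grad).detach()

# Toy stand-in for a diffusion U-Net feature map.
features = torch.nn.Conv2d(4, 8, 3, padding=1)
latent = torch.randn(1, 4, 64, 64)
src_mask = torch.zeros(1, 1, 64, 64); src_mask[..., 10:20, 10:20] = 1
tgt_mask = torch.zeros(1, 1, 64, 64); tgt_mask[..., 40:50, 40:50] = 1
latent = guide_step(latent, src_mask, tgt_mask, features)
```

In the full method this step would sit inside the denoising loop and share its gradient signal with the text-driven editing branch; the sketch only shows the relocation pressure itself.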
Related papers
- POCI-Diff: Position Objects Consistently and Interactively with 3D-Layout Guided Diffusion [46.97254555348757]
We propose a diffusion-based approach for Text-to-Image (T2I) generation with consistent and interactive 3D layout control and editing.
We introduce a framework for Positioning Objects Consistently and Interactively (POCI-Diff).
Our method enables explicit per-object semantic control by binding individual text descriptions to specific 3D bounding boxes.
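As a tiny illustration of binding per-object text descriptions to 3D bounding boxes, here is a hypothetical data-structure sketch; `ObjectSpec` and its fields are assumptions for illustration, not POCI-Diff's actual interface.

```python
# Hypothetical sketch of per-object layout conditioning: each text
# description is bound to one 3D bounding box that a layout-guided
# diffusion model could condition on.
from dataclasses import dataclass

@dataclass
class ObjectSpec:
    prompt: str                          # per-object text description
    center: tuple[float, float, float]   # 3D box center (x, y, z)
    size: tuple[float, float, float]     # 3D box extents (w, h, d)
    yaw: float = 0.0                     # rotation about the up axis

layout = [
    ObjectSpec("a red armchair", (0.5, 0.0, 2.0), (0.8, 1.0, 0.8)),
    ObjectSpec("a wooden coffee table", (-0.4, 0.0, 1.5), (1.2, 0.5, 0.6)),
]
# Editing the layout (e.g. moving a box) leaves each prompt bound to
# its object, which is what enables consistent per-object control.
```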
arXiv Detail & Related papers (2026-01-20T15:13:43Z)
- ContextDrag: Precise Drag-Based Image Editing via Context-Preserving Token Injection and Position-Consistent Attention [81.12932992203885]
We introduce ContextDrag, a new paradigm for drag-based editing.
By incorporating VAE-encoded features from the reference image, ContextDrag can leverage rich contextual cues and preserve fine-grained details.
arXiv Detail & Related papers (2025-12-09T10:51:45Z)
- TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation [51.72432192816058]
We propose a unified diffusion-based framework for joint drag-text image editing.
Our framework introduces two key innovations: (1) Point-Cloud Deterministic Drag, which enhances latent-space layout control through 3D feature mapping, and (2) Drag-Text Guided Denoising, which dynamically balances the influence of drag and text conditions during denoising.
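As a rough sketch of what "dynamically balancing" drag and text conditions could look like, the snippet below blends two noise predictions with a timestep-dependent weight. The linear schedule and all names (`blended_eps`, `lambda_t`) are assumptions, not TDEdit's actual formulation.

```python
# Hedged sketch: blend drag-guided and text-guided noise predictions
# with a weight that shifts over the denoising trajectory.
import torch

def blended_eps(eps_text, eps_drag, t, T):
    """Early steps (large t) favor drag/layout control; late steps
    favor text-conditioned appearance. lambda_t decays from 1 to 0."""
    lambda_t = t / T
    return lambda_t * eps_drag + (1 - lambda_t) * eps_text

eps_text = torch.randn(1, 4, 64, 64)   # noise pred. under the text prompt
eps_drag = torch.randn(1, 4, 64, 64)   # noise pred. under drag guidance
eps = blended_eps(eps_text, eps_drag, t=800, T=1000)
```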
arXiv Detail & Related papers (2025-09-26T05:39:03Z)
- Move and Act: Enhanced Object Manipulation and Background Integrity for Image Editing [20.01946775715704]
We propose a tuning-free method with only two branches: inversion and editing.
This approach allows users to simultaneously edit the object's action and control the generation position of the edited object.
Impressive image editing results and quantitative evaluation demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-07-25T08:00:49Z)
- DragText: Rethinking Text Embedding in Point-based Image Editing [3.4248731707266264]
Point-based image editing enables accurate and flexible control through content dragging.
The role of text embedding during the editing process has not been thoroughly investigated.
We propose DragText, which optimizes text embedding in conjunction with the dragging process to pair with the modified image embedding.
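A hedged sketch of the idea of optimizing the text embedding in step with dragging: treat the embedding as a learnable tensor and update it so the text-conditioned prediction keeps pace with the edited latent. The toy predictor and loss below are illustrative assumptions, not DragText's implementation.

```python
# Sketch: update a learnable text embedding so it stays paired with
# the latent produced by a drag step. ToyCond stands in for a
# text-conditioned diffusion model; all names are hypothetical.
import torch
import torch.nn.functional as F

class ToyCond(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = torch.nn.Linear(768, 4 * 64 * 64)
    def forward(self, text_emb):
        return self.proj(text_emb).view(-1, 4, 64, 64)

model = ToyCond()
edited_latent = torch.randn(1, 4, 64, 64)          # latent after a drag step
text_emb = torch.randn(1, 768, requires_grad=True) # learnable text embedding
opt = torch.optim.Adam([text_emb], lr=1e-2)

for _ in range(10):                                # a few embedding updates
    opt.zero_grad()
    loss = F.mse_loss(model(text_emb), edited_latent)
    loss.backward()
    opt.step()
```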
arXiv Detail & Related papers (2024-07-25T07:57:55Z)
- Diffree: Text-Guided Shape Free Object Inpainting with Diffusion Model [81.96954332787655]
We introduce Diffree, a Text-to-Image (T2I) model that facilitates text-guided object addition with only text control.
In experiments, Diffree adds new objects with a high success rate while maintaining background consistency, spatial appropriateness, and object relevance and quality.
arXiv Detail & Related papers (2024-07-24T03:58:58Z)
- DiffUHaul: A Training-Free Method for Object Dragging in Images [78.93531472479202]
We propose a training-free method, dubbed DiffUHaul, for the object dragging task.
We first apply attention masking in each denoising step to make the generation more disentangled across different objects.
In the early denoising steps, we interpolate the attention features between source and target images to smoothly fuse new layouts with the original appearance.
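A minimal sketch of what interpolating attention features between the source and target passes over the early denoising steps could look like; the linear schedule and names are assumptions made for illustration, not the paper's exact procedure.

```python
# Hedged sketch: fuse source-image attention features into the target
# layout during the first few denoising steps, then hand over fully.
import torch

def fuse_attention(attn_src, attn_tgt, step, n_early):
    """Linearly shift from source attention features toward the target
    layout over the first n_early denoising steps."""
    if step >= n_early:
        return attn_tgt                    # later steps: pure target
    alpha = step / n_early                 # 0 -> source, 1 -> target
    return (1 - alpha) * attn_src + alpha * attn_tgt

attn_src = torch.randn(8, 4096, 64)        # per-head attention features
attn_tgt = torch.randn(8, 4096, 64)
fused = fuse_attention(attn_src, attn_tgt, step=3, n_early=10)
```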
arXiv Detail & Related papers (2024-06-03T17:59:53Z)
- DragonDiffusion: Enabling Drag-style Manipulation on Diffusion Models [66.43179841884098]
We propose a novel image editing method, DragonDiffusion, enabling Drag-style manipulation on Diffusion models.
Our method achieves various editing modes for the generated or real images, such as object moving, object resizing, object appearance replacement, and content dragging.
arXiv Detail & Related papers (2023-07-05T16:43:56Z)
- LayerDiffusion: Layered Controlled Image Editing with Diffusion Models [5.58892860792971]
LayerDiffusion is a semantic-based layered controlled image editing method.
We leverage a large-scale text-to-image model and employ a layered controlled optimization strategy.
Experimental results demonstrate the effectiveness of our method in generating highly coherent images.
arXiv Detail & Related papers (2023-05-30T01:26:41Z)
- SIEDOB: Semantic Image Editing by Disentangling Object and Background [5.149242555705579]
We propose a novel paradigm for semantic image editing, SIEDOB.
Its core idea is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds.
We conduct extensive experiments on the Cityscapes and ADE20K-Room datasets and show that our method remarkably outperforms the baselines.
arXiv Detail & Related papers (2023-03-23T06:17:23Z)
- ManiTrans: Entity-Level Text-Guided Image Manipulation via Token-wise Semantic Alignment and Generation [97.36550187238177]
We study a novel task on text-guided image manipulation on the entity level in the real world.
The task imposes three basic requirements: (1) to edit the entity consistent with the text descriptions, (2) to preserve the text-irrelevant regions, and (3) to merge the manipulated entity into the image naturally.
Our framework incorporates a semantic alignment module to locate the image regions to be manipulated, and a semantic loss to help align the relationship between the vision and language.
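A possible reading of such a semantic loss, sketched under assumptions: pool the manipulated region's visual features and pull them toward the text embedding with a cosine objective. The function and tensor shapes below are hypothetical, not ManiTrans' code.

```python
# Hedged sketch of a vision-language alignment loss in the spirit
# described above; names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def semantic_loss(vision_feats, region_mask, text_emb):
    """vision_feats: (B, C, H, W) image features; region_mask:
    (B, 1, H, W) soft mask over the entity; text_emb: (B, C) text
    embedding in the same space. Returns 1 - cosine similarity."""
    pooled = (vision_feats * region_mask).sum(dim=(2, 3)) \
             / region_mask.sum(dim=(2, 3)).clamp(min=1e-6)
    return (1 - F.cosine_similarity(pooled, text_emb, dim=-1)).mean()

feats = torch.randn(2, 512, 16, 16)
mask = torch.rand(2, 1, 16, 16)
text = torch.randn(2, 512)
print(semantic_loss(feats, mask, text))
```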
arXiv Detail & Related papers (2022-04-09T09:01:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.