Combining Text-based and Drag-based Editing for Precise and Flexible Image Editing
- URL: http://arxiv.org/abs/2410.03097v1
- Date: Fri, 4 Oct 2024 02:46:09 GMT
- Title: Combining Text-based and Drag-based Editing for Precise and Flexible Image Editing
- Authors: Ziqi Jiang, Zhen Wang, Long Chen
- Abstract summary: We propose CLIPDrag, a novel image editing method that combines text and drag signals for precise and ambiguity-free manipulations.
CLIPDrag outperforms existing drag-only and text-only methods.
- Score: 9.398831289389749
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Precise and flexible image editing remains a fundamental challenge in computer vision. Based on the areas they modify, most editing methods fall into two main types: global editing and local editing. In this paper, we examine the two most common editing approaches (i.e., text-based editing and drag-based editing) and analyze their drawbacks. Specifically, text-based methods often fail to describe the desired modifications precisely, while drag-based methods suffer from ambiguity. To address these issues, we propose CLIPDrag, a novel image editing method that is the first to combine text and drag signals for precise and ambiguity-free manipulations on diffusion models. To fully leverage the two signals, we treat text signals as global guidance and drag points as local information. We then introduce a novel global-local motion supervision method that integrates text signals into existing drag-based methods by adapting a pre-trained vision-language model such as CLIP. Furthermore, we address the slow convergence of CLIPDrag by presenting a fast point-tracking method that enforces drag points to move in the correct directions. Extensive experiments demonstrate that CLIPDrag outperforms existing drag-only and text-only methods.
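As a rough illustration of this global-local idea, the hedged PyTorch sketch below pairs a local drag loss (pulling the feature at each handle point toward its target point) with a global CLIP text-similarity term. Every name here (local_drag_loss, clip_sim, the toy tensors) is an illustrative stand-in rather than the authors' implementation, and the simple weighted sum is a simplification of the paper's gradient-fusion scheme:

```python
import torch
import torch.nn.functional as F

# Hedged sketch of global-local motion supervision, not the authors' code:
# drag points supply a local feature-matching objective, while the text
# prompt contributes a global CLIP-similarity term.

def local_drag_loss(feat, handle, target):
    """Pull the feature at the handle point toward the (detached) target feature."""
    (hx, hy), (tx, ty) = handle, target
    return F.l1_loss(feat[..., hy, hx], feat[..., ty, tx].detach())

def global_local_loss(feat, image, text_emb, handle, target, clip_sim, lam=0.1):
    """A weighted sum stands in for the paper's more careful gradient fusion."""
    local_term = local_drag_loss(feat, handle, target)
    global_term = 1.0 - clip_sim(image, text_emb)  # maximize text-image similarity
    return local_term + lam * global_term

# Toy stand-ins so the sketch runs end to end.
feat = torch.randn(1, 8, 16, 16, requires_grad=True)    # pretend U-Net features
image = torch.randn(1, 3, 64, 64, requires_grad=True)   # pretend decoded image
text_emb = torch.randn(1, 32)                           # pretend CLIP text embedding
dummy_clip = lambda img, t: torch.tanh(img.mean() * t.mean())  # dummy CLIP score

loss = global_local_loss(feat, image, text_emb, (4, 4), (10, 10), dummy_clip)
loss.backward()  # gradients flow to both the features and the image
```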
Related papers
- DCEdit: Dual-Level Controlled Image Editing via Precisely Localized Semantics [71.78350994830885]
We present a novel approach to improving text-guided image editing using diffusion-based models.
Our method uses visual and textual self-attention to enhance the cross-attention map, which can serve as regional cues to improve editing performance.
To comprehensively compare our method with other DiT-based approaches, we construct the RW-800 benchmark, featuring high-resolution images, long descriptive texts, real-world images, and a new text editing task.
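As a loose sketch of how a self-attention map can sharpen cross-attention into regional cues (a technique several diffusion-editing works share; DCEdit's exact dual-level scheme, which also uses textual self-attention, may differ), one could propagate the cross-attention map through the visual self-attention:

```python
import torch

# Illustrative refinement of cross-attention with self-attention; not DCEdit's code.
def refine_cross_attention(sa: torch.Tensor, ca: torch.Tensor, n_iters: int = 2):
    """sa: (HW, HW) visual self-attention; ca: (HW, T) cross-attention."""
    for _ in range(n_iters):
        ca = sa @ ca  # propagate each text token's relevance across similar pixels
    return ca / ca.sum(dim=0, keepdim=True).clamp_min(1e-8)

sa = torch.softmax(torch.randn(64, 64), dim=-1)  # toy 8x8 feature map, flattened
ca = torch.softmax(torch.randn(64, 8), dim=-1)   # 8 text tokens
regional_cues = refine_cross_attention(sa, ca)   # sharper per-token regions
```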
arXiv Detail & Related papers (2025-03-21T02:14:03Z) - TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
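One plausible reading of such a pseudo-guidance term, sketched below under that assumption, is a guidance-style extrapolation between source- and target-prompt noise predictions; the function name and weighting are illustrative, not the paper's API:

```python
import torch

# Guidance-style extrapolation between source- and target-prompt noise
# predictions; an assumed reading of "pseudo-guidance", not the paper's API.
def pseudo_guidance(eps_src: torch.Tensor, eps_tgt: torch.Tensor, w: float = 1.5):
    """w = 1 reproduces the plain edit; w > 1 strengthens it with no extra model calls."""
    return eps_src + w * (eps_tgt - eps_src)

eps_src = torch.randn(1, 4, 64, 64)  # noise prediction under the source prompt
eps_tgt = torch.randn(1, 4, 64, 64)  # noise prediction under the target prompt
eps = pseudo_guidance(eps_src, eps_tgt, w=2.0)
```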
arXiv Detail & Related papers (2024-08-01T17:27:28Z) - DragText: Rethinking Text Embedding in Point-based Image Editing [3.1923251959845214]
We show that during the progressive editing of an input image in a diffusion model, the text embedding remains constant.
We propose DragText, which optimizes the text embedding in conjunction with the dragging process so that it remains paired with the modified image embedding.
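A minimal sketch of this joint-optimization idea, with toy stand-ins for the U-Net features and the motion-supervision loss (neither is DragText's actual implementation), might look like:

```python
import torch

# Toy stand-ins; not DragText's actual code.
def toy_features(latent, text_emb):
    return latent * text_emb.mean()            # placeholder feature extractor

def toy_drag_loss(feat):
    return feat[..., 4, 4].pow(2).mean()       # placeholder drag objective

latent = torch.randn(1, 4, 16, 16, requires_grad=True)
text_emb = torch.randn(1, 77, 32, requires_grad=True)  # optimized jointly
opt = torch.optim.Adam([latent, text_emb], lr=1e-2)

for _ in range(50):                            # the drag-editing loop
    loss = toy_drag_loss(toy_features(latent, text_emb))
    opt.zero_grad()
    loss.backward()
    opt.step()
```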
arXiv Detail & Related papers (2024-07-25T07:57:55Z) - GenVideo: One-shot Target-image and Shape Aware Video Editing using T2I Diffusion Models [2.362412515574206]
We propose "GenVideo" for editing videos leveraging target-image aware T2I models.
Our approach handles edits with target objects of varying shapes and sizes while maintaining the temporal consistency of the edit.
arXiv Detail & Related papers (2024-04-18T23:25:27Z) - TIP-Editor: An Accurate 3D Editor Following Both Text-Prompts And Image-Prompts [119.84478647745658]
TIP-Editor is a 3D scene editing framework that accepts both text and image prompts, plus a 3D bounding box to specify the editing region.
Experiments have demonstrated that TIP-Editor conducts accurate editing following the text and image prompts in the specified bounding box region.
arXiv Detail & Related papers (2024-01-26T12:57:05Z) - ZONE: Zero-Shot Instruction-Guided Local Editing [56.56213730578504]
We propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE.
We first convert the editing intent from the user-provided instruction into specific image editing regions through InstructPix2Pix.
We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model.
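A minimal sketch of a Region-IoU selection step, assuming an edit region inferred from the instruction and candidate masks from an off-the-shelf segmentation model (all names illustrative, not ZONE's code):

```python
import numpy as np

# Keep the candidate mask that best overlaps the instruction-derived edit region.
def region_iou(a: np.ndarray, b: np.ndarray) -> float:
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return float(inter) / union if union else 0.0

def pick_layer(edit_region, candidate_masks):
    return max(candidate_masks, key=lambda m: region_iou(edit_region, m))

edit_region = np.zeros((32, 32), dtype=bool)
edit_region[8:20, 8:20] = True                            # pretend edit region
masks = [np.random.rand(32, 32) > 0.5 for _ in range(4)]  # pretend segmentation masks
layer_mask = pick_layer(edit_region, masks)               # mask of the extracted layer
```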
arXiv Detail & Related papers (2023-12-28T02:54:34Z) - Text-Driven Image Editing via Learnable Regions [74.45313434129005]
We introduce a method for region-based image editing driven by textual prompts, without the need for user-provided masks or sketches.
We show that this simple approach enables flexible editing that is compatible with current image generation models.
Experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions.
arXiv Detail & Related papers (2023-11-28T02:27:31Z) - FLATTEN: optical FLow-guided ATTENtion for consistent text-to-video editing [65.60744699017202]
We introduce optical flow into the attention module in the diffusion model's U-Net to address the inconsistency issue for text-to-video editing.
Our method, FLATTEN, enforces the patches on the same flow path across different frames to attend to each other in the attention module.
Results on existing text-to-video editing benchmarks show that our proposed method achieves new state-of-the-art performance.
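A hedged sketch of the core constraint: restrict attention so that only patches sharing an optical-flow trajectory attend to each other across frames. The trajectory labels below are an illustrative stand-in for the output of a flow tracker:

```python
import torch

# Attention mask allowing only patches on the same flow trajectory to attend.
def flow_attention_mask(traj_id: torch.Tensor) -> torch.Tensor:
    """traj_id: (F, N) trajectory index per patch; returns an (F*N, F*N) bool mask."""
    flat = traj_id.reshape(-1)                # every patch in every frame
    return flat[:, None] == flat[None, :]     # True where attention is allowed

traj_id = torch.randint(0, 16, (4, 64))       # 4 frames, 64 patches per frame
mask = flow_attention_mask(traj_id)
# Typical use: scores.masked_fill(~mask, float("-inf")) before the softmax.
```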
arXiv Detail & Related papers (2023-10-09T17:59:53Z) - Zero-shot Text-driven Physically Interpretable Face Editing [29.32334174584623]
This paper proposes a novel and physically interpretable method for face editing based on arbitrary text prompts.
Our method can generate physically interpretable face editing results with high identity consistency and image quality.
arXiv Detail & Related papers (2023-08-11T07:20:24Z) - CLIP-Guided StyleGAN Inversion for Text-Driven Real Image Editing [22.40686064568406]
We present CLIPInverter, a new text-driven image editing approach that is able to efficiently and reliably perform multi-attribute changes.
Our method outperforms competing approaches in terms of manipulation accuracy and photo-realism on various domains including human faces, cats, and birds.
arXiv Detail & Related papers (2023-07-17T11:29:48Z) - StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [115.49488548588305]
Significant research effort focuses on exploiting the capabilities of pretrained diffusion models for image editing.
Existing approaches either finetune the model or invert the image into the latent space of the pretrained model.
Both strategies suffer from two problems: unsatisfying results in selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z) - Edit-A-Video: Single Video Editing with Object-Aware Consistency [49.43316939996227]
We propose a video editing framework given only a pretrained text-to-image (TTI) model and a single <text, video> pair, which we term Edit-A-Video.
The framework consists of two stages: (1) inflating the 2D model into a 3D model by appending temporal modules and tuning on the source video, and (2) inverting the source video into noise and editing with the target text prompt and attention map injection.
We present extensive experimental results over various types of text and videos, and demonstrate the superiority of the proposed method compared to baselines in terms of background consistency, text alignment, and video editing quality.
arXiv Detail & Related papers (2023-03-14T14:35:59Z) - Shape-aware Text-driven Layered Video Editing [39.56765973770167]
We present a shape-aware, text-driven video editing method to handle shape changes.
We first propagate the deformation field between the input and edited keyframe to all frames.
We then leverage a pre-trained text-conditioned diffusion model as guidance for refining shape distortion and completing unseen regions.
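As an illustration of applying a propagated deformation field to one frame (a hedged stand-in, not the paper's pipeline), torch.nn.functional.grid_sample can realize the warp; the field itself and all names here are illustrative:

```python
import torch
import torch.nn.functional as F

# Apply a deformation field to a frame via grid sampling.
def warp(frame: torch.Tensor, deform: torch.Tensor) -> torch.Tensor:
    """frame: (B, C, H, W); deform: (B, H, W, 2) offsets in normalized [-1, 1] coords."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2)  # grid_sample wants (x, y)
    return F.grid_sample(frame, base + deform, align_corners=True)

frame = torch.rand(1, 3, 64, 64)
deform = 0.02 * torch.randn(1, 64, 64, 2)  # small random field as a stand-in
warped = warp(frame, deform)               # deformation applied to the frame
```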
arXiv Detail & Related papers (2023-01-30T18:41:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.