Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner
- URL: http://arxiv.org/abs/2406.00432v2
- Date: Tue, 22 Oct 2024 07:59:51 GMT
- Title: Localize, Understand, Collaborate: Semantic-Aware Dragging via Intention Reasoner
- Authors: Xing Cui, Peipei Li, Zekun Li, Xuannan Liu, Yueying Zou, Zhaofeng He,
- Abstract summary: Current methods typically model this problem as automatically learning "how to drag" through point dragging.
We propose LucidDrag, which shifts the focus from "how to drag" to a "what-then-how" paradigm.
- Score: 8.310002338000954
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Flexible and accurate drag-based editing is a challenging task that has recently garnered significant attention. Current methods typically model this problem as automatically learning "how to drag" through point dragging and often produce a single deterministic estimate, which presents two key limitations: 1) they overlook the inherently ill-posed nature of drag-based editing, where multiple results may correspond to a given input, as illustrated in Fig. 1; 2) they ignore the constraint of image quality, which may lead to unexpected distortion. To alleviate this, we propose LucidDrag, which shifts the focus from "how to drag" to a "what-then-how" paradigm. LucidDrag comprises an intention reasoner and a collaborative guidance sampling mechanism. The former infers several optimal editing strategies, identifying what content to edit and along which semantic direction. Building on these inferred intentions, the latter addresses "how to drag" by collaboratively integrating existing editing guidance with the newly proposed semantic guidance and quality guidance. Specifically, semantic guidance is derived by establishing a semantic editing direction based on the reasoned intentions, while quality guidance is achieved through classifier guidance using an image fidelity discriminator. Both qualitative and quantitative comparisons demonstrate the superiority of LucidDrag over previous methods.
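The collaborative guidance sampling described in the abstract follows the familiar classifier-guidance recipe: several gradient terms are folded into the predicted noise at each denoising step. Below is a minimal, hypothetical sketch of that idea; the callables (`denoiser`, `edit_energy`, `fidelity_disc`), the `semantic_direction` tensor, and the weights are illustrative assumptions, not the authors' actual interfaces.

```python
import torch

def collaborative_guidance_step(z_t, t, denoiser, edit_energy,
                                semantic_direction, fidelity_disc,
                                w_edit=1.0, w_sem=0.5, w_qual=0.1):
    """One guided denoising step combining editing, semantic, and quality terms."""
    z_t = z_t.detach().requires_grad_(True)

    # Base noise prediction eps_theta(z_t, t); no gradient needed through it here.
    with torch.no_grad():
        eps = denoiser(z_t, t)

    # 1) Editing guidance: gradient of a drag-editing energy to be minimized.
    g_edit = torch.autograd.grad(edit_energy(z_t).sum(), z_t)[0]

    # 2) Semantic guidance: gradient of the alignment between the latent and the
    #    reasoned semantic edit direction (to be maximized).
    g_sem = torch.autograd.grad((z_t * semantic_direction).sum(), z_t)[0]

    # 3) Quality guidance: classifier guidance from an image-fidelity score
    #    (higher means more realistic, to be maximized).
    g_qual = torch.autograd.grad(fidelity_disc(z_t, t).sum(), z_t)[0]

    # Classifier-guidance sign convention: add gradients of energies we want to
    # decrease, subtract gradients of scores we want to increase; timestep-dependent
    # scaling is folded into the weights for brevity.
    return (eps + w_edit * g_edit - w_sem * g_sem - w_qual * g_qual).detach()
```

In practice the returned noise estimate would replace the model's prediction inside a standard DDIM/DDPM update, so the guidance terms steer every step of the sampling trajectory rather than acting as a single post-hoc correction.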
Related papers
- AdaptiveDrag: Semantic-Driven Dragging on Diffusion-Based Image Editing [14.543341303789445]
We propose a novel mask-free point-based image editing method, AdaptiveDrag, which generates images that better align with user intent.
To ensure a comprehensive connection between the input image and the drag process, we have developed a semantic-driven optimization.
Building on these effective designs, our method delivers superior generation results using only the single input image and the handle-target point pairs.
arXiv Detail & Related papers (2024-10-16T15:59:02Z) - Combing Text-based and Drag-based Editing for Precise and Flexible Image Editing [9.398831289389749]
We propose CLIPDrag, a novel image editing method that combines text and drag signals for precise and ambiguity-free manipulations.
CLIPDrag outperforms existing methods that rely on drag signals or text alone.
arXiv Detail & Related papers (2024-10-04T02:46:09Z) - Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance [62.15866177242207]
We show that through constructing a subject-agnostic condition, one could obtain outputs consistent with both the given subject and input text prompts.
Our approach is conceptually simple and requires only minimal code modifications, but leads to substantial quality improvements.
arXiv Detail & Related papers (2024-05-02T15:03:41Z) - StableDrag: Stable Dragging for Point-based Image Editing [24.924112878074336]
Point-based image editing has attracted remarkable attention since the emergence of DragGAN.
Recently, DragDiffusion further pushes forward the generative quality via adapting this dragging technique to diffusion models.
We build a stable and precise drag-based editing framework, coined StableDrag, by designing a discriminative point tracking method and a confidence-based latent enhancement strategy for motion supervision.
arXiv Detail & Related papers (2024-03-07T12:11:02Z) - MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing [90.06041718086317]
We propose a unified Multi-alignment Diffusion, dubbed MagDiff, for both high-fidelity video generation and editing.
The proposed MagDiff introduces three types of alignments, including subject-driven alignment, adaptive prompts alignment, and high-fidelity alignment.
arXiv Detail & Related papers (2023-11-29T03:36:07Z) - FreeDrag: Feature Dragging for Reliable Point-based Image Editing [16.833998026980087]
We propose FreeDrag, a feature dragging methodology designed to relieve the burden of point tracking.
FreeDrag incorporates two key designs: adaptively updated template features and line search with backtracking.
Our approach significantly outperforms pre-existing methodologies, offering reliable point-based editing even in various complex scenarios.
arXiv Detail & Related papers (2023-07-10T16:37:46Z) - DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing [94.24479528298252]
DragGAN is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision.
By harnessing large-scale pretrained diffusion models, we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images.
We present a challenging benchmark dataset called DragBench to evaluate the performance of interactive point-based image editing methods.
arXiv Detail & Related papers (2023-06-26T06:04:09Z) - Drag Your GAN: Interactive Point-based Manipulation on the Generative Image Manifold [79.94300820221996]
DragGAN is a new way of controlling generative adversarial networks (GANs).
DragGAN allows anyone to deform an image with precise control over where pixels go, thus manipulating the pose, shape, expression, and layout of diverse categories such as animals, cars, humans, landscapes, etc.
Both qualitative and quantitative comparisons demonstrate the advantage of DragGAN over prior approaches in the tasks of image manipulation and point tracking.
arXiv Detail & Related papers (2023-05-18T13:41:25Z) - StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [86.92711729969488]
We exploit the capabilities of pretrained diffusion models for image editing.
Existing methods either finetune the model or invert the image into the latent space of the pretrained model.
They suffer from two problems: unsatisfying results for selected regions, and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z) - Towards Counterfactual Image Manipulation via CLIP [106.94502632502194]
Existing methods can achieve realistic editing of different visual attributes such as age and gender of facial images.
We investigate this problem in a text-driven manner with Contrastive Language-Image Pretraining (CLIP).
We design a novel contrastive loss that exploits predefined CLIP-space directions to guide the editing toward desired directions from different perspectives (a generic directional CLIP-space loss of this kind is sketched after this list).
arXiv Detail & Related papers (2022-07-06T17:02:25Z) - Gradient-guided Unsupervised Text Style Transfer via Contrastive Learning [6.799826701166569]
We propose a gradient-guided model based on a contrastive paradigm for text style transfer, to explicitly gather sentences with similar semantics.
Experiments on two datasets show the effectiveness of our proposed approach compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-01-23T12:45:00Z)
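As referenced in the counterfactual CLIP-editing entry above, editing directions can be expressed in CLIP embedding space. The following is a generic sketch of a directional CLIP-space loss under assumed interfaces (any encoder exposing `encode_image`, with source and target text embeddings precomputed); it illustrates the idea rather than reproducing that paper's exact contrastive formulation.

```python
import torch
import torch.nn.functional as F

def directional_clip_loss(clip_model, preprocess, src_img, edited_img,
                          src_text_emb, tgt_text_emb):
    """Encourage the image edit direction to follow a predefined CLIP text direction."""
    # Embed the original and edited images; the source embedding needs no gradient.
    with torch.no_grad():
        e_src = clip_model.encode_image(preprocess(src_img))
    e_edit = clip_model.encode_image(preprocess(edited_img))

    # Direction of the edit in image-embedding space vs. the text-defined direction.
    img_dir = F.normalize(e_edit - e_src, dim=-1)
    txt_dir = F.normalize(tgt_text_emb - src_text_emb, dim=-1)

    # 1 - cosine similarity: zero when the edit moves exactly along the text direction.
    return 1.0 - (img_dir * txt_dir).sum(dim=-1).mean()
```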