MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion
- URL: http://arxiv.org/abs/2412.20062v2
- Date: Wed, 15 Jan 2025 15:53:13 GMT
- Title: MADiff: Text-Guided Fashion Image Editing with Mask Prediction and Attention-Enhanced Diffusion
- Authors: Zechao Zhan, Dehong Gao, Jinxia Zhang, Jiale Huang, Yang Hu, Xin Wang
- Abstract summary: The MADiff model is proposed to more accurately identify the editing region.
The Attention-Enhanced Diffusion Model is proposed to strengthen the editing magnitude.
Our proposed method can accurately predict the mask of the editing region and significantly enhance the editing magnitude in fashion image editing.
- Score: 9.149799210311468
- Abstract: Text-guided image editing models have achieved great success in the general domain. However, directly applying these models to the fashion domain may encounter two issues: (1) inaccurate localization of the editing region; (2) weak editing magnitude. To address these issues, the MADiff model is proposed. Specifically, to more accurately identify the editing region, MaskNet is proposed, in which the foreground region, DensePose, and mask prompts from a large language model are fed into a lightweight UNet to predict the mask of the editing region. To strengthen the editing magnitude, the Attention-Enhanced Diffusion Model is proposed, where the noise map, the attention map, and the mask from MaskNet are fed into the proposed Attention Processor to produce a refined noise map. By integrating the refined noise map into the diffusion model, the edited image better aligns with the target prompt. Given the absence of benchmarks in fashion image editing, we constructed a dataset named Fashion-E, comprising 28,390 image-text pairs in the training set and 2,639 image-text pairs covering four types of fashion tasks in the evaluation set. Extensive experiments on Fashion-E demonstrate that, compared to state-of-the-art methods, our proposed method accurately predicts the mask of the editing region and significantly enhances the editing magnitude in fashion image editing.
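As a concrete illustration of the pipeline the abstract describes, below is a minimal PyTorch sketch of the two components: a MaskNet that fuses the foreground region, a DensePose map, and an LLM-derived mask prompt embedding through a lightweight UNet to predict the editing mask, and an Attention Processor that uses the attention map and that mask to refine the noise map. All module names, channel sizes, and the fusion and boosting schemes here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the MADiff pipeline described in the abstract.
# Shapes, conditioning scheme, and the noise-boosting rule are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LightweightUNet(nn.Module):
    """Toy stand-in for the MaskNet backbone: one down/up stage."""

    def __init__(self, in_ch: int, base_ch: int = 32):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, base_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.up = nn.Sequential(
            nn.ConvTranspose2d(base_ch, base_ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(base_ch, 1, 3, padding=1),
        )

    def forward(self, x):
        return self.up(self.down(x))


class MaskNet(nn.Module):
    """Predicts the editing-region mask from the foreground region,
    a DensePose map, and an LLM-derived mask prompt embedding."""

    def __init__(self, prompt_dim: int = 768, base_ch: int = 32):
        super().__init__()
        # 3 (foreground RGB) + 3 (DensePose) + 1 (broadcast prompt channel)
        self.unet = LightweightUNet(in_ch=7, base_ch=base_ch)
        self.prompt_proj = nn.Linear(prompt_dim, 1)

    def forward(self, foreground, densepose, prompt_emb):
        b, _, h, w = foreground.shape
        # Project the LLM mask prompt to a scalar and tile it spatially
        # (one simple conditioning choice among many possible ones).
        p = self.prompt_proj(prompt_emb).view(b, 1, 1, 1).expand(b, 1, h, w)
        x = torch.cat([foreground, densepose, p], dim=1)
        return torch.sigmoid(self.unet(x))  # soft mask in [0, 1]


def attention_processor(noise_map, attn_map, mask, boost: float = 1.5):
    """Refine the noise map: amplify noise where the predicted mask and
    the cross-attention response agree; leave the rest untouched."""
    attn = F.interpolate(attn_map, size=noise_map.shape[-2:], mode="bilinear")
    m = F.interpolate(mask, size=noise_map.shape[-2:], mode="bilinear")
    gate = m * attn  # regions that are both masked and attended
    return noise_map * (1.0 + (boost - 1.0) * gate)


if __name__ == "__main__":
    masknet = MaskNet()
    fg = torch.randn(1, 3, 64, 64)     # foreground region
    dp = torch.randn(1, 3, 64, 64)     # DensePose map
    emb = torch.randn(1, 768)          # LLM mask prompt embedding
    mask = masknet(fg, dp, emb)        # (1, 1, 64, 64)

    noise = torch.randn(1, 4, 32, 32)  # latent noise map
    attn = torch.rand(1, 1, 16, 16)    # cross-attention map
    refined = attention_processor(noise, attn, mask)
    print(mask.shape, refined.shape)
```

The refined noise map would then be fed back into the diffusion sampler in place of the original prediction; the `boost` factor is a stand-in for whatever magnitude-enhancement rule the paper's Attention Processor actually applies.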
Related papers
- PartEdit: Fine-Grained Image Editing using Pre-Trained Diffusion Models [80.98455219375862]
We present the first text-based image editing approach for object parts based on pre-trained diffusion models.
Our approach is preferred by users 77-90% of the time in user studies.
arXiv Detail & Related papers (2025-02-06T13:08:43Z)
- BrushEdit: All-In-One Image Inpainting and Editing [79.55816192146762]
BrushEdit is a novel inpainting-based instruction-guided image editing paradigm.
We devise a system enabling free-form instruction editing by integrating MLLMs and a dual-branch image inpainting model.
Our framework effectively combines MLLMs and inpainting models, achieving superior performance across seven metrics.
arXiv Detail & Related papers (2024-12-13T17:58:06Z)
- DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing [26.090574235851083]
We introduce a new fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit).
DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images.
To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism.
arXiv Detail & Related papers (2024-09-02T09:15:26Z)
- TurboEdit: Text-Based Image Editing Using Few-Step Diffusion Models [53.757752110493215]
We focus on a popular line of text-based editing frameworks: the "edit-friendly" DDPM-noise inversion approach.
We analyze its application to fast sampling methods and categorize its failures into two classes: the appearance of visual artifacts, and insufficient editing strength.
We propose a pseudo-guidance approach that efficiently increases the magnitude of edits without introducing new artifacts.
arXiv Detail & Related papers (2024-08-01T17:27:28Z)
- Towards Efficient Diffusion-Based Image Editing with Instant Attention Masks [43.079272743475435]
In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit).
In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps.
To supplement the existing evaluations of diffusion-based image editing (DIE), we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods.
arXiv Detail & Related papers (2024-01-15T14:25:54Z)
- MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance [28.212908146852197]
We develop MAG-Edit, a training-free, inference-stage optimization method, which enables localized image editing in complex scenarios.
In particular, MAG-Edit optimizes the noise latent feature in diffusion models by maximizing two mask-based cross-attention constraints.
arXiv Detail & Related papers (2023-12-18T17:55:44Z)
- Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training [61.984277261016146]
We propose a CustomNeRF model that unifies text descriptions and reference images as editing prompts.
To tackle the first challenge, we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing.
For the second challenge, we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem.
arXiv Detail & Related papers (2023-12-04T06:25:06Z)
- StyleDiffusion: Prompt-Embedding Inversion for Text-Based Editing [115.49488548588305]
A significant research effort is focused on exploiting the capabilities of pretrained diffusion models for image editing.
These methods either finetune the model or invert the image into the latent space of the pretrained model.
They suffer from two problems: unsatisfying results for selected regions and unexpected changes in non-selected regions.
arXiv Detail & Related papers (2023-03-28T00:16:45Z)
- DiffEdit: Diffusion-based semantic image editing with mask guidance [64.555930158319]
DiffEdit is a method to take advantage of text-conditioned diffusion models for the task of semantic image editing.
Our main contribution is the ability to automatically generate a mask highlighting the regions of the input image that need to be edited, by contrasting the model's predictions under different text prompts (a minimal sketch of this idea follows this list).
arXiv Detail & Related papers (2022-10-20T17:16:37Z)
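Several of the papers above (DiffEdit, InstDiffEdit, MAG-Edit) share the idea of localizing the edit region automatically rather than requiring a user-drawn mask. The sketch below illustrates one such scheme in the DiffEdit style: contrast the diffusion model's noise predictions under the source and target prompts, then threshold the difference map. The function name, the quantile threshold, and the use of raw noise predictions are illustrative assumptions, not any paper's exact procedure.

```python
# Hypothetical DiffEdit-style mask extraction: where the noise predictions
# conditioned on the source and target prompts disagree most, an edit is
# likely needed, so threshold that disagreement into a binary mask.
import torch


def diffedit_style_mask(eps_source: torch.Tensor,
                        eps_target: torch.Tensor,
                        quantile: float = 0.9) -> torch.Tensor:
    """eps_*: noise predictions (B, C, H, W) from the same diffusion step,
    conditioned on the source and target prompts respectively."""
    # Average the absolute disagreement over channels -> (B, 1, H, W).
    diff = (eps_target - eps_source).abs().mean(dim=1, keepdim=True)
    # Per-image threshold at the given quantile of the difference map.
    thresh = torch.quantile(diff.flatten(1), quantile, dim=1).view(-1, 1, 1, 1)
    return (diff >= thresh).float()


if __name__ == "__main__":
    eps_src = torch.randn(2, 4, 32, 32)
    eps_tgt = eps_src + 0.5 * torch.randn(2, 4, 32, 32)
    mask = diffedit_style_mask(eps_src, eps_tgt)
    print(mask.shape, mask.mean().item())  # ~10% of pixels marked editable
```

In practice, methods in this family smooth or binarize the mask and then restrict the denoising updates (or the attention adjustment, in MAG-Edit's case) to the masked region, which is what keeps non-selected regions unchanged.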