Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing
- URL: http://arxiv.org/abs/2403.14828v2
- Date: Mon, 25 Mar 2024 10:12:46 GMT
- Title: Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing
- Authors: Alberto Baldrati, Davide Morelli, Marcella Cornia, Marco Bertini, Rita Cucchiara
- Abstract summary: This paper tackles the task of multimodal-conditioned fashion image editing.
Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures.
- Score: 40.70752781891058
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fashion illustration is a crucial medium for designers to convey their creative vision and transform design concepts into tangible representations that showcase the interplay between clothing and the human body. In the context of fashion design, computer vision techniques have the potential to enhance and streamline the design process. Departing from prior research primarily focused on virtual try-on, this paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures. To address this problem, we propose extending latent diffusion models to incorporate these multiple modalities and modifying the structure of the denoising network, taking multimodal prompts as input. To condition the proposed architecture on fabric textures, we employ textual inversion techniques and let diverse cross-attention layers of the denoising network attend to textual and texture information, thus incorporating different granularity conditioning details. Given the lack of datasets for the task, we extend two existing fashion datasets, Dress Code and VITON-HD, with multimodal annotations. Experimental evaluations demonstrate the effectiveness of our proposed approach in terms of realism and coherence concerning the provided multimodal inputs.
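To make the conditioning scheme described in the abstract concrete, the sketch below shows one plausible way to wire multimodal prompts into a latent diffusion denoising network: spatial modalities (pose, sketch) are concatenated channel-wise with the noisy latent, while text embeddings and textual-inversion texture embeddings are routed through separate cross-attention layers. This is a minimal illustrative sketch, not the authors' implementation; all class names, argument names, and tensor shapes are assumptions.

```python
# Minimal sketch (assumed names/shapes, not the paper's actual code) of a
# denoising block that takes multimodal prompts as input.
import torch
import torch.nn as nn


class CrossAttention(nn.Module):
    """Single-head cross-attention: spatial tokens attend to a prompt."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)

    def forward(self, x, context):
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return x + attn @ v  # residual connection


class MultimodalDenoiser(nn.Module):
    """Toy denoiser block conditioned on pose, sketch, text, and texture."""

    def __init__(self, latent_ch=4, pose_ch=3, sketch_ch=3, dim=64):
        super().__init__()
        # Spatial conditions are stacked channel-wise with the noisy latent.
        self.proj_in = nn.Conv2d(latent_ch + pose_ch + sketch_ch, dim, 3, padding=1)
        self.attn_text = CrossAttention(dim)     # coarse semantic conditioning
        self.attn_texture = CrossAttention(dim)  # fine-grained texture conditioning
        self.proj_out = nn.Conv2d(dim, latent_ch, 3, padding=1)

    def forward(self, z_t, pose, sketch, text_emb, texture_emb):
        h = self.proj_in(torch.cat([z_t, pose, sketch], dim=1))
        b, c, hh, ww = h.shape
        tokens = h.flatten(2).transpose(1, 2)            # (B, H*W, dim)
        tokens = self.attn_text(tokens, text_emb)        # attend to text tokens
        tokens = self.attn_texture(tokens, texture_emb)  # attend to texture tokens
        h = tokens.transpose(1, 2).reshape(b, c, hh, ww)
        return self.proj_out(h)                          # predicted noise


# Example forward pass with dummy inputs (shapes are assumptions):
model = MultimodalDenoiser()
z_t = torch.randn(1, 4, 32, 32)       # noisy VAE latent
pose = torch.randn(1, 3, 32, 32)      # pose map, resized to latent resolution
sketch = torch.randn(1, 3, 32, 32)    # garment sketch, resized likewise
text_emb = torch.randn(1, 77, 64)     # e.g. CLIP-like text embeddings
texture_emb = torch.randn(1, 8, 64)   # pseudo-word tokens from textual inversion
noise_pred = model(z_t, pose, sketch, text_emb, texture_emb)  # (1, 4, 32, 32)
```

Routing text and texture through separate cross-attention layers mirrors the paper's idea of letting different layers of the denoising network attend to conditioning signals of different granularity.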
Related papers
- DPDEdit: Detail-Preserved Diffusion Models for Multimodal Fashion Image Editing [26.090574235851083]
We introduce a new fashion image editing architecture based on latent diffusion models, called Detail-Preserved Diffusion Models (DPDEdit).
DPDEdit guides the fashion image generation of diffusion models by integrating text prompts, region masks, human pose images, and garment texture images.
To transfer the detail of the given garment texture into the target fashion image, we propose a texture injection and refinement mechanism.
arXiv Detail & Related papers (2024-09-02T09:15:26Z) - UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation [29.489516715874306]
We present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain.
Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks.
arXiv Detail & Related papers (2024-08-21T03:17:20Z) - FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion [11.646594594565098]
This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models.
We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data.
arXiv Detail & Related papers (2024-04-26T14:59:42Z) - CreativeSynth: Creative Blending and Synthesis of Visual Arts based on Multimodal Diffusion [74.44273919041912]
Large-scale text-to-image generative models have made impressive strides, showcasing their ability to synthesize a vast array of high-quality images.
However, adapting these models for artistic image editing presents two significant challenges.
We build CreativeSynth, an innovative unified framework based on a diffusion model that can coordinate multimodal inputs.
arXiv Detail & Related papers (2024-01-25T10:42:09Z) - Hierarchical Fashion Design with Multi-stage Diffusion Models [17.848891542772446]
Cross-modal fashion synthesis and editing offer intelligent support to fashion designers.
Current diffusion models demonstrate commendable stability and controllability in image synthesis.
We propose HieraFashDiff, a novel fashion design method using a shared multi-stage diffusion model.
arXiv Detail & Related papers (2024-01-15T03:38:57Z) - Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images.
We tackle this problem by proposing a new architecture based on latent diffusion models.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z) - Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks.
Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix.
Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z) - FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show that the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z) - Multimodal Image Synthesis and Editing: The Generative AI Era [131.9569600472503]
Multimodal image synthesis and editing has become a hot research topic in recent years.
We comprehensively contextualize recent advances in multimodal image synthesis and editing.
We describe benchmark datasets and evaluation metrics as well as corresponding experimental results.
arXiv Detail & Related papers (2021-12-27T10:00:16Z)