Multimodal Garment Designer: Human-Centric Latent Diffusion Models for
Fashion Image Editing
- URL: http://arxiv.org/abs/2304.02051v2
- Date: Wed, 23 Aug 2023 12:45:27 GMT
- Title: Multimodal Garment Designer: Human-Centric Latent Diffusion Models for
Fashion Image Editing
- Authors: Alberto Baldrati, Davide Morelli, Giuseppe Cartella, Marcella Cornia,
Marco Bertini, Rita Cucchiara
- Abstract summary: We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images.
We tackle this problem by proposing a new architecture based on latent diffusion models.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
- Score: 40.70752781891058
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Fashion illustration is used by designers to communicate their vision and to
bring the design idea from conceptualization to realization, showing how
clothes interact with the human body. In this context, computer vision can
be used to improve the fashion design process. Unlike previous works
that mainly focused on the virtual try-on of garments, we propose the task of
multimodal-conditioned fashion image editing, guiding the generation of
human-centric fashion images by following multimodal prompts, such as text,
human body poses, and garment sketches. We tackle this problem by proposing a
new architecture based on latent diffusion models, an approach that has not
been used before in the fashion domain. Given the lack of existing datasets
suitable for the task, we also extend two existing fashion datasets, namely
Dress Code and VITON-HD, with multimodal annotations collected in a
semi-automatic manner. Experimental results on these new datasets demonstrate
the effectiveness of our proposal, both in terms of realism and coherence with
the given multimodal inputs. Source code and collected multimodal annotations
are publicly available at:
https://github.com/aimagelab/multimodal-garment-designer.
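The approach described above conditions a latent diffusion model on several modalities at once (text, body pose, garment sketch). As a rough illustration of what such multimodal conditioning can look like, below is a minimal, self-contained PyTorch sketch in which spatial conditions are concatenated to the noisy latent and a pooled text embedding is injected into the denoiser. This is not the authors' implementation (see the linked repository for that); names such as MultimodalDenoiser, pose_map, and sketch_map are hypothetical placeholders.

```python
import torch

class MultimodalDenoiser(torch.nn.Module):
    """Toy stand-in for a conditioned U-Net: predicts noise from a noisy latent,
    spatial conditions (pose/sketch maps), and a pooled text embedding."""
    def __init__(self, latent_ch=4, cond_ch=2, text_dim=768):
        super().__init__()
        # Spatial conditions are concatenated to the noisy latent along channels.
        self.net = torch.nn.Conv2d(latent_ch + cond_ch, latent_ch, kernel_size=3, padding=1)
        self.text_proj = torch.nn.Linear(text_dim, latent_ch)

    def forward(self, z_t, t, text_emb, pose_map, sketch_map):
        spatial = torch.cat([z_t, pose_map, sketch_map], dim=1)
        eps = self.net(spatial)
        # Inject the pooled text embedding as a per-channel bias.
        return eps + self.text_proj(text_emb)[:, :, None, None]

@torch.no_grad()
def denoise_step(model, z_t, t, alpha_bar, text_emb, pose_map, sketch_map):
    """One deterministic (DDIM-style, eta=0) reverse diffusion step."""
    eps = model(z_t, t, text_emb, pose_map, sketch_map)
    z0_hat = (z_t - (1 - alpha_bar[t]).sqrt() * eps) / alpha_bar[t].sqrt()
    if t == 0:
        return z0_hat
    return alpha_bar[t - 1].sqrt() * z0_hat + (1 - alpha_bar[t - 1]).sqrt() * eps

# Toy usage: start from Gaussian noise in latent space and denoise.
model = MultimodalDenoiser()
T = 50
alpha_bar = torch.linspace(0.99, 0.01, T)   # toy noise schedule
z = torch.randn(1, 4, 32, 32)               # noisy latent
text_emb = torch.randn(1, 768)              # e.g. a pooled CLIP text embedding
pose_map = torch.zeros(1, 1, 32, 32)        # rasterized body-pose map (placeholder)
sketch_map = torch.zeros(1, 1, 32, 32)      # garment sketch map (placeholder)
for t in reversed(range(T)):
    z = denoise_step(model, z, t, alpha_bar, text_emb, pose_map, sketch_map)
```

In a real system of this kind, the denoiser would typically be a U-Net with cross-attention over token-level text embeddings, and the pose and sketch maps would be resized to the latent resolution before concatenation; this sketch only shows the overall conditioning pattern.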
Related papers
- UniFashion: A Unified Vision-Language Model for Multimodal Fashion Retrieval and Generation [29.489516715874306]
We present UniFashion, a unified framework that simultaneously tackles the challenges of multimodal generation and retrieval tasks within the fashion domain.
Our model significantly outperforms previous single-task state-of-the-art models across diverse fashion tasks.
arXiv Detail & Related papers (2024-08-21T03:17:20Z) - FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion [11.646594594565098]
This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models.
We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data.
arXiv Detail & Related papers (2024-04-26T14:59:42Z) - Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
This paper tackles the task of multimodal-conditioned fashion image editing.
Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures.
arXiv Detail & Related papers (2024-03-21T20:43:10Z) - Quality and Quantity: Unveiling a Million High-Quality Images for Text-to-Image Synthesis in Fashion Design [14.588884182004277]
We present the Fashion-Diffusion dataset, a product of multiple years' rigorous effort.
The dataset comprises over a million high-quality fashion images, paired with detailed text descriptions.
To foster standardization in the T2I-based fashion design field, we propose a new benchmark for evaluating the performance of fashion design models.
arXiv Detail & Related papers (2023-11-19T06:43:11Z) - Fashion Matrix: Editing Photos by Just Talking [66.83502497764698]
We develop a hierarchical AI system called Fashion Matrix dedicated to editing photos by just talking.
Fashion Matrix employs Large Language Models (LLMs) as its foundational support and engages in iterative interactions with users.
Visual Foundation Models are leveraged to generate edited images from text prompts and masks, thereby facilitating the automation of fashion editing processes.
arXiv Detail & Related papers (2023-07-25T04:06:25Z) - DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion [63.179505586264014]
We present DreamPose, a diffusion-based method for generating animated fashion videos from still images.
Given an image and a sequence of human body poses, our method synthesizes a video containing both human and fabric motion.
arXiv Detail & Related papers (2023-04-12T17:59:17Z) - FaD-VLP: Fashion Vision-and-Language Pre-training towards Unified
Retrieval and Captioning [66.38951790650887]
Multimodal tasks in the fashion domain have significant potential for e-commerce.
We propose a novel fashion-specific pre-training framework based on weakly-supervised triplets constructed from fashion image-text pairs.
We show the triplet-based tasks are an effective addition to standard multimodal pre-training tasks.
arXiv Detail & Related papers (2022-10-26T21:01:19Z) - M6-Fashion: High-Fidelity Multi-modal Image Generation and Editing [51.033376763225675]
We adapt style prior knowledge and flexibility of multi-modal control into one unified two-stage framework, M6-Fashion, focusing on the practical AI-aided Fashion design.
M6-Fashion utilizes self-correction for the non-autoregressive generation to improve inference speed, enhance holistic consistency, and support various signal controls.
arXiv Detail & Related papers (2022-05-24T01:18:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.