DiCTI: Diffusion-based Clothing Designer via Text-guided Input
- URL: http://arxiv.org/abs/2407.03901v1
- Date: Thu, 4 Jul 2024 12:48:36 GMT
- Title: DiCTI: Diffusion-based Clothing Designer via Text-guided Input
- Authors: Ajda Lampe, Julija Stopar, Deepak Kumar Jain, Shinichiro Omachi, Peter Peer, Vitomir Štruc
- Abstract summary: DiCTI (Diffusion-based Clothing Designer via Text-guided Input) allows designers to quickly visualize fashion-related ideas using text inputs only.
By leveraging a powerful diffusion-based inpainting model conditioned on text inputs, DiCTI is able to synthesize convincing, high-quality images with varied clothing designs.
- Score: 5.275658744475251
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Recent developments in deep generative models have opened up a wide range of opportunities for image synthesis, leading to significant changes in various creative fields, including the fashion industry. While numerous methods have been proposed to benefit buyers, particularly in virtual try-on applications, there has been relatively less focus on facilitating fast prototyping for designers and customers seeking to order new designs. To address this gap, we introduce DiCTI (Diffusion-based Clothing Designer via Text-guided Input), a straightforward yet highly effective approach that allows designers to quickly visualize fashion-related ideas using text inputs only. Given an image of a person and a description of the desired garments as input, DiCTI automatically generates multiple high-resolution, photorealistic images that capture the expressed semantics. By leveraging a powerful diffusion-based inpainting model conditioned on text inputs, DiCTI is able to synthesize convincing, high-quality images with varied clothing designs that viably follow the provided text descriptions, while being able to process very diverse and challenging inputs, captured in completely unconstrained settings. We evaluate DiCTI in comprehensive experiments on two different datasets (VITON-HD and Fashionpedia) and in comparison to the state of the art (SoTA). The results of our experiments show that DiCTI convincingly outperforms the SoTA competitor in generating higher quality images with more elaborate garments and superior text prompt adherence, both according to standard quantitative evaluation measures and human ratings gathered as part of a user study.
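The core mechanism described in the abstract, replacing the clothing region of a person image with garments synthesized from a text prompt, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the Hugging Face diffusers inpainting pipeline as a stand-in for DiCTI's diffusion backbone, a placeholder checkpoint, and a precomputed clothing mask (e.g., from a human parser), none of which are specified in the listing above.

```python
# Minimal sketch of text-conditioned inpainting over a clothing region.
# Assumptions (not from the paper): diffusers' StableDiffusionInpaintPipeline
# stands in for DiCTI's inpainting model, and the garment mask is precomputed.
import torch
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting",  # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

person = Image.open("person.jpg").convert("RGB").resize((512, 512))
# White pixels mark the garment region to be replaced; black pixels are kept.
mask = Image.open("clothing_mask.png").convert("L").resize((512, 512))

prompt = "a flowing emerald-green evening gown with lace sleeves"
images = pipe(
    prompt=prompt,
    image=person,
    mask_image=mask,
    num_images_per_prompt=4,  # several candidate designs per prompt
    guidance_scale=7.5,
).images

for i, img in enumerate(images):
    img.save(f"design_{i}.png")
```

Varying the prompt while keeping the same person image and mask yields alternative designs on the same subject, which mirrors the fast-prototyping use case the abstract describes.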
Related papers
- TextCraftor: Your Text Encoder Can be Image Quality Controller [65.27457900325462]
Diffusion-based text-to-image generative models, e.g., Stable Diffusion, have revolutionized the field of content generation.
We propose a fine-tuning approach, TextCraftor, to enhance the performance of text-to-image diffusion models.
arXiv Detail & Related papers (2024-03-27T19:52:55Z) - Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
This paper tackles the task of multimodal-conditioned fashion image editing.
Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures.
arXiv Detail & Related papers (2024-03-21T20:43:10Z) - Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis [65.7968515029306]
We propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for Pose-Guided Person Image Synthesis (PGPIS).
A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt.
arXiv Detail & Related papers (2024-02-28T06:07:07Z) - Hierarchical Fashion Design with Multi-stage Diffusion Models [17.848891542772446]
Cross-modal fashion synthesis and editing offer intelligent support to fashion designers.
Current diffusion models demonstrate commendable stability and controllability in image synthesis.
We propose HieraFashDiff, a novel fashion design method based on a shared multi-stage diffusion model.
arXiv Detail & Related papers (2024-01-15T03:38:57Z) - Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation on NYU Depth V2 and KITTI, and in semantic segmentation on Cityscapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z) - Quality and Quantity: Unveiling a Million High-Quality Images for Text-to-Image Synthesis in Fashion Design [14.588884182004277]
We present the Fashion-Diffusion dataset, the product of multiple years of rigorous effort.
The dataset comprises over a million high-quality fashion images, paired with detailed text descriptions.
To foster standardization in the T2I-based fashion design field, we propose a new benchmark for evaluating the performance of fashion design models.
arXiv Detail & Related papers (2023-11-19T06:43:11Z) - UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z) - Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images.
We tackle this problem by proposing a new architecture based on latent diffusion models.
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z) - Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation [63.061871048769596]
We present a novel Unified Multi-Modal Latent Diffusion (UMM-Diffusion) which takes joint texts and images containing specified subjects as input sequences.
To be more specific, both input texts and images are encoded into one unified multi-modal latent space.
Our method is able to generate high-quality images with complex semantics from both aspects of input texts and images.
arXiv Detail & Related papers (2023-03-16T13:50:20Z) - C-VTON: Context-Driven Image-Based Virtual Try-On Network [1.0832844764942349]
We propose a Context-Driven Virtual Try-On Network (C-VTON) that convincingly transfers selected clothing items to the target subjects.
At the core of the C-VTON pipeline are: (i) a geometric matching procedure that efficiently aligns the target clothing with the pose of the person in the input images, and (ii) a powerful image generator that utilizes various types of contextual information when synthesizing the final try-on result.
arXiv Detail & Related papers (2022-12-08T17:56:34Z) - Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.