Related papers: FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion

FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion

URL: http://arxiv.org/abs/2404.18591v1
Date: Fri, 26 Apr 2024 14:59:42 GMT
Title: FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion
Authors: Abhishek Kumar Singh, Ioannis Patras,
Abstract summary: This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models. We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data.
Score: 11.646594594565098
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The rapid evolution of the fashion industry increasingly intersects with technological advancements, particularly through the integration of generative AI. This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models. Utilizing ControlNet and LoRA fine-tuning, our approach generates high-quality images from multimodal inputs such as text and sketches. We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data. Our evaluation, utilizing metrics like FID, CLIP Score, and KID, demonstrates that our model significantly outperforms traditional stable diffusion models. The results not only highlight the effectiveness of our model in generating fashion-appropriate outputs but also underscore the potential of diffusion models in revolutionizing fashion design workflows. This research paves the way for more interactive, personalized, and technologically enriched methodologies in fashion design and representation, bridging the gap between creative vision and practical application.

Related papers

Mogao: An Omni Foundation Model for Interleaved Multi-Modal Generation [54.588082888166504]
We present Mogao, a unified framework that enables interleaved multi-modal generation through a causal approach.<n>Mogoo integrates a set of key technical improvements in architecture design, including a deep-fusion design, dual vision encoders, interleaved rotary position embeddings, and multi-modal classifier-free guidance.<n>Experiments show that Mogao achieves state-of-the-art performance in multi-modal understanding and text-to-image generation, but also excels in producing high-quality, coherent interleaved outputs.
arXiv Detail & Related papers (2025-05-08T17:58:57Z)
Cross-Cultural Fashion Design via Interactive Large Language Models and Diffusion Models [0.0]
Fashion content generation is an emerging area at the intersection of artificial intelligence and creative design. Existing methods struggle with cultural bias, limited scalability, and alignment between textual prompts and generated visuals. We propose a novel framework that integrates Large Language Models (LLMs) with Latent Diffusion Models (LDMs) to address these challenges.
arXiv Detail & Related papers (2025-01-26T15:57:16Z)
DiffusionTrend: A Minimalist Approach to Virtual Fashion Try-On [103.89972383310715]
DiffusionTrend harnesses latent information rich in prior information to capture the nuances of garment details. It delivers a visually compelling try-on experience, underscoring the potential of training-free diffusion model.
arXiv Detail & Related papers (2024-12-19T02:24:35Z)
Automatic Generation of Fashion Images using Prompting in Generative Machine Learning Models [1.8817715864806608]
This work investigates methodologies for generating tailored fashion descriptions using two distinct Large Language Models and a Stable Diffusion model for fashion image creation. Emphasizing adaptability in AI-driven fashion creativity, we focus on prompting techniques, such as zero-shot and few-shot learning. Evaluation combines quantitative metrics such as CLIPscore with qualitative human judgment, highlighting strengths in creativity, coherence, and aesthetic appeal across diverse styles.
arXiv Detail & Related papers (2024-07-20T17:37:51Z)
YaART: Yet Another ART Rendering Technology [119.09155882164573]
This study introduces YaART, a novel production-grade text-to-image cascaded diffusion model aligned to human preferences. We analyze how these choices affect both the efficiency of the training process and the quality of the generated images. We demonstrate that models trained on smaller datasets of higher-quality images can successfully compete with those trained on larger datasets.
arXiv Detail & Related papers (2024-04-08T16:51:19Z)
Multimodal-Conditioned Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
This paper tackles the task of multimodal-conditioned fashion image editing. Our approach aims to generate human-centric fashion images guided by multimodal prompts, including text, human body poses, garment sketches, and fabric textures.
arXiv Detail & Related papers (2024-03-21T20:43:10Z)
StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data [129.92449761766025]
We propose a novel data collection methodology that synchronously synthesizes images and dialogues for visual instruction tuning. This approach harnesses the power of generative models, marrying the abilities of ChatGPT and text-to-image generative models. Our research includes comprehensive experiments conducted on various datasets.
arXiv Detail & Related papers (2023-08-20T12:43:52Z)
LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On [35.4056826207203]
This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task. The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module. We show that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
arXiv Detail & Related papers (2023-05-22T21:38:06Z)
Interactive Fashion Content Generation Using LLMs and Latent Diffusion Models [0.0]
Fashionable image generation aims to synthesize images of diverse fashion prevalent around the globe. We propose a method exploiting the equivalence between diffusion models and energy-based models (EBMs) Our results indicate that using an LLM to refine the prompts to the latent diffusion model assists in generating globally creative and culturally diversified fashion styles.
arXiv Detail & Related papers (2023-05-15T18:38:25Z)
Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing [40.70752781891058]
We propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images. We tackle this problem by proposing a new architecture based on latent diffusion models. Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets.
arXiv Detail & Related papers (2023-04-04T18:03:04Z)
IRGen: Generative Modeling for Image Retrieval [82.62022344988993]
In this paper, we present a novel methodology, reframing image retrieval as a variant of generative modeling. We develop our model, dubbed IRGen, to address the technical challenge of converting an image into a concise sequence of semantic units. Our model achieves state-of-the-art performance on three widely-used image retrieval benchmarks and two million-scale datasets.
arXiv Detail & Related papers (2023-03-17T17:07:36Z)
Unified Discrete Diffusion for Simultaneous Vision-Language Generation [78.21352271140472]
We present a unified multimodal generation model that can conduct both the "modality translation" and "multi-modality generation" tasks. Specifically, we unify the discrete diffusion process for multimodal signals by proposing a unified transition matrix. Our proposed method can perform comparably to the state-of-the-art solutions in various generation tasks.
arXiv Detail & Related papers (2022-11-27T14:46:01Z)
Draw Your Art Dream: Diverse Digital Art Synthesis with Multimodal Guided Diffusion [78.47285788155818]
Current digital art synthesis methods usually use single-modality inputs as guidance. We propose the multimodal guided artwork diffusion (MGAD) model, which is a diffusion-based digital artwork generation approach.
arXiv Detail & Related papers (2022-09-27T13:10:25Z)
DIME: Fine-grained Interpretations of Multimodal Models via Disentangled Local Explanations [119.1953397679783]
We focus on advancing the state-of-the-art in interpreting multimodal models. Our proposed approach, DIME, enables accurate and fine-grained analysis of multimodal models.
arXiv Detail & Related papers (2022-03-03T20:52:47Z)

This list is automatically generated from the titles and abstracts of the papers in this site.