LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
- URL: http://arxiv.org/abs/2305.13501v3
- Date: Thu, 3 Aug 2023 13:51:22 GMT
- Title: LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
- Authors: Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia,
Marco Bertini, Rita Cucchiara
- Abstract summary: This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task.
The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module.
We show that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
- Score: 35.4056826207203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek
innovative approaches to enhance the consumer experience. At the same time,
recent advancements in the development of diffusion models have enabled
generative networks to create remarkably realistic images. In this context,
image-based virtual try-on, which consists in generating a novel image of a
target model wearing a given in-shop garment, has yet to capitalize on the
potential of these powerful generative solutions. This work introduces
LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the
Virtual Try-ON task. The proposed architecture relies on a latent diffusion
model extended with a novel additional autoencoder module that exploits
learnable skip connections to enhance the generation process while preserving the
model's characteristics. To effectively maintain the texture and details of the
in-shop garment, we propose a textual inversion component that can map the
visual features of the garment to the CLIP token embedding space and thus
generate a set of pseudo-word token embeddings capable of conditioning the
generation process. Experimental results on Dress Code and VITON-HD datasets
demonstrate that our approach outperforms the competitors by a consistent
margin, achieving a significant milestone for the task. Source code and trained
models are publicly available at: https://github.com/miccunifi/ladi-vton.
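The textual inversion component described in the abstract maps visual features of the in-shop garment into the CLIP token embedding space, producing pseudo-word token embeddings that condition generation. Below is a minimal sketch of that idea; the module name GarmentMapper, the feature dimensions, and the number of pseudo tokens are illustrative assumptions rather than the paper's actual configuration (see the linked repository for the real implementation).

```python
import torch
import torch.nn as nn

class GarmentMapper(nn.Module):
    """Illustrative textual-inversion mapper (hypothetical names/sizes):
    projects garment image features into the CLIP token-embedding space,
    producing a set of pseudo-word token embeddings."""

    def __init__(self, img_feat_dim=1024, token_dim=768, num_pseudo_tokens=16):
        super().__init__()
        self.token_dim = token_dim
        self.num_pseudo_tokens = num_pseudo_tokens
        self.net = nn.Sequential(
            nn.Linear(img_feat_dim, token_dim * 2),
            nn.GELU(),
            nn.Linear(token_dim * 2, token_dim * num_pseudo_tokens),
        )

    def forward(self, garment_features):
        # garment_features: (B, img_feat_dim) visual features of the in-shop garment
        return self.net(garment_features).view(-1, self.num_pseudo_tokens, self.token_dim)

# Conditioning sketch: the pseudo-word embeddings are concatenated with the
# token embeddings of an ordinary text prompt, and the combined sequence is
# used as the conditioning input for the diffusion model's cross-attention.
mapper = GarmentMapper()
garment_feats = torch.randn(2, 1024)        # e.g. CLIP image features (assumed dim)
prompt_embeds = torch.randn(2, 77, 768)     # embeddings of a tokenized text prompt
pseudo_tokens = mapper(garment_feats)       # (2, 16, 768)
conditioning = torch.cat([prompt_embeds, pseudo_tokens], dim=1)  # (2, 93, 768)
```

In this way the cross-attention layers can attend to garment-specific detail that a plain text prompt cannot describe.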
Related papers
- Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [62.06970466554273]
We present Meissonic, which elevates non-autoregressive masked image modeling (MIM) text-to-image generation to a level comparable with state-of-the-art diffusion models like SDXL.
We leverage high-quality training data, integrate micro-conditions informed by human preference scores, and employ feature compression layers to further enhance image fidelity and resolution.
Our model not only matches but often exceeds the performance of existing models like SDXL in generating high-quality, high-resolution images.
arXiv Detail & Related papers (2024-10-10T17:59:17Z)
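Meissonic, in the entry above, builds on non-autoregressive masked image modeling: the image is represented as discrete tokens, a random subset is masked, and a bidirectional transformer predicts the masked tokens conditioned on text. A toy, MaskGIT-style training step is sketched below; the tokenizer, codebook size, and conditioning scheme are illustrative assumptions, not Meissonic's actual architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, SEQ_LEN, DIM, MASK_ID = 8192, 256, 512, 8191  # assumed codebook/sequence sizes

class TinyMIM(nn.Module):
    """Toy bidirectional transformer for masked image-token prediction."""
    def __init__(self):
        super().__init__()
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        self.pos_emb = nn.Parameter(torch.zeros(1, SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(DIM, VOCAB)

    def forward(self, tokens, text_cond):
        # text_cond: (B, DIM) pooled text embedding, added as a simple bias
        x = self.tok_emb(tokens) + self.pos_emb + text_cond[:, None, :]
        return self.head(self.encoder(x))

model = TinyMIM()
tokens = torch.randint(0, VOCAB - 1, (4, SEQ_LEN))  # tokens from a VQ image tokenizer (assumed)
text_cond = torch.randn(4, DIM)                     # text-encoder output (assumed)
mask = torch.rand(4, SEQ_LEN) < 0.5                 # random masking ratio
inputs = tokens.masked_fill(mask, MASK_ID)          # MASK_ID is reserved, never a real token
logits = model(inputs, text_cond)
loss = F.cross_entropy(logits[mask], tokens[mask])  # predict only the masked tokens
loss.backward()
```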
- ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
We propose ACE, an All-round Creator and Editor, for visual generation tasks.
We first introduce a unified condition format termed Long-context Condition Unit (LCU).
We then propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks.
arXiv Detail & Related papers (2024-09-30T17:56:27Z)
- Towards Multi-Task Multi-Modal Models: A Video Generative Perspective [5.495245220300184]
This thesis chronicles our endeavor to build multi-task models for generating videos and other modalities under diverse conditions.
We unveil a novel approach to mapping bidirectionally between visual observation and interpretable lexical terms.
Our scalable visual token representation proves beneficial across generation, compression, and understanding tasks.
arXiv Detail & Related papers (2024-05-26T23:56:45Z)
- FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion [11.646594594565098]
This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models.
We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data.
arXiv Detail & Related papers (2024-04-26T14:59:42Z)
- Improving Diffusion Models for Authentic Virtual Try-on in the Wild [53.96244595495942]
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment.
We propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images.
We present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity.
arXiv Detail & Related papers (2024-03-08T08:12:18Z)
- Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks.
We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception.
Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
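The meta-prompts approach above keeps the pre-trained diffusion backbone fixed and learns only a small set of prompt embeddings that steer feature extraction for perception tasks such as depth estimation and segmentation. A minimal sketch follows, with a toy cross-attention block standing in for the diffusion U-Net's attention layers; all module names and sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class MetaPromptedBlock(nn.Module):
    """Toy stand-in for a diffusion U-Net attention block: frozen weights,
    plus a handful of learnable 'meta prompt' embeddings used as extra
    cross-attention context to steer feature extraction."""

    def __init__(self, dim=320, num_prompts=8):
        super().__init__()
        self.meta_prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, feats):
        # feats: (B, HW, dim) spatial features flattened to a sequence
        prompts = self.meta_prompts.expand(feats.size(0), -1, -1)
        out, _ = self.attn(query=feats, key=prompts, value=prompts)
        return feats + out  # residual; 'out' carries prompt-conditioned information

block = MetaPromptedBlock()
# Freeze everything except the meta prompts, mimicking prompt-only tuning.
for name, p in block.named_parameters():
    p.requires_grad = name == "meta_prompts"
feats = torch.randn(2, 64 * 64, 320)
out = block(feats)  # features to feed a lightweight depth/segmentation head
```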
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
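SODA, summarized above, distills a source view into a compact latent with an image encoder and lets that bottleneck guide a denoising decoder on related novel views. The sketch below conditions a toy denoiser on the bottleneck vector via feature-wise (FiLM-style) modulation; the architecture, sizes, and modulation choice are illustrative assumptions rather than SODA's exact design.

```python
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """Distils a source view into a compact latent z (the bottleneck)."""
    def __init__(self, z_dim=128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, z_dim),
        )

    def forward(self, x):
        return self.backbone(x)

class ConditionedDenoiser(nn.Module):
    """Toy denoiser whose features are modulated (FiLM-style) by z."""
    def __init__(self, z_dim=128, ch=64):
        super().__init__()
        self.inp = nn.Conv2d(3, ch, 3, padding=1)
        self.film = nn.Linear(z_dim, 2 * ch)  # per-channel scale and shift
        self.out = nn.Conv2d(ch, 3, 3, padding=1)

    def forward(self, noisy_target, z):
        h = self.inp(noisy_target)
        scale, shift = self.film(z).chunk(2, dim=-1)
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return self.out(h)  # predicted noise for the target view

enc, den = BottleneckEncoder(), ConditionedDenoiser()
source, noisy_target = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
pred_noise = den(noisy_target, enc(source))  # z guides denoising of the novel view
```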
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noises with repeating steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
- UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning [86.91893533388628]
This paper presents UniDiff, a unified multi-modal model that integrates image-text contrastive learning (ITC), text-conditioned image synthesis learning (IS), and reciprocal semantic consistency modeling (RSC).
UniDiff demonstrates versatility in both multi-modal understanding and generative tasks.
arXiv Detail & Related papers (2023-06-01T15:39:38Z)
- Interactive Fashion Content Generation Using LLMs and Latent Diffusion Models [0.0]
Fashionable image generation aims to synthesize images of diverse fashion prevalent around the globe.
We propose a method exploiting the equivalence between diffusion models and energy-based models (EBMs).
Our results indicate that using an LLM to refine the prompts to the latent diffusion model assists in generating globally creative and culturally diversified fashion styles.
arXiv Detail & Related papers (2023-05-15T18:38:25Z)
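The last entry refines user prompts with an LLM before handing them to a latent diffusion model. A minimal sketch using the Hugging Face diffusers Stable Diffusion pipeline is shown below; the refine_prompt helper is a hypothetical placeholder for the LLM step, and the checkpoint ID is just a commonly used public model, not the paper's setup.

```python
import torch
from diffusers import StableDiffusionPipeline

def refine_prompt(user_prompt: str) -> str:
    """Hypothetical placeholder: in the paper's setting an LLM rewrites the
    prompt with richer, culturally specific fashion detail. Here we only
    append a fixed suffix so the sketch runs without an LLM."""
    return user_prompt + ", detailed fabric texture, studio lighting, full-body shot"

# Illustrative checkpoint; any Stable Diffusion weights available locally work.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = refine_prompt("a traditional festival outfit")
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("fashion_sample.png")
```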
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.