LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
- URL: http://arxiv.org/abs/2305.13501v3
- Date: Thu, 3 Aug 2023 13:51:22 GMT
- Title: LaDI-VTON: Latent Diffusion Textual-Inversion Enhanced Virtual Try-On
- Authors: Davide Morelli, Alberto Baldrati, Giuseppe Cartella, Marcella Cornia,
Marco Bertini, Rita Cucchiara
- Abstract summary: This work introduces LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the Virtual Try-ON task.
The proposed architecture relies on a latent diffusion model extended with a novel additional autoencoder module.
We show that our approach outperforms the competitors by a consistent margin, achieving a significant milestone for the task.
- Score: 35.4056826207203
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The rapidly evolving fields of e-commerce and metaverse continue to seek
innovative approaches to enhance the consumer experience. At the same time,
recent advancements in the development of diffusion models have enabled
generative networks to create remarkably realistic images. In this context,
image-based virtual try-on, which consists in generating a novel image of a
target model wearing a given in-shop garment, has yet to capitalize on the
potential of these powerful generative solutions. This work introduces
LaDI-VTON, the first Latent Diffusion textual Inversion-enhanced model for the
Virtual Try-ON task. The proposed architecture relies on a latent diffusion
model extended with a novel additional autoencoder module that exploits
learnable skip connections to enhance the generation process preserving the
model's characteristics. To effectively maintain the texture and details of the
in-shop garment, we propose a textual inversion component that can map the
visual features of the garment to the CLIP token embedding space and thus
generate a set of pseudo-word token embeddings capable of conditioning the
generation process. Experimental results on Dress Code and VITON-HD datasets
demonstrate that our approach outperforms the competitors by a consistent
margin, achieving a significant milestone for the task. Source code and trained
models are publicly available at: https://github.com/miccunifi/ladi-vton.
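The textual inversion idea described in the abstract can be illustrated with a toy sketch: garment features are projected into the token embedding space, and the resulting pseudo-word embeddings replace a placeholder token in the prompt. This is only an illustration under stated assumptions, not the authors' implementation. The function names, the linear projection, the random weights, and the tiny dimensions are all hypothetical; the abstract does not specify the adapter architecture, and a real system would use trained weights and CLIP-sized embeddings.

```python
import random


def make_projection(in_dim, out_dim, n_tokens, seed=0):
    """Build random weights standing in for a trained adapter (hypothetical).

    Returns n_tokens matrices, each of shape out_dim x in_dim.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
        for _ in range(n_tokens)
    ]


def garment_to_pseudo_tokens(features, weights):
    """Map one garment feature vector to n_tokens pseudo-word embeddings."""
    return [
        [sum(w * x for w, x in zip(row, features)) for row in token_weights]
        for token_weights in weights
    ]


def splice_into_prompt(prompt_embeddings, pseudo_tokens, placeholder_index):
    """Replace the placeholder token embedding with the pseudo-word embeddings."""
    return (
        prompt_embeddings[:placeholder_index]
        + pseudo_tokens
        + prompt_embeddings[placeholder_index + 1:]
    )


# Toy sizes (assumptions): 8-dim garment features, 4-dim token embeddings,
# 3 pseudo-word tokens, and a 5-token prompt whose token 2 is the placeholder.
weights = make_projection(in_dim=8, out_dim=4, n_tokens=3)
garment_features = [1.0] * 8
pseudo_tokens = garment_to_pseudo_tokens(garment_features, weights)
prompt = [[0.0] * 4 for _ in range(5)]
conditioning = splice_into_prompt(prompt, pseudo_tokens, placeholder_index=2)
```

In this sketch the spliced sequence would then be passed through the text encoder to condition the denoising network; that step is omitted here because it depends on the specific model.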
Related papers
- Exploring Representation-Aligned Latent Space for Better Generation [86.45670422239317]
We introduce ReaLS, which integrates semantic priors to improve generation performance.
We show that fundamental DiT and SiT trained on ReaLS can achieve a 15% improvement in FID metric.
The enhanced semantic latent space enables more perceptual downstream tasks, such as segmentation and depth estimation.
arXiv Detail & Related papers (2025-02-01T07:42:12Z)
- ITVTON: Virtual Try-On Diffusion Transformer Model Based on Integrated Image and Text [0.0]
We introduce ITVTON, a method that enhances clothing-character interactions by combining clothing and character images along spatial channels as inputs.
We incorporate integrated textual descriptions from multiple images to boost the realism of the generated visual effects.
In experiments, ITVTON outperforms baseline methods both qualitatively and quantitatively.
arXiv Detail & Related papers (2025-01-28T07:24:15Z)
- ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement [49.513401043490305]
This work explores the continual general pre-training of text-to-video models.
We break this task into two key aspects: increasing model capacity and improving semantic understanding.
For semantic understanding, we propose a method that leverages large language models as advanced text encoders.
arXiv Detail & Related papers (2024-12-25T18:58:07Z)
- ACDiT: Interpolating Autoregressive Conditional Modeling and Diffusion Transformer [95.80384464922147]
Continuous visual generation requires the full-sequence diffusion-based approach.
We present ACDiT, an Autoregressive blockwise Conditional Diffusion Transformer.
We demonstrate that ACDiT can be seamlessly used in visual understanding tasks despite being trained on the diffusion objective.
arXiv Detail & Related papers (2024-12-10T18:13:20Z)
- TryOffDiff: Virtual-Try-Off via High-Fidelity Garment Reconstruction using Diffusion Models [8.158200403139196]
This paper introduces Virtual Try-Off (VTOFF), a novel task focused on generating standardized garment images from single photos of clothed individuals.
We present TryOffDiff, a model that adapts Stable Diffusion with SigLIP-based visual conditioning to ensure high fidelity and detail retention.
Our results highlight the potential of VTOFF to enhance product imagery in e-commerce applications, advance generative model evaluation, and inspire future work on high-fidelity reconstruction.
arXiv Detail & Related papers (2024-11-27T13:53:09Z)
- LaVin-DiT: Large Vision Diffusion Transformer [99.98106406059333]
LaVin-DiT is a scalable and unified foundation model designed to tackle over 20 computer vision tasks in a generative framework.
We introduce key innovations to optimize generative performance for vision tasks.
The model is scaled from 0.1B to 3.4B parameters, demonstrating substantial scalability and state-of-the-art performance across diverse vision tasks.
arXiv Detail & Related papers (2024-11-18T12:05:27Z)
- ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer [40.32254040909614]
We propose ACE, an All-round Creator and Editor, for visual generation tasks.
We first introduce a unified condition format termed Long-context Condition Unit (LCU).
We then propose a novel Transformer-based diffusion model that uses LCU as input, aiming for joint training across various generation and editing tasks.
arXiv Detail & Related papers (2024-09-30T17:56:27Z)
- FashionSD-X: Multimodal Fashion Garment Synthesis using Latent Diffusion [11.646594594565098]
This study introduces a novel generative pipeline designed to transform the fashion design process by employing latent diffusion models.
We leverage and enhance state-of-the-art virtual try-on datasets, including Multimodal Dress Code and VITON-HD, by integrating sketch data.
arXiv Detail & Related papers (2024-04-26T14:59:42Z)
- SODA: Bottleneck Diffusion Models for Representation Learning [75.7331354734152]
We introduce SODA, a self-supervised diffusion model, designed for representation learning.
The model incorporates an image encoder, which distills a source view into a compact representation, that guides the generation of related novel views.
We show that by imposing a tight bottleneck between the encoder and a denoising decoder, we can turn diffusion models into strong representation learners.
arXiv Detail & Related papers (2023-11-29T18:53:34Z)
- RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model [93.8067369210696]
Text-to-image generation (TTI) refers to the usage of models that could process text input and generate high fidelity images based on text descriptions.
Diffusion models are one prominent type of generative model used for the generation of images through the systematic introduction of noises with repeating steps.
In the era of large models, scaling up model size and the integration with large language models have further improved the performance of TTI models.
arXiv Detail & Related papers (2023-09-02T03:27:20Z)
This list is automatically generated from the titles and abstracts of the papers on this site.