StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
- URL: http://arxiv.org/abs/2312.01725v1
- Date: Mon, 4 Dec 2023 08:27:59 GMT
- Title: StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On
- Authors: Jeongho Kim, Gyojung Gu, Minho Park, Sunghyun Park, and Jaegul Choo
- Abstract summary: Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image.
In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task.
Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process.
- Score: 35.227896906556026
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Given a clothing image and a person image, an image-based virtual try-on aims
to generate a customized image that appears natural and accurately reflects the
characteristics of the clothing image. In this work, we aim to expand the
applicability of the pre-trained diffusion model so that it can be utilized
independently for the virtual try-on task. The main challenge is to preserve the
clothing details while effectively utilizing the robust generative capability
of the pre-trained model. In order to tackle these issues, we propose
StableVITON, learning the semantic correspondence between the clothing and the
human body within the latent space of the pre-trained diffusion model in an
end-to-end manner. Our proposed zero cross-attention blocks not only preserve
the clothing details by learning the semantic correspondence but also generate
high-fidelity images by utilizing the inherent knowledge of the pre-trained
model in the warping process. Through our proposed novel attention total
variation loss and by applying augmentation, we achieve a sharp attention map,
resulting in a more precise representation of clothing details. StableVITON
outperforms the baselines in qualitative and quantitative evaluation, showing
promising quality in arbitrary person images. Our code is available at
https://github.com/rlawjdghek/StableVITON.
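To make the mechanism described in the abstract concrete, the following is a minimal PyTorch-style sketch of a zero-initialized cross-attention block that attends person features to clothing features in latent space, together with a simplified total-variation penalty on the resulting attention maps. All class, function, and argument names are illustrative assumptions rather than the official StableVITON code, and the paper's exact attention total variation loss may differ from this simplified form.

```python
import torch
import torch.nn as nn


class ZeroCrossAttentionBlock(nn.Module):
    """Cross-attends person (UNet) latent features to clothing latent features.

    Hypothetical sketch: the output projection is zero-initialized, so at the
    start of training the block leaves the pre-trained UNet features unchanged.
    """

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj_out = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj_out.weight)  # "zero" block: no effect at step 0
        nn.init.zeros_(self.proj_out.bias)

    def forward(self, person_tokens, cloth_tokens):
        # person_tokens: (B, N, C) flattened person-branch feature map
        # cloth_tokens:  (B, M, C) flattened clothing-encoder feature map
        q = self.norm_q(person_tokens)
        kv = self.norm_kv(cloth_tokens)
        attended, attn_weights = self.attn(q, kv, kv, need_weights=True,
                                           average_attn_weights=True)
        return person_tokens + self.proj_out(attended), attn_weights


def attention_tv_loss(attn_weights, h, w):
    """Simplified total-variation penalty that encourages spatially coherent,
    sharp attention over the clothing feature map (reshaped to h x w)."""
    batch, n_query, n_key = attn_weights.shape
    assert n_key == h * w
    a = attn_weights.reshape(batch * n_query, 1, h, w)
    tv_w = (a[..., :, 1:] - a[..., :, :-1]).abs().mean()
    tv_h = (a[..., 1:, :] - a[..., :-1, :]).abs().mean()
    return tv_w + tv_h


# Toy usage with made-up feature-map sizes:
block = ZeroCrossAttentionBlock(dim=320)
person = torch.randn(2, 32 * 24, 320)
cloth = torch.randn(2, 32 * 24, 320)
out, attn = block(person, cloth)
tv = attention_tv_loss(attn, h=32, w=24)
```

Zero-initializing the output projection lets the block add clothing information gradually without disturbing the pre-trained diffusion UNet at the start of training.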
Related papers
- Improving Virtual Try-On with Garment-focused Diffusion Models [91.95830983115474]
Diffusion models have revolutionized generative modeling across numerous image synthesis tasks.
We design a new diffusion model, namely GarDiff, which performs a garment-focused diffusion process.
Experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
arXiv Detail & Related papers (2024-09-12T17:55:11Z)
- IMAGDressing-v1: Customizable Virtual Dressing [58.44155202253754]
IMAGDressing-v1 targets a virtual dressing task: generating freely editable human images with fixed garments and optional conditions.
IMAGDressing-v1 incorporates a garment UNet that captures semantic features from CLIP and texture features from a VAE.
We present a hybrid attention module, including a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet.
arXiv Detail & Related papers (2024-07-17T16:26:30Z)
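To illustrate the hybrid attention idea above, here is a brief, hedged sketch of a layer combining a frozen self-attention path with a trainable cross-attention path that injects garment-UNet features into a frozen denoising UNet. The class name, the gating parameter, and the tensor shapes are assumptions made for this illustration, not the IMAGDressing-v1 implementation.

```python
import torch
import torch.nn as nn


class HybridAttention(nn.Module):
    """Frozen self-attention plus trainable cross-attention over garment features."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))  # learnable blend weight (assumption)
        for p in self.self_attn.parameters():     # the pre-trained path stays frozen
            p.requires_grad = False

    def forward(self, x, garment_tokens):
        # x: (B, N, C) features of the frozen denoising UNet
        # garment_tokens: (B, M, C) features from the garment UNet
        sa, _ = self.self_attn(x, x, x)
        ca, _ = self.cross_attn(x, garment_tokens, garment_tokens)
        return x + sa + torch.tanh(self.gate) * ca
```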
- Masked Extended Attention for Zero-Shot Virtual Try-On In The Wild [17.025262797698364]
Virtual Try-On aims to replace a piece of garment in an image with one from another, while preserving person and garment characteristics as well as image fidelity.
Current literature takes a supervised approach for the task, impairing generalization and imposing heavy computation.
We present a novel zero-shot training-free method for inpainting a clothing garment by reference.
arXiv Detail & Related papers (2024-06-21T17:45:37Z)
- GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon [5.790630195329777]
We introduce a novel graph-based warping technique which emphasizes the value of context in garment flow.
Our method, validated on the VITON-HD and DressCode datasets, showcases substantial improvements in garment warping, texture preservation, and overall realism.
arXiv Detail & Related papers (2024-06-04T10:29:18Z)
- Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On [29.217423805933727]
Diffusion model-based approaches have recently become popular, as they are excellent at image synthesis tasks.
We propose a Texture-Preserving Diffusion (TPD) model for virtual try-on, which enhances the fidelity of the results.
We also propose a diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images.
arXiv Detail & Related papers (2024-04-01T12:43:22Z)
- Improving Diffusion Models for Authentic Virtual Try-on in the Wild [53.96244595495942]
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment.
We propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images.
We present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity.
arXiv Detail & Related papers (2024-03-08T08:12:18Z)
- Taming the Power of Diffusion Models for High-Quality Virtual Try-On with Appearance Flow [24.187109053871833]
Virtual try-on is a critical image synthesis task that aims to transfer clothes from one image to another while preserving the details of both humans and clothes.
We propose an exemplar-based inpainting approach that leverages a warping module to guide the diffusion model's generation effectively.
Our approach, namely Diffusion-based Conditional Inpainting for Virtual Try-ON (DCI-VTON), effectively utilizes the power of the diffusion model.
arXiv Detail & Related papers (2023-08-11T12:23:09Z)
- Person Image Synthesis via Denoising Diffusion Model [116.34633988927429]
We show how denoising diffusion models can be applied for high-fidelity person image synthesis.
Our results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios.
arXiv Detail & Related papers (2022-11-22T18:59:50Z)
- PoNA: Pose-guided Non-local Attention for Human Pose Transfer [105.14398322129024]
We propose a new human pose transfer method using a generative adversarial network (GAN) with simplified cascaded blocks.
Our model generates sharper and more realistic images with rich details, while having fewer parameters and faster speed.
arXiv Detail & Related papers (2020-12-13T12:38:29Z)
- Towards Photo-Realistic Virtual Try-On by Adaptively Generating$\leftrightarrow$Preserving Image Content [85.24260811659094]
We propose a novel visual try-on network, namely Adaptive Content Generating and Preserving Network (ACGPN).
ACGPN first predicts the semantic layout of the reference image that will be changed after try-on.
Second, a clothes warping module warps clothing images according to the generated semantic layout.
Third, an inpainting module for content fusion integrates all information (e.g., the reference image, semantic layout, and warped clothes) to adaptively produce each semantic part of the human body.
arXiv Detail & Related papers (2020-03-12T15:55:39Z)
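As a rough illustration of the three-stage structure described in the ACGPN entry above, here is a schematic pipeline; each stage is a placeholder module, and every name and call signature is an assumption made for this sketch rather than the published implementation.

```python
import torch.nn as nn


class ACGPNPipeline(nn.Module):
    """Schematic three-stage try-on pipeline: layout -> warping -> fusion."""

    def __init__(self, layout_net: nn.Module, warp_net: nn.Module, fusion_net: nn.Module):
        super().__init__()
        self.layout_net = layout_net  # 1) predicts the post-try-on semantic layout
        self.warp_net = warp_net      # 2) warps the clothing image to that layout
        self.fusion_net = fusion_net  # 3) inpaints/fuses content for each body part

    def forward(self, person_img, cloth_img, pose):
        layout = self.layout_net(person_img, pose, cloth_img)     # semantic masks
        warped_cloth = self.warp_net(cloth_img, layout)           # aligned garment
        return self.fusion_net(person_img, layout, warped_cloth)  # final try-on image
```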