Related papers: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On

URL: http://arxiv.org/abs/2505.16977v1
Date: Thu, 22 May 2025 17:52:13 GMT
Title: Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On
Authors: Siqi Wan, Jingwen Chen, Yingwei Pan, Ting Yao, Tao Mei
Abstract summary: Diffusion models have shown success in the virtual try-on (VTON) task. Preserving the shape and every detail of the given garment remains challenging due to the intrinsic stochasticity of the diffusion model. We propose to explicitly capitalize on visual correspondence as the prior to tame the diffusion process.
Score: 89.9123806553489
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Diffusion models have shown preliminary success in virtual try-on (VTON) task. The typical dual-branch architecture comprises two UNets for implicit garment deformation and synthesized image generation respectively, and has emerged as the recipe for VTON task. Nevertheless, the problem remains challenging to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of diffusion model. To alleviate this issue, we novelly propose to explicitly capitalize on visual correspondence as the prior to tame diffusion process instead of simply feeding the whole garment into UNet as the appearance reference. Specifically, we interpret the fine-grained appearance and texture details as a set of structured semantic points, and match the semantic points rooted in garment to the ones over target person through local flow warping. Such 2D points are then augmented into 3D-aware cues with depth/normal map of target person. The correspondence mimics the way of putting clothing on human body and the 3D-aware cues act as semantic point matching to supervise diffusion model training. A point-focused diffusion loss is further devised to fully take the advantage of semantic point matching. Extensive experiments demonstrate strong garment detail preservation of our approach, evidenced by state-of-the-art VTON performances on both VITON-HD and DressCode datasets. Code is publicly available at: https://github.com/HiDream-ai/SPM-Diff.
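
The abstract mentions a point-focused diffusion loss but does not give its formula. The following is only a minimal sketch of one plausible reading, assuming the loss is a squared error on the denoising target in which locations covered by matched semantic points receive a larger weight. The function name `point_focused_loss`, the `point_weight` parameter, and the weighting scheme are all hypothetical and are not taken from the paper or the SPM-Diff code.

```python
from typing import List, Sequence, Tuple

Grid = List[List[float]]


def point_focused_loss(
    pred: Grid,
    target: Grid,
    points: Sequence[Tuple[int, int]],
    point_weight: float = 2.0,
) -> float:
    """Weighted mean squared error over a 2D grid.

    Hypothetical form: cells listed in `points` (row, col) are treated as
    semantic point locations and weighted by `point_weight`; every other
    cell has weight 1. This is an illustrative assumption, not the
    paper's definition.
    """
    focus = set(points)
    weighted_sum = 0.0
    weight_total = 0.0
    for r, (pred_row, target_row) in enumerate(zip(pred, target)):
        for c, (p, t) in enumerate(zip(pred_row, target_row)):
            w = point_weight if (r, c) in focus else 1.0
            weighted_sum += w * (p - t) ** 2
            weight_total += w
    if weight_total == 0.0:
        raise ValueError("pred and target must be non-empty grids")
    return weighted_sum / weight_total


# Toy usage: a 2x2 "noise prediction" with an error at the single semantic point.
pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 0.0]]
plain = point_focused_loss(pred, target, points=[])
focused = point_focused_loss(pred, target, points=[(0, 0)], point_weight=3.0)
```

In this toy case the error sits on the semantic point, so the focused loss is larger than the unweighted one. That is the intended effect of such a weighting: mistakes at matched garment details are penalized more heavily than mistakes elsewhere.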

Related papers

SSGaussian: Semantic-Aware and Structure-Preserving 3D Style Transfer [57.723850794113055]
We propose a novel 3D style transfer pipeline that integrates prior knowledge from pretrained 2D diffusion models. Our pipeline consists of two key stages: First, we leverage diffusion priors to generate stylized renderings of key viewpoints. The second is instance-level style transfer, which effectively leverages instance-level consistency across stylized key views and transfers it onto the 3D representation.
arXiv Detail & Related papers (2025-09-04T16:40:44Z)
Learning to Align and Refine: A Foundation-to-Diffusion Framework for Occlusion-Robust Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures. Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts. We propose a dual-stage Foundation-to-Diffusion framework that precisely aligns 2D prior guidance from vision foundation models.
arXiv Detail & Related papers (2025-03-22T14:42:27Z)
Improving Virtual Try-On with Garment-focused Diffusion Models [91.95830983115474]
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks. We shape a new diffusion model, namely GarDiff, which triggers the garment-focused diffusion process. Experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
arXiv Detail & Related papers (2024-09-12T17:55:11Z)
GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon [5.790630195329777]
We introduce a novel graph-based warping technique which emphasizes the value of context in garment flow. Our method, validated on the VITON-HD and DressCode datasets, showcases substantial improvement in garment warping, texture preservation, and overall realism.
arXiv Detail & Related papers (2024-06-04T10:29:18Z)
FLDM-VTON: Faithful Latent Diffusion Model for Virtual Try-on [21.34959824429241]
FLDM-VTON is a novel Faithful Latent Diffusion Model for VTON. It incorporates clothes as both the starting point and local condition, supplying the model with faithful clothes priors. It is able to generate photo-realistic try-on images with faithful clothing details.
arXiv Detail & Related papers (2024-04-22T13:21:09Z)
Improving Diffusion Models for Authentic Virtual Try-on in the Wild [53.96244595495942]
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment. We propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images. We present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity.
arXiv Detail & Related papers (2024-03-08T08:12:18Z)
Harnessing Diffusion Models for Visual Perception with Meta Prompts [68.78938846041767]
We propose a simple yet effective scheme to harness a diffusion model for visual perception tasks. We introduce learnable embeddings (meta prompts) to the pre-trained diffusion models to extract proper features for perception. Our approach achieves new performance records in depth estimation tasks on NYU depth V2 and KITTI, and in semantic segmentation task on CityScapes.
arXiv Detail & Related papers (2023-12-22T14:40:55Z)
StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On [35.227896906556026]
Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process.
arXiv Detail & Related papers (2023-12-04T08:27:59Z)
3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models [102.75875255071246]
3D content creation via text-driven stylization has posed a fundamental challenge to the multimedia and graphics community. We propose a new 3DStyle-Diffusion model that triggers fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D diffusion models.
arXiv Detail & Related papers (2023-11-09T15:51:27Z)

This list is automatically generated from the titles and abstracts of the papers on this site.