DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation
- URL: http://arxiv.org/abs/2506.00908v1
- Date: Sun, 01 Jun 2025 08:52:57 GMT
- Title: DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation
- Authors: Xianbing Sun, Yan Hong, Jiahui Zhan, Jun Lan, Huijia Zhu, Weiqiang Wang, Liqing Zhang, Jianfu Zhang
- Abstract summary: DS-VTON is a dual-scale virtual try-on framework that disentangles structural alignment from texture refinement for more effective modeling. Our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks.
- Score: 38.499761393356124
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Despite recent progress, most existing virtual try-on methods still struggle to simultaneously address two core challenges: accurately aligning the garment image with the target human body, and preserving fine-grained garment textures and patterns. In this paper, we propose DS-VTON, a dual-scale virtual try-on framework that explicitly disentangles these objectives for more effective modeling. DS-VTON consists of two stages: the first stage generates a low-resolution try-on result to capture the semantic correspondence between garment and body, where reduced detail facilitates robust structural alignment. The second stage introduces a residual-guided diffusion process that reconstructs high-resolution outputs by refining the residual between the two scales, focusing on texture fidelity. In addition, our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks. By leveraging the semantic priors embedded in pretrained diffusion models, this design more effectively preserves the person's appearance and geometric consistency. Extensive experiments demonstrate that DS-VTON achieves state-of-the-art performance in both structural alignment and texture preservation across multiple standard virtual try-on benchmarks.
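As a reading aid, here is a minimal PyTorch-style sketch of the two-stage pipeline the abstract describes. All names (`ds_vton_sketch`, `low_res_model`, `refiner`) are our own illustrative assumptions, and the single additive residual pass only stands in for the paper's residual-guided diffusion process:

```python
# Hypothetical sketch of the dual-scale pipeline described in the abstract.
# Names and interfaces are illustrative assumptions, not the authors' code.
import torch
import torch.nn.functional as F


def ds_vton_sketch(person, garment, low_res_model, refiner, scale=4):
    """person, garment: (B, 3, H, W) images in [-1, 1]."""
    _, _, h, w = person.shape

    # Stage 1: low-resolution try-on. Reduced detail makes it easier to
    # capture the semantic correspondence between garment and body.
    person_lr = F.interpolate(person, scale_factor=1 / scale, mode="bilinear")
    garment_lr = F.interpolate(garment, scale_factor=1 / scale, mode="bilinear")
    tryon_lr = low_res_model(person_lr, garment_lr)

    # Stage 2: upsample the coarse result and predict the residual between
    # the two scales, concentrating capacity on texture fidelity.
    tryon_up = F.interpolate(tryon_lr, size=(h, w), mode="bilinear")
    residual = refiner(tryon_up, garment)
    return (tryon_up + residual).clamp(-1.0, 1.0)


if __name__ == "__main__":
    # Stand-in callables so the sketch runs end to end; in practice both
    # stages would be garment-conditioned diffusion samplers.
    stage1 = lambda p, g: (p + g) / 2
    stage2 = lambda up, g: torch.zeros_like(up)
    person = torch.rand(1, 3, 256, 192) * 2 - 1
    garment = torch.rand(1, 3, 256, 192) * 2 - 1
    print(ds_vton_sketch(person, garment, stage1, stage2).shape)
    # -> torch.Size([1, 3, 256, 192])
```

The stand-in callables keep the sketch runnable; in the paper both stages are diffusion models, and the second stage refines the residual inside the denoising loop rather than in a single forward pass.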
Related papers
- Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off [9.45991209383675]
We propose Voost - a unified framework that jointly learns virtual try-on and try-off with a single diffusion transformer.
Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
arXiv Detail & Related papers (2025-08-06T19:10:58Z) - OmniVTON: Training-Free Universal Virtual Try-On [53.31945401098557]
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches or unsupervised in-the-wild methods; the latter improve adaptability but remain constrained by data biases and limited universality.
We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings.
arXiv Detail & Related papers (2025-07-20T16:37:53Z) - DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On [3.5655800569257896]
Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion.
We propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on.
arXiv Detail & Related papers (2025-06-29T15:31:42Z) - HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment [11.00877062567135]
We propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses.
HF-VTON consists of three key modules: the Appearance-Preserving Warp Alignment Module, the Semantic Representation Module, and the Multimodal Prior-Guided Appearance Generation Module.
Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS.
arXiv Detail & Related papers (2025-05-26T07:55:49Z) - UniViTAR: Unified Vision Transformer with Native Resolution [37.63387029787732]
We introduce UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modalities and native-resolution scenarios.
A progressive training paradigm is introduced, which strategically combines two core mechanisms.
In parallel, a hybrid training framework further synergizes a sigmoid-based contrastive loss with feature distillation from a frozen teacher model.
arXiv Detail & Related papers (2025-04-02T14:59:39Z) - Aligning Foundation Model Priors and Diffusion-Based Hand Interactions for Occlusion-Resistant Two-Hand Reconstruction [50.952228546326516]
Two-hand reconstruction from monocular images faces persistent challenges due to complex and dynamic hand postures and occlusions.
Existing approaches struggle with such alignment issues, often resulting in misalignment and penetration artifacts.
We propose a novel framework that attempts to precisely align hand poses and interactions by integrating foundation model-driven 2D priors with diffusion-based interaction refinement.
arXiv Detail & Related papers (2025-03-22T14:42:27Z) - One-Step Diffusion Model for Image Motion-Deblurring [85.76149042561507]
We propose a one-step diffusion model for deblurring (OSDD), a novel framework that reduces the denoising process to a single step.
To tackle fidelity loss in diffusion models, we introduce an enhanced variational autoencoder (eVAE), which improves structural restoration.
Our method achieves strong performance on both full-reference and no-reference metrics.
arXiv Detail & Related papers (2025-03-09T09:39:57Z) - 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On [17.226542332700607]
We propose a novel single-network VTON method that overcomes the limitations of existing techniques.
Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image, and video inputs.
Our results suggest that the single-network paradigm can rival the performance of dual-network approaches.
arXiv Detail & Related papers (2025-01-09T16:49:04Z) - D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On [32.73798955587999]
D$^4$-VTON is an innovative solution for image-based virtual try-on.
We address challenges from previous studies, such as semantic inconsistencies before and after garment warping.
arXiv Detail & Related papers (2024-07-21T10:40:53Z) - StableDreamer: Taming Noisy Score Distillation Sampling for Text-to-3D [88.66678730537777]
We present StableDreamer, a methodology incorporating three advances.
First, we formalize the equivalence of the SDS generative prior and a simple supervised L2 reconstruction loss (a standard form of this equivalence is sketched after this list).
Second, our analysis shows that while image-space diffusion contributes to geometric precision, latent-space diffusion is crucial for vivid color rendition.
arXiv Detail & Related papers (2023-12-02T02:27:58Z) - D-SCo: Dual-Stream Conditional Diffusion for Monocular Hand-Held Object Reconstruction [74.49121940466675]
We introduce centroid-fixed dual-stream conditional diffusion for monocular hand-held object reconstruction.
First, to prevent the object centroid from deviating, we utilize a novel hand-constrained centroid-fixing paradigm.
Second, we introduce a dual-stream denoiser to semantically and geometrically model hand-object interactions.
arXiv Detail & Related papers (2023-11-23T20:14:50Z)