Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
- URL: http://arxiv.org/abs/2508.04825v1
- Date: Wed, 06 Aug 2025 19:10:58 GMT
- Title: Voost: A Unified and Scalable Diffusion Transformer for Bidirectional Virtual Try-On and Try-Off
- Authors: Seungyong Lee, Jeong-gi Kwak,
- Abstract summary: We propose Voost - a unified framework that jointly learns virtual try-on and try-off with a single diffusion transformer. Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
- Score: 9.45991209383675
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Virtual try-on aims to synthesize a realistic image of a person wearing a target garment, but accurately modeling garment-body correspondence remains a persistent challenge, especially under pose and appearance variation. In this paper, we propose Voost - a unified and scalable framework that jointly learns virtual try-on and try-off with a single diffusion transformer. By modeling both tasks jointly, Voost enables each garment-person pair to supervise both directions and supports flexible conditioning over generation direction and garment category, enhancing garment-body relational reasoning without task-specific networks, auxiliary losses, or additional labels. In addition, we introduce two inference-time techniques: attention temperature scaling for robustness to resolution or mask variation, and self-corrective sampling that leverages bidirectional consistency between tasks. Extensive experiments demonstrate that Voost achieves state-of-the-art results on both try-on and try-off benchmarks, consistently outperforming strong baselines in alignment accuracy, visual fidelity, and generalization.
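The two inference-time techniques are only described at a high level in the abstract. As a rough illustration of the first, here is a minimal sketch assuming a generic length-dependent temperature rule (tau = log L_test / log L_train), a common heuristic for keeping attention entropy roughly stable when resolution changes; the function name, the scaling formula, and the tensor layout are illustrative assumptions, not Voost's actual implementation.

```python
# Sketch of inference-time attention temperature scaling, assuming a
# length-dependent rule (tau = log L_test / log L_train). Voost's exact
# scheme is not given in the abstract, so everything here is illustrative.
import math
import torch

def attention_with_temperature(q, k, v, train_tokens):
    """Dot-product attention whose softmax temperature is rescaled when the
    inference token count (i.e. resolution) differs from training.

    q, k, v: tensors of shape (batch, heads, tokens, head_dim).
    train_tokens: number of tokens the model was trained with.
    """
    tokens, head_dim = q.shape[-2], q.shape[-1]
    # Hypothetical rule: grow the logit scale with log(tokens) so attention
    # entropy stays roughly constant as the token count increases.
    tau = math.log(tokens) / math.log(train_tokens)
    logits = (q @ k.transpose(-2, -1)) * (tau / math.sqrt(head_dim))
    return logits.softmax(dim=-1) @ v

# Usage: a model trained at 64 tokens, queried at 256 tokens.
q = k = v = torch.randn(1, 8, 256, 64)
out = attention_with_temperature(q, k, v, train_tokens=64)
```

The second technique, self-corrective sampling, would analogously exploit the bidirectional setup: run try-off on a generated try-on result (or vice versa) and resample when the round trip disagrees with the conditioning garment.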
Related papers
- OmniVTON: Training-Free Universal Virtual Try-On [53.31945401098557]
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings.
arXiv Detail & Related papers (2025-07-20T16:37:53Z)
- DiffFit: Disentangled Garment Warping and Texture Refinement for Virtual Try-On [3.5655800569257896]
Virtual try-on (VTON) aims to synthesize realistic images of a person wearing a target garment, with broad applications in e-commerce and digital fashion. We propose DiffFit, a novel two-stage latent diffusion framework for high-fidelity virtual try-on.
arXiv Detail & Related papers (2025-06-29T15:31:42Z)
- DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation [38.499761393356124]
DS-VTON is a dual-scale virtual try-on framework that disentangles objectives for more effective modeling. Our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks.
arXiv Detail & Related papers (2025-06-01T08:52:57Z)
- HF-VTON: High-Fidelity Virtual Try-On via Consistent Geometric and Semantic Alignment [11.00877062567135]
We propose HF-VTON, a novel framework that ensures high-fidelity virtual try-on performance across diverse poses. HF-VTON consists of three key modules: the Appearance-Preserving Warp Alignment Module, the Semantic Representation Module, and the Multimodal Prior-Guided Appearance Generation Module. Experimental results demonstrate that HF-VTON outperforms state-of-the-art methods on both VITON-HD and SAMP-VTONS.
arXiv Detail & Related papers (2025-05-26T07:55:49Z)
- Incorporating Visual Correspondence into Diffusion Model for Virtual Try-On [89.9123806553489]
Diffusion models have shown success in the virtual try-on (VTON) task. It remains challenging, however, to preserve the shape and every detail of the given garment due to the intrinsic stochasticity of the diffusion model. We propose to explicitly capitalize on visual correspondence as a prior to tame the diffusion process.
arXiv Detail & Related papers (2025-05-22T17:52:13Z)
- UniViTAR: Unified Vision Transformer with Native Resolution [37.63387029787732]
We introduce UniViTAR, a family of homogeneous vision foundation models tailored for unified visual modality and native resolution scenarios. A progressive training paradigm is introduced, which strategically combines two core mechanisms. In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model.
arXiv Detail & Related papers (2025-04-02T14:59:39Z)
- Hierarchical Cross-Attention Network for Virtual Try-On [59.50297858307268]
We present an innovative solution to the challenges of the virtual try-on task: our novel Hierarchical Cross-Attention Network (HCANet).
HCANet is crafted with two primary stages: geometric matching and try-on, each playing a crucial role in delivering realistic virtual try-on outcomes.
A key feature of HCANet is the incorporation of a novel Hierarchical Cross-Attention (HCA) block into both stages, enabling the effective capture of long-range correlations between individual and clothing modalities.
arXiv Detail & Related papers (2024-11-23T12:39:58Z)
- Improving Diffusion Models for Authentic Virtual Try-on in the Wild [53.96244595495942]
This paper considers image-based virtual try-on, which renders an image of a person wearing a curated garment.
We propose a novel diffusion model that improves garment fidelity and generates authentic virtual try-on images.
We present a customization method using a pair of person-garment images, which significantly improves fidelity and authenticity.
arXiv Detail & Related papers (2024-03-08T08:12:18Z)
- Drivable Volumetric Avatars using Texel-Aligned Features [52.89305658071045]
Photorealistic telepresence requires both high-fidelity body modeling and faithful driving to enable dynamically synthesized appearance.
We propose an end-to-end framework that addresses two core challenges in modeling and driving full-body avatars of real people.
arXiv Detail & Related papers (2022-07-20T09:28:16Z)
- Single Stage Virtual Try-on via Deformable Attention Flows [51.70606454288168]
Virtual try-on aims to generate a photo-realistic fitting result given an in-shop garment and a reference person image.
We develop a novel Deformable Attention Flow (DAFlow) which applies the deformable attention scheme to multi-flow estimation.
Our proposed method achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2022-07-19T10:01:31Z)
- Cloth Interactive Transformer for Virtual Try-On [106.21605249649957]
We propose a novel two-stage cloth interactive transformer (CIT) method for the virtual try-on task.
In the first stage, we design a CIT matching block, aiming to precisely capture the long-range correlations between the cloth-agnostic person information and the in-shop cloth information.
In the second stage, we put forth a CIT reasoning block for establishing global mutual interactive dependencies among person representation, the warped clothing item, and the corresponding warped cloth mask.
arXiv Detail & Related papers (2021-04-12T14:45:32Z)