MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
- URL: http://arxiv.org/abs/2505.21325v2
- Date: Wed, 28 May 2025 12:45:16 GMT
- Title: MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
- Authors: Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang,
- Abstract summary: We propose MagicTryOn, a video virtual try-on framework built upon a large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and use full self-attention to jointly model the spatiotemporal consistency of videos. Our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
- Score: 16.0505428363005
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Virtual Try-On (VVT) aims to simulate the natural appearance of garments across consecutive video frames, capturing their dynamic variations and interactions with human body motion. However, current VVT methods still face challenges in terms of spatiotemporal consistency and garment content preservation. First, they use diffusion models based on the U-Net, which are limited in their expressive capability and struggle to reconstruct complex details. Second, they adopt a separative modeling approach for spatial and temporal attention, which hinders the effective capture of structural relationships and dynamic consistency across frames. Third, their expression of garment details remains insufficient, affecting the realism and stability of the overall synthesized results, especially during human motion. To address the above challenges, we propose MagicTryOn, a video virtual try-on framework built upon the large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and combine full self-attention to jointly model the spatiotemporal consistency of videos. We design a coarse-to-fine garment preservation strategy. The coarse strategy integrates garment tokens during the embedding stage, while the fine strategy incorporates multiple garment-based conditions, such as semantics, textures, and contour lines during the denoising stage. Moreover, we introduce a mask-aware loss to further optimize garment region fidelity. Extensive experiments on both image and video try-on datasets demonstrate that our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
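The abstract mentions a mask-aware loss that further optimizes garment-region fidelity. The paper does not give the formula here, but a common way to realize such a loss is a standard diffusion MSE plus an extra penalty restricted to the garment mask. A minimal NumPy sketch under that assumption (function name, shapes, and the weight `lam` are illustrative, not from the paper):

```python
import numpy as np

def mask_aware_loss(pred_noise, true_noise, garment_mask, lam=2.0):
    """Sketch of a mask-aware diffusion loss: global MSE over all pixels,
    plus an extra term (weighted by `lam`) computed only inside the garment
    region, pushing the model harder to reconstruct garment details.

    pred_noise, true_noise: (T, H, W, C) predicted / ground-truth noise
    garment_mask:           (T, H, W) binary mask, 1 inside the garment
    """
    sq_err = (pred_noise - true_noise) ** 2          # per-element squared error
    base = sq_err.mean()                             # standard diffusion MSE
    m = garment_mask[..., None]                      # broadcast mask over channels
    garment = (sq_err * m).sum() / (m.sum() + 1e-8)  # MSE restricted to garment pixels
    return base + lam * garment
```

Errors inside the mask count twice (once in the global term, once in the garment term), so garment pixels dominate the gradient without the loss ignoring the rest of the frame.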
Related papers
- Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer [64.49436559408049]
We present a novel method for generating 3D garment deformations from given body poses. Our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.
arXiv Detail & Related papers (2025-12-05T10:28:08Z) - Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection [21.00674585489938]
Video virtual try-on aims to replace the clothing of a person in a video with a target garment. We propose OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement.
arXiv Detail & Related papers (2025-10-09T01:13:37Z) - Stable Video-Driven Portraits [52.008400639227034]
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
arXiv Detail & Related papers (2025-09-22T08:11:08Z) - DualFit: A Two-Stage Virtual Try-On via Warping and Synthesis [8.082593574401704]
We propose DualFit to preserve fine-grained garment details such as logos and printed text elements. In the first stage, DualFit warps the target garment to align with the person image using a learned flow field. In the second stage, a high-fidelity try-on module synthesizes the final output by blending the warped garment with preserved human regions.
arXiv Detail & Related papers (2025-08-16T18:50:31Z) - DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are…
arXiv Detail & Related papers (2025-08-04T18:27:55Z) - ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On [19.565037902386475]
Video virtual try-on aims to seamlessly replace the clothing of a person in a video with a target garment. Existing approaches still struggle to maintain continuity and reproduce garment details. ChronoTailor, a diffusion-based framework, generates temporally consistent videos while preserving fine-grained garment details.
arXiv Detail & Related papers (2025-06-06T08:26:39Z) - Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction [142.66410908560582]
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. We propose Dynamic Pose Interaction Diffusion Models (DPIDM) to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency.
arXiv Detail & Related papers (2025-05-22T17:52:34Z) - CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation [75.10635392993748]
We introduce CatV2TON, a vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing.
arXiv Detail & Related papers (2025-01-20T08:09:36Z) - RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency [26.410982262831975]
RealVVT is a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks.
arXiv Detail & Related papers (2025-01-15T09:22:38Z) - ODPG: Outfitting Diffusion with Pose Guided Condition [2.5602836891933074]
Virtual try-on (VTON) technology allows users to visualize how clothes would look on them without physically trying them on. Traditional VTON methods, often using Generative Adversarial Networks (GANs) and diffusion models, face challenges in achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process.
arXiv Detail & Related papers (2025-01-12T10:30:27Z) - STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution [42.859188375578604]
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. These models struggle to maintain temporal consistency, as they are trained on static images. We introduce a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency.
arXiv Detail & Related papers (2025-01-06T12:36:21Z) - Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism [52.9091817868613]
Video try-on is a promising area for its tremendous real-world potential. Previous research has primarily focused on transferring product clothing images to videos with simple human poses. We propose a novel video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On.
arXiv Detail & Related papers (2024-12-13T03:20:53Z) - FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on [73.13242624924814]
FitDiT, a garment perception enhancement technique, is designed for high-fidelity virtual try-on using Diffusion Transformers (DiT).
We introduce a garment texture extractor that incorporates garment priors evolution to fine-tune garment features, better capturing rich details such as stripes, patterns, and text.
We also employ a dilated-relaxed mask strategy that adapts to the correct length of garments, preventing the generation of garments that fill the entire mask area during cross-category try-on.
arXiv Detail & Related papers (2024-11-15T11:02:23Z) - Improving Virtual Try-On with Garment-focused Diffusion Models [91.95830983115474]
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks.
We shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process.
Experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
arXiv Detail & Related papers (2024-09-12T17:55:11Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon [5.790630195329777]
We introduce a novel graph based warping technique which emphasizes the value of context in garment flow.
Our method, validated on the VITON-HD and DressCode datasets, showcases substantial improvement in garment warping, texture preservation, and overall realism.
arXiv Detail & Related papers (2024-06-04T10:29:18Z) - AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario [50.62711489896909]
AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a large gap.
AnyFit's impressive performance on high-fidelity virtual try-ons in any scenario from any image, paves a new path for future research within the fashion community.
arXiv Detail & Related papers (2024-05-28T13:33:08Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on [81.15988741258683]
Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person.
Current methods often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments.
We propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism.
arXiv Detail & Related papers (2023-12-06T18:34:32Z) - ClothFormer: Taming Video Virtual Try-on in All Module [12.084652803378598]
Video virtual try-on aims to fit the target clothes to a person in the video with spatio-temporally consistent results.
ClothFormer framework successfully synthesizes realistic, temporal consistent results in complicated environment.
arXiv Detail & Related papers (2022-04-26T08:40:28Z)
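A recurring theme across MagicTryOn and several entries above is replacing separate spatial and temporal attention with full self-attention over all video tokens at once. A minimal NumPy sketch of that joint spatiotemporal attention (single-head, unprojected; all names and shapes are illustrative and not taken from any of the papers):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_spatiotemporal_attention(video_tokens):
    """Joint attention over ALL tokens of a video clip.

    video_tokens: (T, N, D) array — T frames, N spatial tokens per frame,
    D channels. Instead of attending within each frame (spatial) and then
    across frames (temporal) in separate passes, time and space are
    flattened together so every token attends to every other token,
    capturing structural relationships and motion in one attention map.
    """
    T, N, D = video_tokens.shape
    x = video_tokens.reshape(T * N, D)      # flatten time and space together
    attn = softmax(x @ x.T / np.sqrt(D))    # (T*N, T*N) joint attention map
    return (attn @ x).reshape(T, N, D)      # restore the (T, N, D) layout
```

The cost is quadratic in T·N rather than in N and T separately, which is why this design is typically paired with large-scale video DiT backbones rather than U-Nets.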
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.