MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
- URL: http://arxiv.org/abs/2505.21325v3
- Date: Sat, 27 Sep 2025 10:43:45 GMT
- Title: MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on
- Authors: Guangyuan Li, Siming Zheng, Hao Zhang, Jinwei Chen, Junsheng Luan, Binkai Ou, Lei Zhao, Bo Li, Peng-Tao Jiang
- Abstract summary: Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive frames, capturing both their dynamics and interactions with human cues. Existing VVT methods still suffer from inadequate garment fidelity and limited spatiotemporal consistency. We present MagicTryOn, a diffusion-transformer-based framework for garment-preserving video virtual try-on.
- Score: 28.66545985357718
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive video frames, capturing both their dynamics and interactions with human motion. Despite recent progress, existing VVT methods still suffer from inadequate garment fidelity and limited spatiotemporal consistency. The reasons are: (1) under-exploitation of garment information, with limited garment cues being injected, resulting in weaker fine-detail fidelity; and (2) a lack of spatiotemporal modeling, which hampers cross-frame identity consistency and causes temporal jitter and appearance drift. In this paper, we present MagicTryOn, a diffusion-transformer based framework for garment-preserving video virtual try-on. To preserve fine-grained garment details, we propose a fine-grained garment-preservation strategy that disentangles garment cues and injects these decomposed priors into the denoising process. To improve temporal garment consistency and suppress jitter, we introduce a garment-aware spatiotemporal rotary positional embedding (RoPE) that extends RoPE within full self-attention, using spatiotemporal relative positions to modulate garment tokens. We further impose a mask-aware loss during training to enhance fidelity within garment regions. Moreover, we adopt distribution-matching distillation to compress the sampling trajectory to four steps, enabling real-time inference without degrading garment fidelity. Extensive quantitative and qualitative experiments demonstrate that MagicTryOn outperforms existing methods, delivering superior garment-detail fidelity and temporal stability in unconstrained settings.
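The garment-aware spatiotemporal RoPE described in the abstract builds on standard rotary positional embedding, which rotates channel pairs of query/key vectors by position-dependent angles so that attention scores depend only on relative positions. A minimal numpy sketch of the general idea (not the paper's exact formulation) is below; the three-way split of the head dimension across frame, row, and column axes is one common way to extend RoPE to video tokens and is an assumption here:

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Standard 1-D RoPE: rotate consecutive channel pairs of x (n, d)
    by angles that grow with token position and shrink with frequency index."""
    d = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, d, 2) / d))
    ang = np.outer(np.atleast_1d(positions), inv_freq)  # (n, d/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[..., 1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

def st_rope(x, t, h, w):
    """Spatiotemporal RoPE sketch: split the head dim into three even chunks
    and rotate each chunk by its own axis position (frame t, row h, column w)."""
    d = x.shape[-1]
    assert d % 6 == 0, "head dim must split into three even chunks"
    c = d // 3
    parts = [apply_rope(x[..., i * c:(i + 1) * c], p)
             for i, p in enumerate((t, h, w))]
    return np.concatenate(parts, axis=-1)
```

Because each rotation is orthogonal, token norms are preserved, and the dot product between a rotated query and key depends only on the per-axis position offsets, which is the property that lets full self-attention reason about spatiotemporal relative positions.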
Related papers
- Learning High-Fidelity Cloth Animation via Skinning-Free Image Transfer [64.49436559408049]
We present a novel method for generating 3D garment deformations from given body poses. Our method significantly improves animation quality on various garment types and recovers finer wrinkles than state-of-the-art methods.
arXiv Detail & Related papers (2025-12-05T10:28:08Z) - Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection [21.00674585489938]
Video virtual try-on aims to replace the clothing of a person in a video with a target garment. We propose OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement.
arXiv Detail & Related papers (2025-10-09T01:13:37Z) - Stable Video-Driven Portraits [52.008400639227034]
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
arXiv Detail & Related papers (2025-09-22T08:11:08Z) - DualFit: A Two-Stage Virtual Try-On via Warping and Synthesis [8.082593574401704]
We propose DualFit to preserve fine-grained garment details such as logos and printed text elements. In the first stage, DualFit warps the target garment to align with the person image using a learned flow field. In the second stage, a high-fidelity try-on module synthesizes the final output by blending the warped garment with preserved human regions.
arXiv Detail & Related papers (2025-08-16T18:50:31Z) - DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are
arXiv Detail & Related papers (2025-08-04T18:27:55Z) - ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On [19.565037902386475]
Video virtual try-on aims to seamlessly replace the clothing of a person in a video with a target garment. Existing approaches still struggle to maintain continuity and reproduce garment details. ChronoTailor, a diffusion-based framework, generates temporally consistent videos while preserving fine-grained garment details.
arXiv Detail & Related papers (2025-06-06T08:26:39Z) - Pursuing Temporal-Consistent Video Virtual Try-On via Dynamic Pose Interaction [142.66410908560582]
Video virtual try-on aims to seamlessly dress a subject in a video with a specific garment. We propose Dynamic Pose Interaction Diffusion Models (DPIDM) to leverage diffusion models to delve into dynamic pose interactions for video virtual try-on. DPIDM capitalizes on a temporal regularized attention loss between consecutive frames to enhance temporal consistency.
arXiv Detail & Related papers (2025-05-22T17:52:34Z) - CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation [75.10635392993748]
We introduce CatV2TON, a vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance. We also present ViViD-S, a refined video try-on dataset, achieved by filtering back-facing frames and applying 3D mask smoothing.
arXiv Detail & Related papers (2025-01-20T08:09:36Z) - RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency [26.410982262831975]
RealVVT is a photoRealistic Video Virtual Try-on framework tailored to bolster stability and realism within dynamic video contexts. Our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks.
arXiv Detail & Related papers (2025-01-15T09:22:38Z) - ODPG: Outfitting Diffusion with Pose Guided Condition [2.5602836891933074]
VTON technology allows users to visualize how clothes would look on them without physically trying them on. Traditional VTON methods, often using Generative Adversarial Networks (GANs) and Diffusion models, face challenges in achieving high realism and handling dynamic poses. This paper introduces Outfitting Diffusion with Pose Guided Condition (ODPG), a novel approach that leverages a latent diffusion model with multiple conditioning inputs during the denoising process.
arXiv Detail & Related papers (2025-01-12T10:30:27Z) - STAR: Spatial-Temporal Augmentation with Text-to-Video Models for Real-World Video Super-Resolution [42.859188375578604]
Image diffusion models have been adapted for real-world video super-resolution to tackle over-smoothing issues in GAN-based methods. These models struggle to maintain temporal consistency, as they are trained on static images. We introduce a novel approach that leverages T2V models for real-world video super-resolution, achieving realistic spatial details and robust temporal consistency.
arXiv Detail & Related papers (2025-01-06T12:36:21Z) - Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism [52.9091817868613]
Video try-on is a promising area for its tremendous real-world potential. Previous research has primarily focused on transferring product clothing images to videos with simple human poses. We propose a novel video try-on framework based on Diffusion Transformer (DiT), named Dynamic Try-On.
arXiv Detail & Related papers (2024-12-13T03:20:53Z) - FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on [73.13242624924814]
FitDiT, a garment perception enhancement technique, is designed for high-fidelity virtual try-on using Diffusion Transformers (DiT).
We introduce a garment texture extractor that incorporates garment-prior evolution to fine-tune garment features, better capturing rich details such as stripes, patterns, and text.
We also employ a dilated-relaxed mask strategy that adapts to the correct length of garments, preventing the generation of garments that fill the entire mask area during cross-category try-on.
arXiv Detail & Related papers (2024-11-15T11:02:23Z) - Improving Virtual Try-On with Garment-focused Diffusion Models [91.95830983115474]
Diffusion models have revolutionized generative modeling in numerous image synthesis tasks.
We shape a new Diffusion model, namely GarDiff, which triggers the garment-focused diffusion process.
Experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
arXiv Detail & Related papers (2024-09-12T17:55:11Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - GraVITON: Graph based garment warping with attention guided inversion for Virtual-tryon [5.790630195329777]
We introduce a novel graph based warping technique which emphasizes the value of context in garment flow.
Our method, validated on VITON-HD and Dresscode datasets, showcases substantial improvement in garment warping, texture preservation, and overall realism.
arXiv Detail & Related papers (2024-06-04T10:29:18Z) - AnyFit: Controllable Virtual Try-on for Any Combination of Attire Across Any Scenario [50.62711489896909]
AnyFit surpasses all baselines on high-resolution benchmarks and real-world data by a large gap.
AnyFit's impressive performance on high-fidelity virtual try-ons in any scenario from any image, paves a new path for future research within the fashion community.
arXiv Detail & Related papers (2024-05-28T13:33:08Z) - Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution [65.91317390645163]
Upscale-A-Video is a text-guided latent diffusion framework for video upscaling.
It ensures temporal coherence through two key mechanisms: locally, it integrates temporal layers into U-Net and VAE-Decoder, maintaining consistency within short sequences.
It also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation.
arXiv Detail & Related papers (2023-12-11T18:54:52Z) - WarpDiffusion: Efficient Diffusion Model for High-Fidelity Virtual Try-on [81.15988741258683]
Image-based Virtual Try-On (VITON) aims to transfer an in-shop garment image onto a target person.
Current methods often overlook the synthesis quality around the garment-skin boundary and realistic effects like wrinkles and shadows on the warped garments.
We propose WarpDiffusion, which bridges the warping-based and diffusion-based paradigms via a novel informative and local garment feature attention mechanism.
arXiv Detail & Related papers (2023-12-06T18:34:32Z) - ClothFormer: Taming Video Virtual Try-on in All Module [12.084652803378598]
Video virtual try-on aims to fit the target clothes to a person in the video with spatio-temporally consistent results.
ClothFormer framework successfully synthesizes realistic, temporal consistent results in complicated environment.
arXiv Detail & Related papers (2022-04-26T08:40:28Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.