Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection
- URL: http://arxiv.org/abs/2510.07654v1
- Date: Thu, 09 Oct 2025 01:13:37 GMT
- Title: Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection
- Authors: Yanjie Pan, Qingdong He, Lidong Wang, Bo Peng, Mingmin Chi
- Abstract summary: Video virtual try-on aims to replace the clothing of a person in a video with a target garment. We propose OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement.
- Score: 21.00674585489938
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video virtual try-on aims to replace the clothing of a person in a video with a target garment. Current dual-branch architectures have achieved significant success in diffusion models based on the U-Net; however, adapting them to diffusion models built upon the Diffusion Transformer remains challenging. First, injecting latent-space features from the garment reference branch requires adding to or modifying the backbone network, which introduces a large number of trainable parameters. Second, the latent-space features of garments lack inherent temporal characteristics and thus require additional learning. To address these challenges, we propose a novel approach, OIE (Once is Enough), a virtual try-on strategy based on first-frame clothing replacement: specifically, we employ an image-based clothing transfer model to replace the clothing in the initial frame, and then, under the content control of the edited first frame, use pose and mask information to guide the temporal prior of the video generation model in synthesizing the remaining frames sequentially. Experiments show that our method achieves superior parameter and computational efficiency while maintaining leading performance under these constraints.
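The two-stage strategy in the abstract maps naturally onto a short pipeline. Below is a minimal sketch of that flow, assuming hypothetical callables `image_tryon` and `video_generator` for the image-based clothing transfer model and the pose/mask-conditioned video generator; none of these names come from the authors' code.

```python
# Hypothetical sketch of the OIE two-stage strategy; `image_tryon` and
# `video_generator` are placeholder callables, not the authors' released API.

def oie_try_on(video_frames, garment_image, poses, masks,
               image_tryon, video_generator):
    """Replace the garment once in frame 0, then propagate it temporally.

    video_frames    -- list of HxWx3 frames from the source video
    garment_image   -- HxWx3 image of the target garment
    poses, masks    -- per-frame pose maps and garment-region masks
    image_tryon     -- image try-on model: f(frame, garment) -> edited frame
    video_generator -- video model conditioned on an edited first frame
                       plus pose/mask sequences
    """
    # Stage 1: one-time garment appearance injection on the first frame.
    edited_first = image_tryon(video_frames[0], garment_image)

    # Stage 2: the video model's temporal prior synthesizes the remaining
    # frames under the content control of the edited first frame; no garment
    # reference branch is added to the backbone, which is what keeps the
    # trainable parameter count small.
    rest = video_generator(
        first_frame=edited_first,
        poses=poses[1:],
        masks=masks[1:],
        num_frames=len(video_frames) - 1,
    )
    return [edited_first, *rest]
```

The design point is that garment appearance enters the pipeline exactly once, through the edited first frame, rather than through a trainable dual-branch injection into the DiT backbone.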
Related papers
- DressWild: Feed-Forward Pose-Agnostic Garment Sewing Pattern Generation from In-the-Wild Images [50.11081091174558]
This paper focuses on sewing pattern generation for garment modeling and fabrication applications. We propose DressWild, a novel feed-forward pipeline that reconstructs physics-consistent 2D sewing patterns and the corresponding 3D garments from a single in-the-wild image.
arXiv Detail & Related papers (2026-02-18T14:45:15Z)
- MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on [28.66545985357718]
Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive frames, capturing both their dynamics and interactions with human cues. Existing VVT methods still suffer from inadequate garment fidelity and limited temporal consistency. We present MagicTryOn, a diffusion transformer-based framework for garment-constrained video virtual try-on.
arXiv Detail & Related papers (2025-05-27T15:22:02Z)
- SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models [10.66567645920237]
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the garment while maintaining temporal consistency. We reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence.
arXiv Detail & Related papers (2024-12-13T14:50:26Z)
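SwiftTry's summary mentions adding temporal attention layers to an image diffusion model. The module below is a generic illustration of that idea, not the paper's exact architecture: attention runs across the time axis independently at each spatial location, with a residual connection so the pretrained image prior is preserved.

```python
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    """Self-attention over the time axis, applied independently per pixel."""

    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, channels, height, width)
        b, t, c, h, w = x.shape
        # Fold spatial positions into the batch so attention mixes frames,
        # not pixels: tokens have shape (b*h*w, t, c).
        tokens = x.permute(0, 3, 4, 1, 2).reshape(b * h * w, t, c)
        q = self.norm(tokens)
        attended, _ = self.attn(q, q, q)
        tokens = tokens + attended  # residual keeps the image prior intact
        return tokens.reshape(b, h, w, t, c).permute(0, 3, 4, 1, 2)
```

In practice such blocks are interleaved after the spatial layers of a frozen image backbone, e.g. `TemporalAttention(320)` on a `(1, 8, 320, 16, 16)` feature map; `channels` must be divisible by `num_heads`.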
- Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism [52.9091817868613]
Video try-on is a promising area with tremendous real-world potential. Previous research has primarily focused on transferring product clothing images to videos with simple human poses. We propose a novel video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On.
arXiv Detail & Related papers (2024-12-13T03:20:53Z)
- FitDiT: Advancing the Authentic Garment Details for High-fidelity Virtual Try-on [73.13242624924814]
FitDiT, a garment perception enhancement technique, is designed for high-fidelity virtual try-on using Diffusion Transformers (DiT).
We introduce a garment texture extractor that incorporates garment-prior evolution to fine-tune garment features, better capturing rich details such as stripes, patterns, and text.
We also employ a dilated-relaxed mask strategy that adapts to the correct length of garments, preventing the generation of garments that fill the entire mask area during cross-category try-on.
arXiv Detail & Related papers (2024-11-15T11:02:23Z)
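FitDiT's dilated-relaxed mask is described only at a high level above; a rough reading is that the inpainting region is expanded beyond the tight garment silhouette, so the generator can settle the true garment length itself instead of filling the whole masked area. The helper below sketches that reading with an arbitrary margin; it is an assumption-laden illustration, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import binary_dilation

def dilated_relaxed_mask(garment_mask: np.ndarray, margin: int = 15) -> np.ndarray:
    """Expand a binary garment mask by `margin` pixels in every direction,
    relaxing the region the try-on model is allowed to repaint."""
    structure = np.ones((2 * margin + 1, 2 * margin + 1), dtype=bool)
    return binary_dilation(garment_mask.astype(bool), structure=structure)
```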
- Improving Virtual Try-On with Garment-focused Diffusion Models [91.95830983115474]
Diffusion models have revolutionized generative modeling across numerous image synthesis tasks.
We present a new diffusion model, GarDiff, which drives a garment-focused diffusion process.
Experiments on VITON-HD and DressCode datasets demonstrate the superiority of our GarDiff when compared to state-of-the-art VTON approaches.
arXiv Detail & Related papers (2024-09-12T17:55:11Z)
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- ViViD: Video Virtual Try-on using Diffusion Models [46.710863047471264]
Video virtual try-on aims to transfer a clothing item onto the video of a target person.
Previous video-based try-on solutions generate only results of low visual quality with blurring artifacts.
We present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on.
arXiv Detail & Related papers (2024-05-20T05:28:22Z)
- Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework [33.46782517803435]
Make-Your-Anchor is a system requiring only a one-minute video clip of an individual for training.
We fine-tune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances.
A novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos.
arXiv Detail & Related papers (2024-03-25T07:54:18Z)
- Customize-A-Video: One-Shot Motion Customization of Text-to-Video Diffusion Models [48.56724784226513]
We propose Customize-A-Video that models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal varieties.
The proposed modules are trained in a staged pipeline and inferred in a plug-and-play fashion, enabling easy extensions to various downstream tasks.
arXiv Detail & Related papers (2024-02-22T18:38:48Z)
- MonoClothCap: Towards Temporally Coherent Clothing Capture from Monocular RGB Video [10.679773937444445]
We present a method to capture temporally coherent dynamic clothing deformation from a monocular RGB video input.
We build statistical deformation models for three types of clothing: T-shirts, short pants, and long pants.
Our method produces temporally coherent reconstruction of body and clothing from monocular video.
arXiv Detail & Related papers (2020-09-22T17:54:38Z)
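MonoClothCap's summary mentions per-category statistical deformation models. A common construction for such models, shown below as a generic sketch rather than the paper's code, is PCA over per-vertex offsets from a clothing template mesh: a low-dimensional coefficient vector then parameterizes plausible deformations.

```python
import numpy as np

def fit_deformation_model(offsets: np.ndarray, num_components: int = 10):
    """offsets: (num_samples, num_vertices * 3) vertex displacements
    from a template mesh, one row per registered training scan."""
    mean = offsets.mean(axis=0)
    # Principal directions of variation via SVD of the centered data.
    _, _, vt = np.linalg.svd(offsets - mean, full_matrices=False)
    return mean, vt[:num_components]

def deform(mean: np.ndarray, basis: np.ndarray, coeffs: np.ndarray) -> np.ndarray:
    """Reconstruct a clothing deformation from low-dimensional coefficients."""
    return mean + coeffs @ basis
```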
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of this information and is not responsible for any consequences of its use.