DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework
- URL: http://arxiv.org/abs/2508.02807v1
- Date: Mon, 04 Aug 2025 18:27:55 GMT
- Title: DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework
- Authors: Tongchun Zuo, Zaiyu Huang, Shuliang Ning, Ente Lin, Chao Liang, Zerong Zheng, Jianwen Jiang, Yuan Zhang, Mingyuan Gao, Xin Dong,
- Abstract summary: Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content and fed, along with the keyframe try-on images, into a pretrained video generation model enhanced with LoRA adapters.
- Score: 26.661935208583756
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. However, most existing end-to-end methods rely heavily on scarce paired garment-centric datasets and fail to effectively leverage priors of advanced visual models and test-time inputs, making it challenging to accurately preserve fine-grained garment details and maintain temporal consistency in unconstrained scenarios. To address these challenges, we propose DreamVVT, a carefully designed two-stage framework built upon Diffusion Transformers (DiTs), which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. To further leverage prior knowledge from pretrained models and test-time inputs, in the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent keyframe try-on images. These images serve as complementary appearance guidance for subsequent video generation. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are extracted from the input content, and these, along with the keyframe try-on images, are then fed into a pretrained video generation model enhanced with LoRA adapters. This ensures long-term temporal coherence for unseen regions and enables highly plausible dynamic motions. Extensive quantitative and qualitative experiments demonstrate that DreamVVT surpasses existing methods in preserving detailed garment content and temporal stability in real-world scenarios. Our project page is available at https://virtu-lab.github.io/
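To make the two-stage design concrete, here is a minimal, hypothetical sketch of the pipeline described in the abstract. Every function, class, and parameter name below is an illustrative assumption rather than the authors' implementation, and the uniform keyframe sampling merely stands in for whatever selection strategy the paper actually uses.

```python
# Hypothetical sketch of the two-stage DreamVVT pipeline described in the abstract.
# All names, signatures, and the keyframe-selection heuristic are placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    index: int
    image: object         # e.g. an H x W x 3 array of the source video frame
    skeleton_map: object  # pose keypoints rendered as an image


def sample_keyframes(frames: List[Frame], num_keyframes: int = 4) -> List[Frame]:
    """Stage 1a: pick representative frames (uniform sampling as a stand-in)."""
    step = max(1, len(frames) // num_keyframes)
    return frames[::step][:num_keyframes]


def stage1_keyframe_tryon(keyframes, garment_image, vlm, tryon_model):
    """Stage 1b: multi-frame try-on guided by VLM-generated appearance captions."""
    caption = vlm.describe(garment_image)                 # hypothetical VLM API
    return tryon_model.generate(keyframes, garment_image, caption)


def stage2_video_generation(frames, tryon_keyframes, vlm, video_model):
    """Stage 2: condition a pretrained video model (with LoRA adapters) on skeleton
    maps, motion/appearance descriptions, and the stage-1 try-on keyframes."""
    skeletons = [f.skeleton_map for f in frames]
    motion_desc = vlm.describe_motion(frames)             # hypothetical VLM API
    return video_model.generate(skeleton_maps=skeletons,
                                appearance_refs=tryon_keyframes,
                                text=motion_desc)
```

The key point of the design, as stated in the abstract, is that stage 1 produces appearance anchors (the keyframe try-on images) that stage 2 only has to propagate coherently over time, rather than re-synthesizing the garment independently at every frame.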
Related papers
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations. A toy sketch of this joint-compression idea follows this entry.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
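As a rough illustration of "jointly compresses video and text into unified latent representations", the toy module below cross-attends a small set of learned latent queries over concatenated video and text tokens. It sketches only the compression (encoder) side under assumed dimensions; it is not the FlexFormer architecture.

```python
# Toy illustration (not LoViC/FlexFormer) of compressing video and text tokens
# into a shared latent sequence. All dimensions and module choices are assumptions.
import torch
import torch.nn as nn


class JointCompressor(nn.Module):
    def __init__(self, dim: int = 256, num_latents: int = 16):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, dim))  # learned queries
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, video_tokens, text_tokens):
        # video_tokens: (B, Nv, dim), text_tokens: (B, Nt, dim)
        context = torch.cat([video_tokens, text_tokens], dim=1)      # joint context
        queries = self.latents.unsqueeze(0).expand(context.size(0), -1, -1)
        latent, _ = self.attn(queries, context, context)             # cross-attend
        return latent                                                # (B, num_latents, dim)


compressed = JointCompressor()(torch.randn(2, 64, 256), torch.randn(2, 8, 256))
```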
- MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on [16.0505428363005]
We propose MagicTryOn, a video virtual try-on framework built upon a large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and use full self-attention to model the garment consistency of videos. Our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios. A toy sketch of this attention layout follows this entry.
arXiv Detail & Related papers (2025-05-27T15:22:02Z)
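The summary above mentions replacing the U-Net with a diffusion Transformer and using full self-attention to keep the garment consistent across frames. The snippet below is only a toy demonstration of that attention layout (garment tokens and video tokens fused into one sequence), with assumed shapes and a stock PyTorch layer rather than the paper's DiT blocks.

```python
# Toy sketch: full self-attention over concatenated garment and video tokens,
# so every video token can attend to every garment token (and vice versa).
# Shapes are assumptions; this is not the MagicTryOn implementation.
import torch
import torch.nn as nn

dim = 128
garment_tokens = torch.randn(1, 32, dim)       # tokens from the garment image
video_tokens = torch.randn(1, 16 * 64, dim)    # tokens from 16 video frames

block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
tokens = torch.cat([garment_tokens, video_tokens], dim=1)  # one joint sequence
tokens = block(tokens)                                     # full self-attention
video_out = tokens[:, garment_tokens.size(1):]             # video tokens, now
                                                           # conditioned on the garment
```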
- SkyReels-V2: Infinite-length Film Generative Model [35.00453687783287]
We propose SkyReels-V2, an infinite-length film generative model that synergizes a Multi-modal Large Language Model (MLLM), multi-stage pretraining, reinforcement learning, and a diffusion forcing framework. We establish progressive-resolution pretraining for fundamental video generation, followed by a four-stage post-training enhancement.
arXiv Detail & Related papers (2025-04-17T16:37:27Z)
- CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation [75.10635392993748]
We introduce CatV2TON, a vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks. By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance. We also present ViViD-S, a refined video try-on dataset obtained by filtering back-facing frames and applying 3D mask smoothing. A small sketch of this temporal concatenation follows this entry.
arXiv Detail & Related papers (2025-01-20T08:09:36Z)
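"Temporally concatenating garment and person inputs" can be pictured as treating the garment reference as an extra frame prepended to the clip, as in the assumed-shape sketch below. This is an illustration of the layout only, not the paper's actual conditioning code.

```python
# Minimal illustration of temporal concatenation: the garment image is laid out as
# an extra "frame" before the person frames, so one video backbone sees both.
# Shapes are assumptions.
import torch

B, C, H, W = 1, 4, 64, 48                       # batch, channels, height, width (latents)
garment_latent = torch.randn(B, C, 1, H, W)     # garment image as a 1-frame clip
person_latents = torch.randn(B, C, 16, H, W)    # 16 person/video frames

# Concatenate along the temporal axis (dim=2); the try-on model can then denoise the
# person frames while attending to the prepended garment frame.
model_input = torch.cat([garment_latent, person_latents], dim=2)
print(model_input.shape)                        # torch.Size([1, 4, 17, 64, 48])
```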
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- AID: Adapting Image2Video Diffusion Models for Instruction-guided Video Prediction [88.70116693750452]
Text-guided video prediction (TVP) involves predicting the motion of future frames from the initial frame according to an instruction.
Previous TVP methods have made significant breakthroughs by adapting Stable Diffusion to this task.
We introduce the Multi-Modal Large Language Model (MLLM) to predict future video states based on initial frames and text instructions.
arXiv Detail & Related papers (2024-06-10T17:02:08Z)
- SalFoM: Dynamic Saliency Prediction with Video Foundation Models [37.25208752620703]
Video saliency prediction (VSP) has shown promising performance in emulating the human visual system.
We introduce SalFoM, a novel encoder-decoder video transformer architecture.
Our model employs an UnMasked Teacher (UMT) feature extractor and presents a heterogeneous decoder built around a locality-aware spatio-temporal transformer.
arXiv Detail & Related papers (2024-04-03T22:38:54Z)
- Exploring Pre-trained Text-to-Video Diffusion Models for Referring Video Object Segmentation [72.90144343056227]
We explore the visual representations produced by a pre-trained text-to-video (T2V) diffusion model for video understanding tasks.
We introduce a novel framework, termed "VD-IT", built from dedicated components designed around a fixed T2V model.
Our VD-IT achieves highly competitive results, surpassing many existing state-of-the-art methods.
arXiv Detail & Related papers (2024-03-18T17:59:58Z)
- LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models [133.088893990272]
We learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis.
We propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models.
arXiv Detail & Related papers (2023-09-26T17:52:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.