Video Virtual Try-on with Conditional Diffusion Transformer Inpainter
- URL: http://arxiv.org/abs/2506.21270v1
- Date: Thu, 26 Jun 2025 13:56:27 GMT
- Title: Video Virtual Try-on with Conditional Diffusion Transformer Inpainter
- Authors: Cheng Zou, Senlin Cheng, Bolei Xu, Dandan Zheng, Xiaobo Li, Jingdong Chen, Ming Yang,
- Abstract summary: Video virtual try-on aims to fit a garment to a target person in consecutive video frames. Recent diffusion-based video try-on methods, though very few, converge on a similar solution. We propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task.
- Score: 27.150975905047968
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video virtual try-on aims to naturally fit a garment to a target person in consecutive video frames. It is a challenging task: on the one hand, the output video should have good spatial-temporal consistency; on the other hand, the details of the given garment need to be preserved well in all frames. Naively applying image-based try-on methods frame by frame yields poor results due to severe inconsistency. Recent diffusion-based video try-on methods, though very few, converge on a similar solution: inserting temporal attention into an image-based try-on model to adapt it for the video try-on task. This has shown improvements, but inconsistency problems remain. In this paper, we propose ViTI (Video Try-on Inpainter), which formulates and implements video virtual try-on as a conditional video inpainting task, unlike previous methods. In this way, we start from a video generation problem rather than an image-based try-on problem, which has better spatial-temporal consistency from the outset. Specifically, we first build a video inpainting framework based on a Diffusion Transformer with full 3D spatial-temporal attention, and then progressively adapt it for video garment inpainting with a collection of masking strategies and multi-stage training. After these steps, the model can inpaint the masked garment area with appropriate garment pixels according to the prompt, with good spatial-temporal consistency. Finally, as in other try-on methods, a garment condition is added to the model to ensure the inpainted garment appearance and details are as expected. Both quantitative and qualitative experimental results show that ViTI is superior to previous works.
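The abstract describes two key ingredients: assembling a conditional inpainting input (the video with the garment region masked out, the mask itself, and a garment condition) and running full 3D spatial-temporal attention over all video tokens at once. The following is a minimal NumPy sketch of those two ideas only, not the authors' implementation; the function names, the channel-concatenation of the mask, and the prepending of garment condition tokens are illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def full_3d_attention(tokens):
    """Full 3D spatial-temporal self-attention over a flattened video.

    tokens: (T*H*W [+ cond], d) -- every token attends to every other
    token across both space and time, rather than attention being
    factorized into separate spatial and temporal passes.
    Identity Q/K/V projections are used here to keep the sketch short.
    """
    d = tokens.shape[-1]
    attn = softmax(tokens @ tokens.T / np.sqrt(d))
    return attn @ tokens


def build_inpainting_input(video, mask, garment_cond):
    """Assemble the conditional video-inpainting input.

    video:        (T, H, W, C) frames
    mask:         (T, H, W, 1), 1 marks the garment region to inpaint
    garment_cond: (M, C+1) garment condition tokens (hypothetical shape)
    """
    masked = video * (1.0 - mask)                 # erase the garment region
    x = np.concatenate([masked, mask], axis=-1)   # mask as an extra channel
    T, H, W, Cc = x.shape
    tokens = x.reshape(T * H * W, Cc)             # flatten space-time to tokens
    # Prepend garment condition tokens so attention can read them directly.
    return np.concatenate([garment_cond, tokens], axis=0)
```

With this formulation, temporal consistency comes from the attention itself: a masked pixel in frame t can attend directly to garment pixels in every other frame, instead of relying on a temporal layer bolted onto an image model.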
Related papers
- OutDreamer: Video Outpainting with a Diffusion Transformer [37.512451098188635]
We introduce OutDreamer, a DiT-based video outpainting framework. We propose a mask-driven self-attention layer that dynamically integrates the given mask information. For long video outpainting, we employ a cross-video-clip refiner to iteratively generate missing content.
arXiv Detail & Related papers (2025-06-27T15:08:54Z) - FramePainter: Endowing Interactive Image Editing with Video Diffusion Priors [64.54220123913154]
We introduce FramePainter as an efficient instantiation of the image-to-video generation problem. It uses only a lightweight sparse control encoder to inject editing signals. It dominantly outperforms previous state-of-the-art methods with far less training data.
arXiv Detail & Related papers (2025-01-14T16:09:16Z) - SwiftTry: Fast and Consistent Video Virtual Try-On with Diffusion Models [10.66567645920237]
Given an input video of a person and a new garment, the objective of this paper is to synthesize a new video where the person is wearing the garment while maintaining temporal consistency. We reconceptualize video virtual try-on as a conditional video inpainting task, with garments serving as input conditions. Specifically, our approach enhances image diffusion models by incorporating temporal attention layers to improve temporal coherence.
arXiv Detail & Related papers (2024-12-13T14:50:26Z) - Warped Diffusion: Solving Video Inverse Problems with Image Diffusion Models [56.691967706131]
We view frames as continuous functions in the 2D space, and videos as a sequence of continuous warping transformations between different frames.
This perspective allows us to train function space diffusion models only on images and utilize them to solve temporally correlated inverse problems.
Our method allows us to deploy state-of-the-art latent diffusion models such as Stable Diffusion XL to solve video inverse problems.
arXiv Detail & Related papers (2024-10-21T16:19:34Z) - Replace Anyone in Videos [82.37852750357331]
We present the ReplaceAnyone framework, which focuses on localized human replacement and insertion featuring intricate backgrounds. We formulate this task as an image-conditioned video inpainting paradigm with pose guidance, utilizing a unified end-to-end video diffusion architecture. The proposed ReplaceAnyone can be seamlessly applied not only to traditional 3D-UNet base models but also to DiT-based video models such as Wan2.1.
arXiv Detail & Related papers (2024-09-30T03:27:33Z) - Video Diffusion Models are Strong Video Inpainter [14.402778136825642]
We propose a novel First Frame Filling Video Diffusion Inpainting model (FFF-VDI). We propagate the noise latent information of future frames to fill the masked areas of the first frame's noise latent code. Next, we fine-tune the pre-trained image-to-video diffusion model to generate the inpainted video.
arXiv Detail & Related papers (2024-08-21T08:01:00Z) - WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z) - ViViD: Video Virtual Try-on using Diffusion Models [46.710863047471264]
Video virtual try-on aims to transfer a clothing item onto the video of a target person.
Previous video-based try-on solutions can only generate low visual quality and blurring results.
We present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on.
arXiv Detail & Related papers (2024-05-20T05:28:22Z) - Learning Joint Spatial-Temporal Transformations for Video Inpainting [58.939131620135235]
We propose to learn a joint Spatial-Temporal Transformer Network (STTN) for video inpainting.
We simultaneously fill missing regions in all input frames by self-attention, and propose to optimize STTN by a spatial-temporal adversarial loss.
arXiv Detail & Related papers (2020-07-20T16:35:48Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.