Fashion-VDM: Video Diffusion Model for Virtual Try-On
- URL: http://arxiv.org/abs/2411.00225v2
- Date: Mon, 04 Nov 2024 16:46:01 GMT
- Title: Fashion-VDM: Video Diffusion Model for Virtual Try-On
- Authors: Johanna Karras, Yingwei Li, Nan Liu, Luyang Zhu, Innfarn Yoo, Andreas Lugmayr, Chris Lee, Ira Kemelmacher-Shlizerman
- Abstract summary: We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos.
Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment.
- Score: 17.284966713669927
- Abstract: We present Fashion-VDM, a video diffusion model (VDM) for generating virtual try-on videos. Given an input garment image and person video, our method aims to generate a high-quality try-on video of the person wearing the given garment, while preserving the person's identity and motion. Image-based virtual try-on has shown impressive results; however, existing video virtual try-on (VVT) methods are still lacking garment details and temporal consistency. To address these issues, we propose a diffusion-based architecture for video virtual try-on, split classifier-free guidance for increased control over the conditioning inputs, and a progressive temporal training strategy for single-pass 64-frame, 512px video generation. We also demonstrate the effectiveness of joint image-video training for video try-on, especially when video data is limited. Our qualitative and quantitative experiments show that our approach sets the new state-of-the-art for video virtual try-on. For additional results, visit our project page: https://johannakarras.github.io/Fashion-VDM.
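The abstract does not spell out how the split classifier-free guidance is implemented. As a rough illustration, the PyTorch sketch below shows one standard way to apply separate guidance weights to two conditioning signals (here, hypothetical person and garment conditions); the module and parameter names are assumptions for the example, not Fashion-VDM's actual interface.

```python
# Hypothetical sketch of "split" classifier-free guidance over two conditioning
# signals (person features and garment features). The decomposition below is the
# standard multi-condition CFG formulation; Fashion-VDM's exact conditioning
# interface is not described in the abstract, so all names here are illustrative.
import torch
import torch.nn as nn


class ToyDenoiser(nn.Module):
    """Stand-in for a video diffusion denoiser: predicts noise from a latent."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.net = nn.Linear(dim * 3 + 1, dim)

    def forward(self, x_t, t, person_cond=None, garment_cond=None):
        # Missing conditions are replaced by zeros ("null" embeddings).
        person = person_cond if person_cond is not None else torch.zeros_like(x_t)
        garment = garment_cond if garment_cond is not None else torch.zeros_like(x_t)
        t_feat = t.float().view(-1, 1)
        return self.net(torch.cat([x_t, person, garment, t_feat], dim=-1))


def split_cfg(model, x_t, t, person_cond, garment_cond, w_person=2.0, w_garment=7.5):
    """Combine unconditional, person-only, and fully conditioned noise
    predictions, with a separate guidance weight per conditioning signal."""
    eps_uncond = model(x_t, t)                            # no conditioning
    eps_person = model(x_t, t, person_cond)               # person only
    eps_both = model(x_t, t, person_cond, garment_cond)   # person + garment
    return (eps_uncond
            + w_person * (eps_person - eps_uncond)
            + w_garment * (eps_both - eps_person))


if __name__ == "__main__":
    model = ToyDenoiser()
    x_t = torch.randn(2, 16)
    t = torch.tensor([500, 500])
    person, garment = torch.randn(2, 16), torch.randn(2, 16)
    print(split_cfg(model, x_t, t, person, garment).shape)  # torch.Size([2, 16])
```

Splitting the weights lets the garment strength be raised without over-constraining the person signal, which is one way to read the "increased control over the conditioning inputs" mentioned in the abstract.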
Related papers
- CatV2TON: Taming Diffusion Transformers for Vision-Based Virtual Try-On with Temporal Concatenation [75.10635392993748]
We introduce CatV2TON, a vision-based virtual try-on (V2TON) method that supports both image and video try-on tasks.
By temporally concatenating garment and person inputs and training on a mix of image and video datasets, CatV2TON achieves robust try-on performance.
We also present ViViD-S, a refined video try-on dataset, obtained by filtering out back-facing frames and applying 3D mask smoothing.
arXiv Detail & Related papers (2025-01-20T08:09:36Z)
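The CatV2TON summary above hinges on temporally concatenating garment and person inputs so that one sequence model attends over both. Below is a minimal, hypothetical PyTorch sketch of that concatenation; the latent shapes and the plain transformer encoder are stand-ins, not the paper's architecture.

```python
# Illustrative sketch of temporal concatenation for try-on conditioning:
# garment frames are prepended to person frames along the time axis so a
# single sequence model attends over both. Shapes and modules are assumptions;
# CatV2TON's real tokenizer/transformer is not reproduced here.
import torch
import torch.nn as nn

B, T_person, C, H, W = 1, 8, 4, 32, 24   # toy latent-space video sizes
T_garment = 1                            # a single garment image "frame"

person_latents = torch.randn(B, T_person, C, H, W)
garment_latents = torch.randn(B, T_garment, C, H, W)

# Temporal concatenation: the model sees [garment | person] as one sequence.
sequence = torch.cat([garment_latents, person_latents], dim=1)  # (B, 1+T, C, H, W)

# Flatten each frame to a token and run a vanilla transformer encoder over time.
tokens = sequence.flatten(2)                                    # (B, 1+T, C*H*W)
proj = nn.Linear(C * H * W, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
    num_layers=2,
)
out = encoder(proj(tokens))                                     # (B, 1+T, 256)

# Only the person positions would be denoised; the garment position acts
# purely as conditioning context.
person_out = out[:, T_garment:]
print(person_out.shape)  # torch.Size([1, 8, 256])
```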
- RealVVT: Towards Photorealistic Video Virtual Try-on via Spatio-Temporal Consistency [26.410982262831975]
RealVVT is a photorealistic video virtual try-on framework designed to strengthen stability and realism in dynamic video contexts.
Our approach outperforms existing state-of-the-art models in both single-image and video VTO tasks.
arXiv Detail & Related papers (2025-01-15T09:22:38Z)
- Video Creation by Demonstration [59.389591010842636]
We present $\delta$-Diffusion, a self-supervised training approach that learns from unlabeled videos by conditional future frame prediction.
By leveraging a video foundation model with an appearance bottleneck design on top, we extract action latents from demonstration videos for conditioning the generation process.
Empirically, $\delta$-Diffusion outperforms related baselines in terms of both human preference and large-scale machine evaluations.
arXiv Detail & Related papers (2024-12-12T18:41:20Z)
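$\delta$-Diffusion's summary describes self-supervised conditional future-frame prediction, with action latents extracted through an appearance-bottlenecked encoder. The toy sketch below only illustrates that training signal; for brevity it uses a plain regression loss in place of a diffusion generator, and every module, dimension, and name is a placeholder rather than the paper's design.

```python
# Schematic of self-supervised conditional future-frame prediction with a
# low-dimensional "action latent" bottleneck. Purely illustrative of the
# training signal described in the abstract, not delta-Diffusion's model.
import torch
import torch.nn as nn

class ActionEncoder(nn.Module):
    """Compresses a clip into a small latent, limiting appearance leakage."""
    def __init__(self, frame_dim=64, action_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim, 32), nn.ReLU(), nn.Linear(32, action_dim))
    def forward(self, future_frames):                 # (B, T, frame_dim)
        return self.net(future_frames.mean(dim=1))    # (B, action_dim)

class FuturePredictor(nn.Module):
    """Predicts the next frame from the current frame plus the action latent."""
    def __init__(self, frame_dim=64, action_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(frame_dim + action_dim, 128), nn.ReLU(), nn.Linear(128, frame_dim))
    def forward(self, frame, action):
        return self.net(torch.cat([frame, action], dim=-1))

encoder, predictor = ActionEncoder(), FuturePredictor()
opt = torch.optim.Adam(list(encoder.parameters()) + list(predictor.parameters()), lr=1e-4)

video = torch.randn(4, 9, 64)                # (B, T, frame_dim) toy "videos"
past, future = video[:, :-1], video[:, 1:]

action = encoder(future)                     # action latent from the demo clip
pred = predictor(past[:, -1], action)        # predict one future frame
loss = nn.functional.mse_loss(pred, future[:, -1])
loss.backward()
opt.step()
print(float(loss))
```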
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)
- ViViD: Video Virtual Try-on using Diffusion Models [46.710863047471264]
Video virtual try-on aims to transfer a clothing item onto the video of a target person.
Previous video-based try-on solutions could only produce results of low visual quality with blurring artifacts.
We present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on.
arXiv Detail & Related papers (2024-05-20T05:28:22Z)
- DreamPose: Fashion Image-to-Video Synthesis via Stable Diffusion [63.179505586264014]
We present DreamPose, a diffusion-based method for generating animated fashion videos from still images.
Given an image and a sequence of human body poses, our method synthesizes a video containing both human and fabric motion.
arXiv Detail & Related papers (2023-04-12T17:59:17Z)
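DreamPose's interface, per the summary above, is a still image plus a sequence of body poses that together drive frame synthesis. The loop below is only a toy illustration of per-frame denoising conditioned on an image embedding and a pose; DreamPose actually fine-tunes Stable Diffusion with dedicated image and pose conditioning, which is not reproduced here.

```python
# Rough sketch of image-and-pose conditioned frame synthesis: each output frame
# is produced by a denoiser conditioned on a subject-image embedding and that
# frame's pose. All modules and shapes are stand-ins for illustration only.
import torch
import torch.nn as nn

class FrameDenoiser(nn.Module):
    def __init__(self, latent_dim=32, cond_dim=16, pose_dim=34):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + pose_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
    def forward(self, z, image_emb, pose):
        return self.net(torch.cat([z, image_emb, pose], dim=-1))

denoiser = FrameDenoiser()
image_emb = torch.randn(1, 16)        # embedding of the still fashion image
poses = torch.randn(12, 1, 34)        # 12 target body poses (17 keypoints x 2)

frames = []
for pose in poses:                    # generate one frame per target pose
    z = torch.randn(1, 32)            # start each frame from noise
    for _ in range(4):                # a handful of toy "denoising" refinements
        z = z - 0.1 * denoiser(z, image_emb, pose)
    frames.append(z)

video = torch.stack(frames)           # (T, 1, latent_dim) latent video
print(video.shape)
```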
- SinFusion: Training Diffusion Models on a Single Image or Video [11.473177123332281]
Diffusion models have shown tremendous progress in image and video generation, exceeding GANs in quality and diversity.
In this paper, we show how the reliance on large training datasets can be removed by training a diffusion model on a single input image or video.
Our image/video-specific diffusion model (SinFusion) learns the appearance and dynamics of the single image or video, while utilizing the conditioning capabilities of diffusion models.
arXiv Detail & Related papers (2022-11-21T18:59:33Z)
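SinFusion fits a diffusion model to a single image or video. A minimal sketch of the general single-example recipe (train a small denoiser on random crops of the one video with a DDPM-style objective) is shown below; the architecture, crop sizes, and noise schedule are placeholders, and SinFusion's actual fully-convolutional, multi-model design differs.

```python
# Minimal sketch of fitting a denoising diffusion model to a single video by
# sampling random spatio-temporal crops from it. Illustrative only; not
# SinFusion's released implementation.
import torch
import torch.nn as nn

video = torch.randn(1, 3, 30, 64, 64)        # one training video: (B, C, T, H, W)

denoiser = nn.Sequential(                     # toy noise predictor over crops
    nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv3d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(denoiser.parameters(), lr=2e-4)

def random_crop(x, t=8, s=32):
    """Random spatio-temporal crop from the single training video."""
    _, _, T, H, W = x.shape
    ti = torch.randint(0, T - t + 1, (1,)).item()
    hi = torch.randint(0, H - s + 1, (1,)).item()
    wi = torch.randint(0, W - s + 1, (1,)).item()
    return x[:, :, ti:ti + t, hi:hi + s, wi:wi + s]

for step in range(100):                       # standard DDPM-style objective
    x0 = random_crop(video)
    noise = torch.randn_like(x0)
    alpha = torch.rand(1)                     # crude stand-in for a noise schedule
    xt = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise
    loss = nn.functional.mse_loss(denoiser(xt), noise)
    opt.zero_grad(); loss.backward(); opt.step()

print("final loss:", float(loss))
```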
- Deep Video Prior for Video Consistency and Propagation [58.250209011891904]
We present a novel and general approach for blind video temporal consistency.
Our method is trained directly on a pair of original and processed videos rather than on a large dataset.
We show that temporal consistency can be achieved by training a convolutional neural network on a video with the Deep Video Prior.
arXiv Detail & Related papers (2022-01-27T16:38:52Z)
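The Deep Video Prior idea summarized above can be sketched compactly: train a fresh CNN to map each original frame to its per-frame-processed (flickering) counterpart and stop early, before the network has fit the frame-specific flicker. The code below is a minimal illustration under assumed shapes and iteration counts, not the authors' released implementation.

```python
# Minimal sketch of the Deep Video Prior idea: a freshly initialized CNN is
# fit to map original frames to their flickering processed counterparts;
# stopping early yields a temporally consistent reprocessed video because the
# shared content is learned before the per-frame flicker. Settings are toy.
import torch
import torch.nn as nn

original = torch.randn(20, 3, 64, 64)     # input video frames
processed = torch.randn(20, 3, 64, 64)    # per-frame processed (flickering) frames

net = nn.Sequential(                      # small frame-to-frame CNN, trained from scratch
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1),
)
opt = torch.optim.Adam(net.parameters(), lr=1e-4)

for it in range(200):                     # early stopping is the key "prior"
    idx = torch.randint(0, original.shape[0], (4,))
    pred = net(original[idx])
    loss = nn.functional.l1_loss(pred, processed[idx])
    opt.zero_grad(); loss.backward(); opt.step()

with torch.no_grad():
    stabilized = net(original)            # temporally consistent reprocessed video
print(stabilized.shape)                   # torch.Size([20, 3, 64, 64])
```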
- Diverse Generation from a Single Video Made Possible [24.39972895902724]
We present a fast and practical method for video generation and manipulation from a single natural video.
Our method generates more realistic and higher-quality results than single-video GANs.
arXiv Detail & Related papers (2021-09-17T15:12:17Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.