Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism
- URL: http://arxiv.org/abs/2412.09822v1
- Date: Fri, 13 Dec 2024 03:20:53 GMT
- Title: Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism
- Authors: Jun Zheng, Jing Wang, Fuwei Zhao, Xujie Zhang, Xiaodan Liang
- Abstract summary: Video try-on is a promising area with tremendous real-world potential.
Previous research has primarily focused on transferring product clothing images to videos with simple human poses.
We propose a novel video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On.
- Score: 52.9091817868613
- Abstract: Video try-on is a promising area with tremendous real-world potential. Previous research on video try-on has primarily focused on transferring product clothing images to videos with simple human poses, and performs poorly under complex movements. To better preserve clothing details, those approaches are equipped with an additional garment encoder, which increases computational resource consumption. The primary challenges in this domain are twofold: (1) leveraging the garment encoder's capabilities in video try-on while lowering computational requirements; (2) ensuring temporal consistency in the synthesis of human body parts, especially during rapid movements. To tackle these issues, we propose a novel video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On. To reduce computational overhead, we adopt a straightforward approach: the DiT backbone itself serves as the garment encoder, and a dynamic feature fusion module stores and integrates the garment features. To ensure temporal consistency of human body parts, we introduce a limb-aware dynamic attention module that directs the DiT backbone to focus on the regions of human limbs during the denoising process. Extensive experiments demonstrate the superiority of Dynamic Try-On in generating stable and smooth try-on results, even for videos featuring complicated human postures.
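The limb-aware dynamic attention idea lends itself to a compact illustration. Below is a minimal PyTorch sketch of one plausible reading: a binary limb mask (e.g., derived from pose estimates) adds a positive bias to the attention logits, so every query token attends more strongly to limb-region keys during denoising. The function name, the additive-bias formulation, the bias scale, and all tensor shapes are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch of a limb-aware attention bias over flattened video-frame
# tokens, assuming a DiT-style self-attention layer. All names and shapes
# are hypothetical; this is not the paper's actual module.
import torch

def limb_aware_attention(q, k, v, limb_mask, bias_scale=1.0):
    """q, k, v: (batch, heads, tokens, dim); limb_mask: (batch, tokens),
    1.0 where a token lies on a human limb (e.g., from pose maps)."""
    d = q.size(-1)
    logits = q @ k.transpose(-2, -1) / d ** 0.5       # (B, H, T, T)
    # Positive bias on limb-region keys: every query attends
    # more strongly to limb tokens during the denoising step.
    bias = bias_scale * limb_mask[:, None, None, :]   # broadcast over heads/queries
    attn = torch.softmax(logits + bias, dim=-1)
    return attn @ v                                   # (B, H, T, dim)

# Toy usage: 2 frames x 64 tokens each, 4 heads, 32-dim heads.
B, H, T, D = 1, 4, 128, 32
q, k, v = (torch.randn(B, H, T, D) for _ in range(3))
limb_mask = (torch.rand(B, T) > 0.7).float()          # hypothetical limb tokens
out = limb_aware_attention(q, k, v, limb_mask)
print(out.shape)  # torch.Size([1, 4, 128, 32])
```

In this reading, the mask only biases rather than restricts attention, so garment and background tokens remain visible while limb regions receive extra weight; a hard mask would be an alternative design choice.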
Related papers
- X-Dyna: Expressive Dynamic Human Image Animation [49.896933584815926]
X-Dyna is a zero-shot, diffusion-based pipeline for animating a single human image.
It generates realistic, context-aware dynamics for both the subject and the surrounding environment.
arXiv Detail & Related papers (2025-01-17T08:10:53Z)
- ViViD: Video Virtual Try-on using Diffusion Models [46.710863047471264]
Video virtual try-on aims to transfer a clothing item onto the video of a target person.
Previous video-based try-on solutions can only generate results with low visual quality and blurring artifacts.
We present ViViD, a novel framework employing powerful diffusion models to tackle the task of video virtual try-on.
arXiv Detail & Related papers (2024-05-20T05:28:22Z)
- AniDress: Animatable Loose-Dressed Avatar from Sparse Views Using Garment Rigging Model [58.035758145894846]
We introduce AniDress, a novel method for generating animatable human avatars in loose clothes using very sparse multi-view videos.
A pose-driven deformable neural radiance field conditioned on both body and garment motions is introduced, providing explicit control of both parts.
Our method renders natural garment dynamics that deviate highly from the body, and generalizes well to both unseen views and poses.
arXiv Detail & Related papers (2024-01-27T08:48:18Z)
- Dance In the Wild: Monocular Human Animation with Neural Dynamic Appearance Synthesis [56.550999933048075]
We propose a video-based synthesis method that tackles these challenges and demonstrates high-quality results for in-the-wild videos.
We introduce a novel motion signature that is used to modulate the generator weights to capture dynamic appearance changes; a minimal sketch of this weight-modulation idea appears after this list.
We evaluate our method on a set of challenging videos and show that our approach achieves state-of-the-art performance both qualitatively and quantitatively.
arXiv Detail & Related papers (2021-11-10T20:18:57Z)
- Render In-between: Motion Guided Video Synthesis for Action Interpolation [53.43607872972194]
We propose a motion-guided frame-upsampling framework that is capable of producing realistic human motion and appearance.
A novel motion model is trained to infer the non-linear skeletal motion between frames by leveraging a large-scale motion-capture dataset.
Our pipeline requires only low-frame-rate videos and unpaired human motion data for training; no high-frame-rate videos are needed.
arXiv Detail & Related papers (2021-11-01T15:32:51Z)
- High-Fidelity Neural Human Motion Transfer from Monocular Video [71.75576402562247]
Video-based human motion transfer creates video animations of humans following a source motion.
We present a new framework which performs high-fidelity and temporally-consistent human motion transfer with natural pose-dependent non-rigid deformations.
In the experimental results, we significantly outperform the state-of-the-art in terms of video realism.
arXiv Detail & Related papers (2020-12-20T16:54:38Z)
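On the "Dance In the Wild" entry above: modulating generator weights with a conditioning vector is a well-established pattern, and the sketch below illustrates it with the widely known StyleGAN2-style modulated convolution, using a hypothetical motion signature as the conditioning input. It is a sketch of the general technique under those assumptions, not that paper's actual code.

```python
# Hedged sketch: a per-sample motion signature rescales convolution weights
# (StyleGAN2-style modulation + demodulation). Names, shapes, and the
# signature dimension are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv2d(nn.Module):
    def __init__(self, in_ch, out_ch, sig_dim, kernel=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, kernel, kernel))
        self.affine = nn.Linear(sig_dim, in_ch)  # signature -> per-channel scales
        self.pad = kernel // 2

    def forward(self, x, signature):
        b, in_ch, h, w = x.shape
        scale = self.affine(signature).view(b, 1, in_ch, 1, 1)
        w_mod = self.weight[None] * scale                  # modulate per sample
        # Demodulate so activations keep roughly unit variance.
        demod = torch.rsqrt((w_mod ** 2).sum(dim=(2, 3, 4), keepdim=True) + 1e-8)
        w_mod = w_mod * demod
        # Grouped-conv trick: fold the batch into the channel dimension.
        out_ch = w_mod.size(1)
        x = x.reshape(1, b * in_ch, h, w)
        w_mod = w_mod.reshape(b * out_ch, in_ch, *w_mod.shape[-2:])
        y = F.conv2d(x, w_mod, padding=self.pad, groups=b)
        return y.reshape(b, out_ch, h, w)

# Toy usage with a hypothetical 16-dim motion signature.
conv = ModulatedConv2d(8, 16, sig_dim=16)
y = conv(torch.randn(2, 8, 32, 32), torch.randn(2, 16))
print(y.shape)  # torch.Size([2, 16, 32, 32])
```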