ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On
- URL: http://arxiv.org/abs/2506.05858v1
- Date: Fri, 06 Jun 2025 08:26:39 GMT
- Title: ChronoTailor: Harnessing Attention Guidance for Fine-Grained Video Virtual Try-On
- Authors: Jinjuan Wang, Wenzhang Sun, Ming Li, Yun Zheng, Fanyao Li, Zhulin Tao, Donglin Di, Hao Li, Wei Chen, Xianglin Huang
- Abstract summary: Video virtual try-on aims to seamlessly replace the clothing of a person in a video with a target garment. Existing approaches still struggle to maintain continuity and reproduce garment details. ChronoTailor, a diffusion-based framework, generates temporally consistent videos while preserving fine-grained garment details.
- Score: 19.565037902386475
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Video virtual try-on aims to seamlessly replace the clothing of a person in a source video with a target garment. Despite significant progress in this field, existing approaches still struggle to maintain temporal continuity and to reproduce garment details. In this paper, we introduce ChronoTailor, a diffusion-based framework that generates temporally consistent videos while preserving fine-grained garment details. By employing a precise spatio-temporal attention mechanism to guide the integration of fine-grained garment features, ChronoTailor achieves robust try-on performance. First, ChronoTailor leverages region-aware spatial guidance to steer the evolution of spatial attention and employs an attention-driven temporal feature fusion mechanism to generate more continuous temporal features. This dual approach not only enables fine-grained local editing but also effectively mitigates artifacts arising from video dynamics. Second, ChronoTailor integrates multi-scale garment features to preserve low-level visual details and incorporates garment-pose feature alignment to ensure temporal continuity during dynamic motion. Additionally, we collect StyleDress, a new dataset featuring intricate garments, varied environments, and diverse poses, which offers advantages over existing public datasets and will be made publicly available for research. Extensive experiments show that ChronoTailor maintains spatio-temporal continuity and preserves garment details during motion, significantly outperforming previous methods.
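The region-aware spatial guidance described above can be pictured as attention from video latent tokens to garment feature tokens, gated by a try-on region mask so that only the garment region is edited. Below is a minimal PyTorch sketch of that idea; the function name, tensor shapes, and mask-blend formulation are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of region-aware attention guidance (not ChronoTailor's code).
import torch

def region_aware_garment_attention(x, garment_kv, region_mask):
    """x: (B, N, D) video latent tokens; garment_kv: (B, M, D) garment feature
    tokens; region_mask: (B, N, 1), 1 inside the try-on region, 0 elsewhere."""
    d = x.shape[-1]
    scores = x @ garment_kv.transpose(-2, -1) / d**0.5   # (B, N, M)
    attended = scores.softmax(dim=-1) @ garment_kv       # (B, N, D)
    # Only tokens inside the garment region receive garment features;
    # background tokens pass through unchanged (fine-grained local editing).
    return region_mask * attended + (1.0 - region_mask) * x

x = torch.randn(2, 256, 64)                      # video latent tokens
g = torch.randn(2, 77, 64)                       # garment feature tokens
mask = (torch.rand(2, 256, 1) > 0.5).float()     # try-on region mask
out = region_aware_garment_attention(x, g, mask) # (2, 256, 64)
```

Blending by the region mask is what makes the edit local: background tokens are untouched, which is one plausible way to mitigate the artifacts arising from video dynamics mentioned above.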
Related papers
- Once Is Enough: Lightweight DiT-Based Video Virtual Try-On via One-Time Garment Appearance Injection [21.00674585489938]
Video virtual try-on aims to replace the clothing of a person in a video with a target garment. We propose OIE (Once Is Enough), a virtual try-on strategy based on first-frame clothing replacement.
arXiv Detail & Related papers (2025-10-09T01:13:37Z) - Stable Video-Driven Portraits [52.008400639227034]
Portrait animation aims to generate photo-realistic videos from a single source image by reenacting the expression and pose from a driving video. Recent advances using diffusion models have demonstrated improved quality but remain constrained by weak control signals and architectural limitations. We propose a novel diffusion-based framework that leverages masked facial regions, specifically the eyes, nose, and mouth, from the driving video as strong motion control cues.
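As a rough illustration of masked facial regions serving as strong motion control cues, the sketch below uses ControlNet-style additive conditioning: the driving frame is masked to the eyes, nose, and mouth, encoded, and added to the denoiser's latent. The module name and architecture are assumptions, not the paper's design.

```python
# Hypothetical sketch: masked facial regions as motion-control conditioning.
import torch
import torch.nn as nn

class MaskedRegionControl(nn.Module):
    def __init__(self, channels=4, hidden=64):
        super().__init__()
        # Encodes the masked driving frame into a control feature map.
        self.encoder = nn.Sequential(
            nn.Conv2d(channels, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),
        )

    def forward(self, latent, driving_frame, face_masks):
        # face_masks: (B, 1, H, W), the union of eye/nose/mouth regions.
        control = self.encoder(driving_frame * face_masks)
        return latent + control  # additive conditioning, ControlNet-style
```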
arXiv Detail & Related papers (2025-09-22T08:11:08Z) - PoseGen: In-Context LoRA Finetuning for Pose-Controllable Long Human Video Generation [4.417342791754854]
We introduce PoseGen, a novel framework that generates arbitrarily long videos of a specific subject from a single reference image and a driving pose sequence. Our core innovation is an in-context LoRA finetuning strategy that injects subject appearance at the token level for identity preservation. We show that PoseGen significantly outperforms state-of-the-art methods in identity fidelity, pose accuracy, and its unique ability to produce coherent, artifact-free videos of unlimited duration.
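The two named ingredients, LoRA adapters and token-level injection of the reference image, admit a compact sketch. The code below follows standard LoRA conventions; the hyperparameters and the `with_reference` helper are illustrative assumptions, not PoseGen's implementation.

```python
# Hypothetical sketch of in-context LoRA finetuning for identity preservation.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=8, alpha=16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze the pretrained weight
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def with_reference(video_tokens, ref_tokens):
    # In-context injection: attention layers see [reference | video] tokens,
    # so subject appearance propagates through ordinary self-attention.
    return torch.cat([ref_tokens, video_tokens], dim=1)
```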
arXiv Detail & Related papers (2025-08-07T07:19:02Z) - MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on [16.0505428363005]
We propose MagicTryOn, a video virtual try-on framework built upon a large-scale video diffusion Transformer. We replace the U-Net architecture with a diffusion Transformer and use full self-attention to model garment consistency across video frames. Our method outperforms existing SOTA methods in comprehensive evaluations and generalizes to in-the-wild scenarios.
arXiv Detail & Related papers (2025-05-27T15:22:02Z) - Tracktention: Leveraging Point Tracking to Attend Videos Faster and Better [61.381599921020175]
Temporal consistency is critical in video prediction to ensure that outputs are coherent and free of artifacts. Traditional methods, such as temporal attention and 3D convolution, may struggle with significant object motion. We propose the Tracktention Layer, a novel architectural component that explicitly integrates motion information using point tracks.
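To make "integrating motion information using point tracks" concrete, here is a hedged sketch: per-frame features are bilinearly sampled at each track's location, then plain attention runs over time within each track. The shapes and sampling scheme are assumptions, not the Tracktention Layer's actual design.

```python
# Hypothetical sketch of temporal attention along point tracks.
import torch
import torch.nn.functional as F

def track_attention(feat, tracks):
    """feat: (T, C, H, W) per-frame features; tracks: (N, T, 2) xy in [-1, 1].
    Returns (N, T, C): temporally smoothed features along each track."""
    T, C, H, W = feat.shape
    grid = tracks.permute(1, 0, 2).unsqueeze(2)               # (T, N, 1, 2)
    sampled = F.grid_sample(feat, grid, align_corners=False)  # (T, C, N, 1)
    x = sampled.squeeze(-1).permute(2, 0, 1)                  # (N, T, C)
    # Scaled dot-product attention over time, independently per track,
    # so features stay consistent along each object's motion path.
    scores = x @ x.transpose(-2, -1) / C**0.5                 # (N, T, T)
    return scores.softmax(dim=-1) @ x
```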
arXiv Detail & Related papers (2025-03-25T17:58:48Z) - Exploring Temporally-Aware Features for Point Tracking [58.63091479730935]
Chrono is a feature backbone specifically designed for point tracking with built-in temporal awareness. Chrono achieves state-of-the-art performance in a refiner-free setting on the TAP-Vid-DAVIS and TAP-Vid-Kinetics datasets.
arXiv Detail & Related papers (2025-01-21T15:39:40Z) - Dynamic Try-On: Taming Video Virtual Try-on with Dynamic Attention Mechanism [52.9091817868613]
Video try-on is a promising area for its tremendous real-world potential. Previous research has primarily focused on transferring product clothing images to videos with simple human poses. We propose a novel video try-on framework based on the Diffusion Transformer (DiT), named Dynamic Try-On.
arXiv Detail & Related papers (2024-12-13T03:20:53Z) - Trajectory Attention for Fine-grained Video Motion Control [20.998809534747767]
This paper introduces trajectory attention, a novel approach that performs attention along available pixel trajectories for fine-grained camera motion control. We show that our approach can be extended to other video motion control tasks, such as first-frame-guided video editing.
arXiv Detail & Related papers (2024-11-28T18:59:51Z) - Garment Animation NeRF with Color Editing [6.357662418254495]
We propose a novel approach to synthesize garment animations from body motion sequences without the need for an explicit garment proxy.
Our approach infers garment dynamic features from body motion, providing a preliminary overview of garment structure.
We demonstrate the generalizability of our method across unseen body motions and camera views, ensuring detailed structural consistency.
arXiv Detail & Related papers (2024-07-29T08:17:05Z) - AniDress: Animatable Loose-Dressed Avatar from Sparse Views Using Garment Rigging Model [58.035758145894846]
We introduce AniDress, a novel method for generating animatable human avatars in loose clothes using very sparse multi-view videos.
A pose-driven deformable neural radiance field conditioned on both body and garment motions is introduced, providing explicit control of both parts.
Our method renders natural garment dynamics that deviate strongly from the body and generalizes well to both unseen views and poses.
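The pose-driven deformable radiance field described above can be sketched as a deformation MLP that warps sample points according to body and garment motion codes before querying a canonical NeRF. Everything below (names, dimensions, the two-layer MLPs) is an illustrative assumption, not AniDress's architecture.

```python
# Hypothetical sketch of a pose- and garment-motion-conditioned deformable NeRF.
import torch
import torch.nn as nn

class PoseDeformableNeRF(nn.Module):
    def __init__(self, pose_dim=69, motion_dim=32, hidden=128):
        super().__init__()
        cond = pose_dim + motion_dim
        self.deform = nn.Sequential(          # predicts per-point offsets
            nn.Linear(3 + cond, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
        self.canonical = nn.Sequential(       # canonical radiance field
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),             # RGB + density
        )

    def forward(self, pts, body_pose, garment_motion):
        """pts: (B, P, 3); body_pose: (B, pose_dim); garment_motion: (B, motion_dim)."""
        cond = torch.cat([body_pose, garment_motion], dim=-1)
        cond = cond.unsqueeze(1).expand(-1, pts.shape[1], -1)
        offset = self.deform(torch.cat([pts, cond], dim=-1))
        return self.canonical(pts + offset)   # query in canonical space
```

Conditioning the deformation on both body and garment motion codes is what would give explicit control of the two parts.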
arXiv Detail & Related papers (2024-01-27T08:48:18Z) - Motion Guided Deep Dynamic 3D Garments [45.711340917768766]
We focus on motion guided dynamic 3D garments, especially for loose garments.
In a data-driven setup, we first learn a generative space of plausible garment geometries.
We show improvements over multiple state-of-the-art alternatives.
arXiv Detail & Related papers (2022-09-23T07:17:46Z) - STAU: A SpatioTemporal-Aware Unit for Video Prediction and Beyond [78.129039340528]
We propose a spatiotemporal-aware unit (STAU) for video prediction and beyond. STAU outperforms other methods on all tasks in terms of both performance and efficiency.
arXiv Detail & Related papers (2022-04-20T13:42:51Z) - Exploring Rich and Efficient Spatial Temporal Interactions for Real Time Video Salient Object Detection [87.32774157186412]
Mainstream methods formulate video saliency from two independent venues, i.e., the spatial and temporal branches.
In this paper, we propose a spatiotemporal network to achieve such improvement in a fully interactive fashion.
Our method is easy to implement yet effective, achieving high-quality video saliency detection at a real-time speed of 50 FPS.
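One plausible reading of "fully interactive fashion" is that the spatial and temporal branches modulate each other rather than being fused by late concatenation. The gated fusion below is an assumption-laden illustration of that reading, not the paper's architecture.

```python
# Hypothetical sketch of interactive spatial-temporal branch fusion.
import torch
import torch.nn as nn

class InteractiveSTFusion(nn.Module):
    def __init__(self, c=64):
        super().__init__()
        self.spatial = nn.Conv2d(c, c, 3, padding=1)                   # per-frame
        self.temporal = nn.Conv3d(c, c, (3, 1, 1), padding=(1, 0, 0))  # across frames
        self.gate = nn.Conv2d(2 * c, c, 1)

    def forward(self, x):
        # x: (B, C, T, H, W)
        B, C, T, H, W = x.shape
        s = self.spatial(x.transpose(1, 2).reshape(B * T, C, H, W))
        t = self.temporal(x).transpose(1, 2).reshape(B * T, C, H, W)
        # Each branch's contribution is weighted by a gate computed from both,
        # so the fusion is interactive rather than independent.
        g = torch.sigmoid(self.gate(torch.cat([s, t], dim=1)))
        out = g * s + (1 - g) * t
        return out.reshape(B, T, C, H, W).transpose(1, 2)
```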
arXiv Detail & Related papers (2020-08-07T03:24:04Z)