The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
- URL: http://arxiv.org/abs/2512.20340v1
- Date: Tue, 23 Dec 2025 13:15:31 GMT
- Title: The devil is in the details: Enhancing Video Virtual Try-On via Keyframe-Driven Details Injection
- Authors: Qingdong He, Xueqin Chen, Yanjie Pan, Peng Tang, Pengcheng Xu, Zhenye Gan, Chengjie Wang, Xiaobin Hu, Jiangning Zhang, Yabiao Wang
- Abstract summary: KeyTailor is a novel framework for realistic try-on video. It uses an instruction-guided sampling strategy to filter informative frames from the input video. Our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments.
- Score: 90.30501870268911
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Although diffusion transformer (DiT)-based video virtual try-on (VVT) has made significant progress in synthesizing realistic videos, existing methods still struggle to capture fine-grained garment dynamics and preserve background integrity across video frames. They also incur high computational costs due to additional interaction modules introduced into DiTs, while the limited scale and quality of existing public datasets restrict model generalization and effective training. To address these challenges, we propose a novel framework, KeyTailor, along with a large-scale, high-definition dataset, ViT-HD. The core idea of KeyTailor is a keyframe-driven details injection strategy, motivated by the fact that keyframes inherently contain both foreground dynamics and background consistency. Specifically, KeyTailor adopts an instruction-guided keyframe sampling strategy to filter informative frames from the input video. Subsequently, two tailored keyframe-driven modules, the garment details enhancement module and the collaborative background optimization module, are employed to distill garment dynamics into garment-related latents and to optimize the integrity of background latents, both guided by keyframes. These enriched details are then injected into standard DiT blocks together with pose, mask, and noise latents, enabling efficient and realistic try-on video synthesis. This design ensures consistency without explicitly modifying the DiT architecture, while simultaneously avoiding additional complexity. In addition, our dataset ViT-HD comprises 15,070 high-quality video samples at a resolution of 810×1080, covering diverse garments. Extensive experiments demonstrate that KeyTailor outperforms state-of-the-art baselines in terms of garment fidelity and background integrity across both dynamic and static scenarios.
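To make the idea of filtering informative frames concrete, here is a minimal sketch of keyframe selection. It is not KeyTailor's instruction-guided sampling strategy, which the abstract does not specify; it swaps in a simple frame-difference heuristic. The function names `motion_scores` and `select_keyframes`, the flat-list frame representation, and the rule of always keeping frame 0 are all assumptions made for illustration.

```python
from typing import List, Sequence


def motion_scores(frames: Sequence[Sequence[float]]) -> List[float]:
    """Mean absolute pixel difference between each frame and its predecessor.

    Frame 0 has no predecessor, so its score is 0.0.
    """
    scores = [0.0]
    for prev, cur in zip(frames, frames[1:]):
        diff = sum(abs(a - b) for a, b in zip(prev, cur))
        scores.append(diff / len(cur))
    return scores


def select_keyframes(frames: Sequence[Sequence[float]], k: int) -> List[int]:
    """Return the indices of up to k keyframes, in temporal order.

    Hypothetical heuristic (not the paper's method): always keep frame 0,
    then add the frames with the largest motion scores.
    """
    if not frames or k <= 0:
        return []
    scores = motion_scores(frames)
    ranked = sorted(range(1, len(frames)), key=lambda i: scores[i], reverse=True)
    return sorted([0] + ranked[: k - 1])


# Toy "video": five frames of four grayscale pixels each.
video = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],  # identical to frame 0
    [1.0, 1.0, 1.0, 1.0],  # large change
    [1.0, 1.0, 1.0, 0.9],  # small change
    [0.0, 0.0, 1.0, 1.0],  # medium change
]
print(select_keyframes(video, k=3))  # → [0, 2, 4]
```

A real pipeline would score decoded frames (for example as arrays) and, following the abstract, would condition the selection on an instruction rather than on raw pixel change alone.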
Related papers
- Distill Video Datasets into Images [28.61426017935629]
Single-Frame Video set Distillation (SFVD) is a framework that distills videos into highly informative frames for each class. SFVD substantially outperforms prior methods, achieving improvements of up to 5.3% on MiniUCF.
arXiv Detail & Related papers (2025-12-16T17:33:41Z)
- STAGE: Storyboard-Anchored Generation for Cinematic Multi-shot Narrative [55.05324155854762]
We introduce a SToryboard-Anchored GEneration workflow to reformulate the STAGE-based video generation task. Instead of using sparses, we propose STEP2 to predict a structural storyboard composed of start-end frame pairs for each shot. We also contribute the large-scale ConStoryBoard dataset, including high-quality movie clips with fine-grained narratives for story progression, cinematic attributes, and human preferences.
arXiv Detail & Related papers (2025-12-13T15:57:29Z)
- Eevee: Towards Close-up High-resolution Video-based Virtual Try-on [23.37783900582483]
We introduce a high-resolution dataset for video-based virtual try-on. This dataset includes full-shot and close-up try-on videos of real human models. We propose a new garment consistency metric VGID that quantifies the preservation of both texture and structure.
arXiv Detail & Related papers (2025-11-24T10:19:56Z)
- FrameMind: Frame-Interleaved Video Reasoning via Reinforcement Learning [65.42201665046505]
Current video understanding models rely on fixed frame sampling strategies, processing predetermined visual inputs regardless of the specific reasoning requirements of each question. This static approach limits their ability to adaptively gather visual evidence, leading to suboptimal performance on tasks that require broad temporal coverage or fine-grained spatial detail. We introduce FrameMind, an end-to-end framework trained with reinforcement learning that enables models to dynamically request visual information during reasoning through Frame-Interleaved Chain-of-Thought (FiCOT). Unlike traditional approaches, FrameMind operates in multiple turns where the model alternates between textual reasoning and active visual perception, using tools to extract
arXiv Detail & Related papers (2025-09-28T17:59:43Z)
- DreamVVT: Mastering Realistic Video Virtual Try-On in the Wild via a Stage-Wise Diffusion Transformer Framework [26.661935208583756]
Video virtual try-on (VVT) technology has garnered considerable academic interest owing to its promising applications in e-commerce advertising and entertainment. We propose DreamVVT, which is inherently capable of leveraging diverse unpaired human-centric data to enhance adaptability in real-world scenarios. In the first stage, we sample representative frames from the input video and utilize a multi-frame try-on model integrated with a vision-language model (VLM) to synthesize high-fidelity and semantically consistent try-on images. In the second stage, skeleton maps together with fine-grained motion and appearance descriptions are
arXiv Detail & Related papers (2025-08-04T18:27:55Z)
- LoViC: Efficient Long Video Generation with Context Compression [68.22069741704158]
We introduce LoViC, a DiT-based framework trained on million-scale open-domain videos. At the core of our approach is FlexFormer, an expressive autoencoder that jointly compresses video and text into unified latent representations.
arXiv Detail & Related papers (2025-07-17T09:46:43Z)
- PRISM: Video Dataset Condensation with Progressive Refinement and Insertion for Sparse Motion [22.804486552524885]
This paper introduces PRISM, Progressive Refinement and Insertion for Sparse Motion, for video dataset condensation. Unlike the previous method that separates static content from dynamic motion, our method preserves the essential interdependence between these elements. Our approach progressively refines and inserts frames to fully accommodate the motion in an action while achieving better performance with less storage.
arXiv Detail & Related papers (2025-05-28T16:42:10Z)
- MagicTryOn: Harnessing Diffusion Transformer for Garment-Preserving Video Virtual Try-on [28.66545985357718]
Video Virtual Try-On (VVT) aims to synthesize garments that appear natural across consecutive frames, capturing both their dynamics and interactions with human cues. Existing VVT methods still suffer from inadequate garment fidelity and limited temporal consistency. We present MagicTryOn, a diffusion-transformer based framework for garment-constrained video virtual try-on.
arXiv Detail & Related papers (2025-05-27T15:22:02Z)
- STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding [48.12128042470839]
We propose an integrated Spatial-TempOral dynamic Prompting (STOP) model. It consists of two complementary modules, the intra-frame spatial prompting and inter-frame temporal prompting. STOP consistently achieves superior performance against state-of-the-art methods.
arXiv Detail & Related papers (2025-03-20T09:16:20Z)
- Identity-Preserving Text-to-Video Generation by Frequency Decomposition [52.19475797580653]
Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in literature. We propose ConsisID, a tuning-free DiT-based controllable IPT2V model to keep human identity consistent in the generated video.
arXiv Detail & Related papers (2024-11-26T13:58:24Z)
- WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models [132.77237314239025]
Video virtual try-on aims to generate realistic sequences that maintain garment identity and adapt to a person's pose and body shape in source videos.
Traditional image-based methods, relying on warping and blending, struggle with complex human movements and occlusions.
We reconceptualize video try-on as a process of generating videos conditioned on garment descriptions and human motion.
Our solution, WildVidFit, employs image-based controlled diffusion models for a streamlined, one-stage approach.
arXiv Detail & Related papers (2024-07-15T11:21:03Z)