Fugu-MT Paper Translation (Abstract): iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

Paper overview: iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance

arxiv url: http://arxiv.org/abs/2605.21431v1
Date: Wed, 20 May 2026 17:23:32 GMT
Status: Translation complete
Last updated in system: 2026-05-21 19:19:56.810314
Title: iTryOn: Mastering Interactive Video Virtual Try-On with Spatial-Semantic Guidance
Title (reference translation): iTryOn: Mastering Interactive Video Virtual Try-On via Spatial-Semantic Guidance
Authors: Jun Zheng, Zhengze Xu, Mengting Chen, Jing Wang, Jinsong Lan, Xiaoyong Zhu, Kaifu Zhang, Bo Zheng, Xiaodan Liang
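Abstract summary: Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. In Interactive Video Virtual Try-On (Interactive VVT), subjects in the video actively engage with their clothing. The authors propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer.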
Reference score (independently computed attention score): 51.550949809895975
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Video Virtual Try-On (VVT) aims to seamlessly replace a garment on a person in a video with a new one. While existing methods have made significant strides in maintaining temporal consistency, they are predominantly confined to non-interactive scenarios where models merely showcase garments. This limitation overlooks a crucial aspect of real-world apparel presentation: active human-garment interaction. To bridge this gap, we introduce and formalize a new challenging task: Interactive Video Virtual Try-On (Interactive VVT), where subjects in the video actively engage with their clothing. This task introduces unique challenges beyond simple texture preservation, including: (1) resolving the semantic ambiguity of interactions from standard pose information, and (2) learning complex garment deformations from video where interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior to provide fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn leverages global captions for overall context and time-stamped action captions for localized interactions, synchronized via our novel Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments demonstrate that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking a significant step towards more dynamic and controllable virtual try-on experiences.
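Abstract (reference translation): Video Virtual Try-On (VVT) aims to seamlessly replace the garment worn by a person in a video with a new one. Existing methods have made great progress in maintaining temporal consistency, but they are mostly limited to non-interactive scenarios in which models simply display garments. This limitation overlooks active human-garment interaction, an important aspect of real-world apparel presentation. To fill this gap, we introduce and formalize a new, challenging task, Interactive Video Virtual Try-On (Interactive VVT). This task introduces unique challenges beyond simple texture preservation, including (1) resolving the semantic ambiguity of interactions from standard pose information and (2) learning complex garment deformations from videos in which interactive moments are sparse and brief. To address these challenges, we propose iTryOn, a novel framework built upon a large-scale video diffusion Transformer. iTryOn pioneers a multi-level interaction injection mechanism to guide the generation of complex dynamics. At the spatial level, we introduce a garment-agnostic 3D hand prior that provides fine-grained guidance for precise hand-garment contact, effectively resolving spatial ambiguity. At the semantic level, iTryOn uses global captions for overall context and time-stamped action captions for localized interactions, synchronized via Action-aware Rotational Position Embedding (A-RoPE). Extensive experiments show that iTryOn not only achieves state-of-the-art performance on traditional VVT benchmarks but also establishes a commanding lead in the new interactive setting, marking an important step toward more dynamic and controllable virtual try-on experiences.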

Related papers