Fugu-MT Paper Translation (Abstract): JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on

Paper overview: JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on

arxiv url: http://arxiv.org/abs/2508.17614v1
Date: Mon, 25 Aug 2025 02:43:57 GMT
Status: Translation complete
Last updated in system: 2025-08-26 18:43:45.608026
Title: JCo-MVTON: Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-on
Title (reference translation): JCo-MVTON: A Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On
Authors: Aowen Wang, Wei Li, Hao Luo, Mengxing Ao, Chenyu Zhu, Xinyang Li, Fan Wang
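Abstract summary: JCo-MVTON is a novel framework that overcomes the limitations of earlier virtual try-on systems by integrating diffusion-based image generation with multi-modal conditional fusion. It achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations.
Reference score (independently computed attention score): 15.59886380067986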
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. In this paper, we propose JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On), a novel framework that overcomes these limitations by integrating diffusion-based image generation with multi-modal conditional fusion. Built upon a Multi-Modal Diffusion Transformer (MM-DiT) backbone, our approach directly incorporates diverse control signals, such as the reference person image and the target garment image, into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further enhanced with refined positional encodings and attention masks, enabling precise spatial alignment and improved garment-person integration. To address data scarcity and quality, we introduce a bidirectional generation strategy for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric "Try-Off" model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing iterative improvement in visual fidelity and diversity. Experiments demonstrate that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. Moreover, it shows strong generalization in real-world applications, surpassing commercial systems.
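Abstract (reference translation): Virtual try-on systems have long been hindered by heavy reliance on human body masks, limited fine-grained control over garment attributes, and poor generalization to real-world, in-the-wild scenarios. This paper proposes JCo-MVTON (Jointly Controllable Multi-Modal Diffusion Transformer for Mask-Free Virtual Try-On). Built on a Multi-Modal Diffusion Transformer (MM-DiT) backbone, the method incorporates diverse control signals, such as the reference person image and the target garment image, into the denoising process through dedicated conditional pathways that fuse features within the self-attention layers. This fusion is further strengthened by refined positional encodings and attention masks, yielding precise spatial alignment and better garment-person integration. To address data scarcity and quality, a bidirectional generation strategy is introduced for dataset construction: one pipeline uses a mask-based model to generate realistic reference images, while a symmetric "Try-Off" model, trained in a self-supervised manner, recovers the corresponding garment images. The synthesized dataset undergoes rigorous manual curation, allowing visual fidelity and diversity to be improved iteratively. Experiments show that JCo-MVTON achieves state-of-the-art performance on public benchmarks including DressCode, significantly outperforming existing methods in both quantitative metrics and human evaluations. It also shows strong generalization in real-world applications, surpassing commercial systems.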
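
The abstract describes fusing person and garment features with the denoising stream inside self-attention, guided by attention masks. The following is a minimal sketch of that general idea, not the authors' implementation: tokens from the three sources are concatenated into one sequence and a group-level mask decides who may attend to whom. The mask layout (latent tokens see everything, each condition branch sees only itself), the identity query/key/value projections, and the names `build_mask` and `masked_self_attention` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def softmax(scores):
    """Numerically stable softmax; -inf entries receive zero weight."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def build_mask(groups, allowed):
    """mask[i][j] is True when token i (query) may attend to token j (key)."""
    return [[(gi, gj) in allowed for gj in groups] for gi in groups]


def masked_self_attention(q, k, v, mask):
    """Single-head scaled dot-product attention over one joint token sequence."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        scores = [
            scale * sum(a * b for a, b in zip(qi, kj)) if mask[i][j] else float("-inf")
            for j, kj in enumerate(k)
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * vj[d] for w, vj in zip(weights, v)) for d in range(len(v[0]))]
        )
    return out


# Joint sequence: two noisy-latent tokens, one person token, one garment token.
groups = ["latent", "latent", "person", "garment"]

# Assumed mask layout (hypothetical): latents read from every branch,
# while each condition branch only attends within itself.
allowed = {
    ("latent", "latent"),
    ("latent", "person"),
    ("latent", "garment"),
    ("person", "person"),
    ("garment", "garment"),
}

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
mask = build_mask(groups, allowed)

# Identity projections (q = k = v = tokens) keep the sketch short;
# a real model would apply learned projections, possibly per branch.
fused = masked_self_attention(tokens, tokens, tokens, mask)
```

With this layout the latent tokens become mixtures of all four tokens, while the person and garment tokens pass through unchanged because each is the only member of its branch. Changing the `allowed` set is enough to experiment with other fusion patterns.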


Related papers