Mobile-VTON: High-Fidelity On-Device Virtual Try-On
- URL: http://arxiv.org/abs/2603.00947v2
- Date: Tue, 03 Mar 2026 09:12:16 GMT
- Title: Mobile-VTON: High-Fidelity On-Device Virtual Try-On
- Authors: Zhenchen Wan, Ce Chen, Runqi Lin, Jiaxin Huang, Tianxi Chen, Yanwu Xu, Tongliang Liu, Mingming Gong
- Abstract summary: Mobile-VTON is a high-quality, privacy-preserving framework for virtual try-on. It enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image.
- Score: 75.5009105664896
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Virtual try-on (VTON) has recently achieved impressive visual fidelity, but most existing systems require uploading personal photos to cloud-based GPUs, raising privacy concerns and limiting on-device deployment. To address this, we present Mobile-VTON, a high-quality, privacy-preserving framework that enables fully offline virtual try-on on commodity mobile devices using only a single user image and a garment image. Mobile-VTON introduces a modular TeacherNet-GarmentNet-TryonNet (TGT) architecture that integrates knowledge distillation, garment-conditioned generation, and garment alignment into a unified pipeline optimized for on-device efficiency. Within this framework, we propose a Feature-Guided Adversarial (FGA) Distillation strategy that combines teacher supervision with adversarial learning to better match real-world image distributions. GarmentNet is trained with a trajectory-consistency loss to preserve garment semantics across diffusion steps, while TryonNet uses latent concatenation and lightweight cross-modal conditioning to enable robust garment-to-person alignment without large-scale pretraining. By combining these components, Mobile-VTON achieves high-fidelity generation with low computational overhead. Experiments on VITON-HD and DressCode at 1024 x 768 show that it matches or outperforms strong server-based baselines while running entirely offline. These results demonstrate that high-quality VTON is not only feasible but also practical on-device, offering a secure solution for real-world applications.
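The abstract names the Feature-Guided Adversarial (FGA) distillation objective but gives no formulation. Below is a minimal PyTorch sketch of the general pattern it describes, assuming a standard setup: an output- and feature-level distillation term against the teacher, plus a non-saturating adversarial term pushing student outputs toward the real-image distribution. All module names, signatures, and loss weights here are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of FGA-style distillation: the student matches teacher
# outputs and intermediate features (distillation) while a discriminator
# pushes student outputs toward the real-image distribution (adversarial).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchDiscriminator(nn.Module):
    """Tiny PatchGAN-style discriminator (a stand-in, not the paper's)."""
    def __init__(self, ch=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(ch, 64, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 128, 4, 2, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(128, 1, 4, 1, 1),
        )
    def forward(self, x):
        return self.net(x)

def fga_losses(student_out, student_feats, teacher_out, teacher_feats,
               disc, real_images, w_feat=1.0, w_adv=0.1):
    # 1) Output-level distillation: match the teacher's prediction.
    l_out = F.mse_loss(student_out, teacher_out.detach())
    # 2) Feature-guided distillation: match intermediate teacher features.
    l_feat = sum(F.mse_loss(s, t.detach())
                 for s, t in zip(student_feats, teacher_feats))
    # 3) Adversarial term: non-saturating GAN loss on student outputs.
    l_adv = F.softplus(-disc(student_out)).mean()
    gen_loss = l_out + w_feat * l_feat + w_adv * l_adv
    # Discriminator loss, trained in alternation with the student.
    d_loss = (F.softplus(-disc(real_images)).mean()
              + F.softplus(disc(student_out.detach())).mean())
    return gen_loss, d_loss
```

In practice the student and discriminator would be optimized in alternation, as in any GAN setup; how the paper weights or schedules these terms is not stated in the abstract.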
Related papers
- OmniVTON++: Training-Free Universal Virtual Try-On with Principal Pose Guidance [85.23143742905695]
Image-based Virtual Try-On (VTON) concerns the synthesis of realistic person imagery through garment re-rendering under human pose and body constraints. We present OmniVTON++, a training-free VTON framework designed for universal applicability.
arXiv Detail & Related papers (2026-02-16T08:27:43Z)
- OmniVTON: Training-Free Universal Virtual Try-On [53.31945401098557]
Image-based Virtual Try-On (VTON) techniques rely on either supervised in-shop approaches or unsupervised in-the-wild methods, which improve adaptability but remain constrained by data biases and limited universality. We propose OmniVTON, the first training-free universal VTON framework that decouples garment and pose conditioning to achieve both texture fidelity and pose consistency across diverse settings.
arXiv Detail & Related papers (2025-07-20T16:37:53Z)
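The OmniVTON entry above attributes its universality to decoupling garment and pose conditioning. OmniVTON itself is training-free, so the toy module below is not its mechanism; it only illustrates, with assumed shapes and names, what routing the two signals through separate pathways can look like.

```python
# Hypothetical illustration of decoupled conditioning: garment (texture)
# and pose signals enter the denoiser through separate pathways instead
# of one fused embedding. Names and shapes are illustrative assumptions.
import torch
import torch.nn as nn

class DecoupledDenoiser(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.garment_proj = nn.Conv2d(4, dim, 1)   # texture pathway
        self.pose_proj = nn.Conv2d(18, dim, 1)     # pose-map pathway
        self.backbone = nn.Conv2d(4 + dim + dim, 4, 3, padding=1)

    def forward(self, x_t, garment_latent, pose_map):
        g = self.garment_proj(garment_latent)  # conditions texture only
        p = self.pose_proj(pose_map)           # conditions pose only
        return self.backbone(torch.cat([x_t, g, p], dim=1))

x_t = torch.randn(1, 4, 64, 48)       # noisy try-on latent
garment = torch.randn(1, 4, 64, 48)   # garment latent
pose = torch.randn(1, 18, 64, 48)     # e.g. 18-channel keypoint heatmaps
eps = DecoupledDenoiser()(x_t, garment, pose)
```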
- ITVTON: Virtual Try-On Diffusion Transformer Based on Integrated Image and Text [1.7071356210178177]
ITVTON is an efficient framework that leverages the Diffusion Transformer (DiT) as its single generator to improve image fidelity. It concatenates garment and person images along the width dimension and incorporates textual descriptions from both. Experiments on 10,257 image pairs from IGPair confirm ITVTON's robustness in real-world scenarios.
arXiv Detail & Related papers (2025-01-28T07:24:15Z)
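The ITVTON entry above describes joining garment and person images along the width dimension so a single DiT generator sees both. A minimal sketch of that operation, with assumed latent shapes (the DiT itself is omitted):

```python
# Width-axis concatenation sketch: person and garment latents are joined
# side-by-side so a single transformer attends across both halves.
# Shapes are assumptions, not ITVTON's actual configuration.
import torch

person = torch.randn(1, 4, 128, 96)   # person image latent  (B, C, H, W)
garment = torch.randn(1, 4, 128, 96)  # garment image latent (B, C, H, W)

# Concatenate along width -> (1, 4, 128, 192); tokens from both halves
# then interact through the transformer's global self-attention.
dit_input = torch.cat([person, garment], dim=-1)
print(dit_input.shape)  # torch.Size([1, 4, 128, 192])
```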
- 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On [17.226542332700607]
We propose a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image, and video inputs. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches.
arXiv Detail & Related papers (2025-01-09T16:49:04Z)
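The MNVTON entry above hinges on a Modality-specific Normalization strategy. Below is a hedged sketch of one plausible reading, where a shared backbone uses separate normalization parameters per modality; the class and dimensions are illustrative, not MNVTON's code.

```python
# Hypothetical modality-specific normalization: one shared backbone, but
# each modality (text / image / video) gets its own LayerNorm parameters.
import torch
import torch.nn as nn

class ModalityNorm(nn.Module):
    def __init__(self, dim, modalities=("text", "image", "video")):
        super().__init__()
        self.norms = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})

    def forward(self, x, modality):
        # Same normalization machinery, separate affine params per modality.
        return self.norms[modality](x)

norm = ModalityNorm(dim=256)
img_tokens = torch.randn(1, 196, 256)
txt_tokens = torch.randn(1, 77, 256)
out = norm(img_tokens, "image"), norm(txt_tokens, "text")
```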
- IMAGDressing-v1: Customizable Virtual Dressing [58.44155202253754]
IMAGDressing-v1 tackles a virtual dressing task: generating freely editable human images with fixed garments and optional conditions.
IMAGDressing-v1 incorporates a garment UNet that captures semantic features from CLIP and texture features from a VAE.
We present a hybrid attention module, including a frozen self-attention and a trainable cross-attention, to integrate garment features from the garment UNet into a frozen denoising UNet.
arXiv Detail & Related papers (2024-07-17T16:26:30Z)
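The IMAGDressing-v1 entry above describes a hybrid attention module: a frozen self-attention plus a trainable cross-attention that reads garment-UNet features into a frozen denoising UNet. A rough PyTorch sketch under assumed dimensions follows; combining the two paths by summation is our assumption.

```python
# Hedged sketch of a hybrid attention block: frozen self-attention over
# denoising features plus a trainable cross-attention over garment features.
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, dim=320, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        for p in self.self_attn.parameters():
            p.requires_grad = False  # frozen, would reuse pretrained weights

    def forward(self, x, garment_feats):
        sa, _ = self.self_attn(x, x, x)                           # frozen path
        ca, _ = self.cross_attn(x, garment_feats, garment_feats)  # trainable
        return sa + ca  # inject garment features into the frozen UNet block

x = torch.randn(1, 1024, 320)  # denoising UNet tokens (assumed shape)
g = torch.randn(1, 1024, 320)  # garment UNet features (assumed shape)
y = HybridAttention()(x, g)
```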
- OOTDiffusion: Outfitting Fusion based Latent Diffusion for Controllable Virtual Try-on [7.46772222515689]
OOTDiffusion is a novel network architecture for realistic and controllable image-based virtual try-on.
We leverage the power of pretrained latent diffusion models, designing an outfitting UNet to learn the garment detail features.
Our experiments on the VITON-HD and Dress Code datasets demonstrate that OOTDiffusion efficiently generates high-quality try-on results.
arXiv Detail & Related papers (2024-03-04T07:17:44Z)
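The OOTDiffusion entry above mentions an outfitting UNet whose garment detail features are fused into a pretrained denoising UNet. One common fusion pattern, concatenating garment tokens into a self-attention call and keeping only the person half, is sketched below; shapes and the exact fusion point are assumptions, not the paper's released code.

```python
# Rough sketch of fusing garment features into a denoising UNet's
# self-attention: concatenate garment tokens with person tokens, attend,
# then keep the person half so garment detail flows into it.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=320, num_heads=8, batch_first=True)

person_tokens = torch.randn(1, 1024, 320)   # denoising UNet features
garment_tokens = torch.randn(1, 1024, 320)  # outfitting UNet features

joint = torch.cat([person_tokens, garment_tokens], dim=1)  # (1, 2048, 320)
fused, _ = attn(joint, joint, joint)
person_out = fused[:, :1024]  # garment details attended into the person half
```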
- Stitched ViTs are Flexible Vision Backbones [51.441023711924835]
We are inspired by stitchable neural networks (SN-Net) to produce a single model that covers a rich space of networks by stitching pretrained model families.
We introduce SN-Netv2, a systematically improved model stitching framework to facilitate downstream task adaptation.
SN-Netv2 demonstrates superior performance over SN-Netv1 on downstream dense predictions and shows strong ability as a flexible vision backbone.
arXiv Detail & Related papers (2023-06-30T22:05:34Z)
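The SN-Netv2 entry above builds on stitching pretrained model families into one flexible backbone. Below is a minimal illustration of a stitching layer bridging stand-in blocks of two differently sized ViTs; the real framework stitches full pretrained transformers and chooses stitch points systematically, which this sketch omits.

```python
# Minimal model-stitching sketch: a small learned "stitching layer" maps
# activations from an early block of one network into the feature space
# of a later block of another. Dimensions and split points are arbitrary.
import torch
import torch.nn as nn

class StitchingLayer(nn.Module):
    def __init__(self, dim_in, dim_out):
        super().__init__()
        self.proj = nn.Linear(dim_in, dim_out)  # typically the only new params

    def forward(self, x):
        return self.proj(x)

small_front = nn.Sequential(nn.Linear(384, 384), nn.GELU())  # stand-in blocks
large_back = nn.Sequential(nn.Linear(768, 768), nn.GELU())   # of two "ViTs"
stitch = StitchingLayer(384, 768)

tokens = torch.randn(1, 196, 384)
out = large_back(stitch(small_front(tokens)))  # stitched forward pass
```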
- PASTA-GAN++: A Versatile Framework for High-Resolution Unpaired Virtual Try-on [70.12285433529998]
PASTA-GAN++ is a versatile system for high-resolution unpaired virtual try-on.
It supports unsupervised training, arbitrary garment categories, and controllable garment editing.
arXiv Detail & Related papers (2022-07-27T11:47:49Z)
- EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers [88.52500757894119]
Self-attention based vision transformers (ViTs) have emerged as a very competitive architecture alternative to convolutional neural networks (CNNs) in computer vision.
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs.
arXiv Detail & Related papers (2022-05-06T18:17:19Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.