1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On
- URL: http://arxiv.org/abs/2501.05369v1
- Date: Thu, 09 Jan 2025 16:49:04 GMT
- Title: 1-2-1: Renaissance of Single-Network Paradigm for Virtual Try-On
- Authors: Shuliang Ning, Yipeng Qin, Xiaoguang Han
- Abstract summary: We propose a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches.
- Score: 17.226542332700607
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtual Try-On (VTON) has become a crucial tool in e-commerce, enabling the realistic simulation of garments on individuals while preserving their original appearance and pose. Early VTON methods relied on single generative networks, but challenges remain in preserving fine-grained garment details due to limitations in feature extraction and fusion. To address these issues, recent approaches have adopted a dual-network paradigm, incorporating a complementary "ReferenceNet" to enhance garment feature extraction and fusion. While effective, this dual-network approach introduces significant computational overhead, limiting its scalability for high-resolution and long-duration image/video VTON applications. In this paper, we challenge the dual-network paradigm by proposing a novel single-network VTON method that overcomes the limitations of existing techniques. Our method, namely MNVTON, introduces a Modality-specific Normalization strategy that separately processes text, image and video inputs, enabling them to share the same attention layers in a VTON network. Extensive experimental results demonstrate the effectiveness of our approach, showing that it consistently achieves higher-quality, more detailed results for both image and video VTON tasks. Our results suggest that the single-network paradigm can rival the performance of dual-network approaches, offering a more efficient alternative for high-quality, scalable VTON applications.
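The abstract includes no code; as a rough illustration of how a modality-specific normalization could let text, image and video tokens share one attention layer, here is a minimal PyTorch-style sketch (module and parameter names are assumptions, not MNVTON's implementation):

```python
import torch
import torch.nn as nn

class ModalitySpecificNorm(nn.Module):
    """A separate LayerNorm per modality, so text, image and video tokens
    are normalized with their own statistics before mixing."""
    def __init__(self, dim, modalities=("text", "image", "video")):
        super().__init__()
        self.norms = nn.ModuleDict({m: nn.LayerNorm(dim) for m in modalities})

    def forward(self, tokens):
        # tokens: {modality: (B, N_m, dim)}; concatenate along the sequence
        # axis so a single attention layer sees every modality.
        return torch.cat([self.norms[m](x) for m, x in tokens.items()], dim=1)

class SharedAttentionBlock(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.ms_norm = ModalitySpecificNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens):
        h = self.ms_norm(tokens)      # per-modality pre-norm
        out, _ = self.attn(h, h, h)   # one attention layer shared by all modalities
        return out
```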
Related papers
- SimpleGVR: A Simple Baseline for Latent-Cascaded Video Super-Resolution [55.14432034345353]
We study key design principles for the latter, cascaded video super-resolution models, which are currently underexplored. First, we propose two strategies to generate training pairs that better mimic the output characteristics of the base model, ensuring alignment between the VSR model and its upstream generator. Second, we provide critical insights into VSR model behavior through systematic analysis of (1) timestep sampling strategies and (2) noise augmentation effects on low-resolution (LR) inputs.
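A hedged sketch of the noise-augmentation idea on LR inputs (the degradation choices below are placeholders, not SimpleGVR's recipe):

```python
import torch
import torch.nn.functional as F

def make_training_pair(hr, scale=4, sigma=0.05):
    """hr: (B, C, H, W) ground-truth frames in [0, 1]."""
    lr = F.interpolate(hr, scale_factor=1 / scale, mode="bicubic", antialias=True)
    # Noise augmentation: perturb the LR condition so the cascaded VSR model
    # tolerates artifacts resembling those of the upstream base generator.
    return (lr + sigma * torch.randn_like(lr)).clamp(0, 1), hr
```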
arXiv Detail & Related papers (2025-06-24T17:57:26Z)
- Fast State-Augmented Learning for Wireless Resource Allocation with Dual Variable Regression [83.27791109672927]
We show how a state-augmented graph neural network (GNN) parametrization for the resource allocation policy circumvents the drawbacks of the ubiquitous dual subgradient methods. Lagrangian-maximizing state-augmented policies are learned during the offline training phase. We prove a convergence result and an exponential probability bound on the excursions of the dual function (iterate) optimality gaps.
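A minimal sketch of the state-augmentation idea, assuming a policy that conditions on dual variables; the rate model is a stand-in, not the paper's API:

```python
import torch

def achieved_rates(graph_state, power):
    """Stand-in for the wireless environment; replace with the true rate model."""
    return torch.log2(1.0 + power / (1.0 + graph_state.sum(-1)))

def dual_iteration(policy, graph_state, mu, f_min, lr_dual=0.1):
    """mu: per-user dual variables; f_min: minimum-rate constraints."""
    power = policy(graph_state, mu)              # state-augmented: duals are inputs
    rates = achieved_rates(graph_state, power)
    # Dual update: mu grows where the rate constraint is violated.
    mu = torch.clamp(mu - lr_dual * (rates - f_min), min=0.0)
    return power, mu
```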
arXiv Detail & Related papers (2025-06-23T15:20:58Z)
- DS-VTON: High-Quality Virtual Try-on via Disentangled Dual-Scale Generation [38.499761393356124]
DS-VTON is a dual-scale virtual try-on framework that disentangles the try-on objectives across scales for more effective modeling. Our method adopts a fully mask-free generation paradigm, eliminating reliance on human parsing maps or segmentation masks.
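One hedged reading of the dual-scale, mask-free pipeline (function names are placeholders, not DS-VTON's API):

```python
import torch.nn.functional as F

def ds_vton(person, garment, coarse_model, refine_model, s=4):
    """person, garment: (B, 3, H, W); both models are placeholders."""
    small = lambda x: F.interpolate(x, scale_factor=1 / s, mode="bilinear")
    coarse = coarse_model(small(person), small(garment))        # global structure
    up = F.interpolate(coarse, scale_factor=float(s), mode="bilinear")
    # Mask-free: no parsing map or segmentation mask enters either stage.
    return refine_model(up, person, garment)                    # recover texture
```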
arXiv Detail & Related papers (2025-06-01T08:52:57Z)
- UniViTAR: Unified Vision Transformer with Native Resolution [37.63387029787732]
We introduce UniViTAR, a family of homogeneous vision foundation models tailored for the unified-visual-modality and native-resolution scenario.
A progressive training paradigm is introduced, which strategically combines two core mechanisms.
In parallel, a hybrid training framework further synergizes sigmoid-based contrastive loss with feature distillation from a frozen teacher model.
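The sigmoid-based contrastive loss is plausibly SigLIP-style; a sketch combining it with feature distillation from a frozen teacher (temperatures and weights are assumptions):

```python
import torch
import torch.nn.functional as F

def sigmoid_contrastive_loss(img, txt, t=10.0, b=-10.0):
    """img, txt: L2-normalized embeddings, shape (B, D)."""
    logits = t * img @ txt.T + b                                 # pairwise similarity
    labels = 2 * torch.eye(img.size(0), device=img.device) - 1   # +1 matched, -1 unmatched
    return -F.logsigmoid(labels * logits).mean()

def hybrid_loss(img, txt, student_feats, teacher_feats, w=1.0):
    # Feature distillation pulls the student toward the frozen teacher.
    return sigmoid_contrastive_loss(img, txt) + w * F.mse_loss(
        student_feats, teacher_feats.detach())
```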
arXiv Detail & Related papers (2025-04-02T14:59:39Z)
- BHViT: Binarized Hybrid Vision Transformer [53.38894971164072]
Model binarization has made significant progress in enabling real-time and energy-efficient computation for convolutional neural networks (CNNs).
We propose BHViT, a binarization-friendly hybrid ViT architecture and its full binarization model with the guidance of three important observations.
Our proposed algorithm achieves SOTA performance among binary ViT methods.
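For readers unfamiliar with binarization, the standard sign-plus-straight-through-estimator scheme looks as follows; this illustrates the general technique, not BHViT's exact scheme:

```python
import torch

class BinarizeSTE(torch.autograd.Function):
    """Forward: hard {-1, +1} quantization. Backward: straight-through
    estimator that passes gradients only where |x| <= 1."""
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return torch.sign(x)

    @staticmethod
    def backward(ctx, grad_out):
        (x,) = ctx.saved_tensors
        return grad_out * (x.abs() <= 1).to(grad_out.dtype)

binarize = BinarizeSTE.apply  # usable on weights or activations
```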
arXiv Detail & Related papers (2025-03-04T08:35:01Z)
- CatVTON: Concatenation Is All You Need for Virtual Try-On with Diffusion Models [77.39903417768967]
CatVTON is a virtual try-on diffusion model that transfers in-shop or worn garments of arbitrary categories to target individuals.
CatVTON consists only of a VAE and a simplified denoising UNet, removing redundant image and text encoders.
Experiments demonstrate that CatVTON achieves superior qualitative and quantitative results compared to baseline methods.
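A hedged sketch of concatenation-based conditioning with a plain denoising UNet (shapes, the concat axis, and the unet signature are assumptions):

```python
import torch

def denoise_step(unet, person_latent, garment_latent, t):
    """person_latent, garment_latent: (B, 4, H, W) VAE latents."""
    # Concatenate the two latents (here along width) so one UNet sees both;
    # no ReferenceNet and no extra image/text encoder are involved.
    x = torch.cat([person_latent, garment_latent], dim=-1)   # (B, 4, H, 2W)
    eps = unet(x, t)                                         # assumed signature
    return eps[..., : person_latent.shape[-1]]               # keep the person half
```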
arXiv Detail & Related papers (2024-07-21T11:58:53Z)
- D$^4$-VTON: Dynamic Semantics Disentangling for Differential Diffusion based Virtual Try-On [32.73798955587999]
D$^4$-VTON is an innovative solution for image-based virtual try-on.
We address challenges from previous studies, such as semantic inconsistencies before and after garment warping.
arXiv Detail & Related papers (2024-07-21T10:40:53Z)
- GFN: A graph feedforward network for resolution-invariant reduced operator learning in multifidelity applications [0.0]
This work presents a novel resolution-invariant model order reduction strategy for multifidelity applications.
We base our architecture on a novel neural network layer developed in this work, the graph feedforward network.
We exploit the method's capability of training and testing on different mesh sizes in an autoencoder-based reduction strategy for parametrised partial differential equations.
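As a generic stand-in for the graph feedforward layer (the published layer may differ), a message-passing block whose parameter count is independent of mesh size:

```python
import torch
import torch.nn as nn

class GraphFeedforward(nn.Module):
    """Linear map on node features plus mean aggregation over mesh
    neighbors; the weights never depend on the number of nodes, which is
    what permits training and testing on different mesh sizes."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.self_lin = nn.Linear(in_dim, out_dim)
        self.nbr_lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, edge_index):
        # x: (N, in_dim); edge_index: (2, E) rows of (source, destination)
        src, dst = edge_index
        agg = torch.zeros_like(x).index_add_(0, dst, x[src])
        deg = torch.zeros(x.size(0), 1, device=x.device).index_add_(
            0, dst, torch.ones(dst.size(0), 1, device=x.device))
        return torch.relu(self.self_lin(x) + self.nbr_lin(agg / deg.clamp(min=1)))
```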
arXiv Detail & Related papers (2024-06-05T18:31:37Z)
- A-SDM: Accelerating Stable Diffusion through Model Assembly and Feature Inheritance Strategies [51.7643024367548]
The Stable Diffusion Model (SDM) is a prevalent and effective model for text-to-image (T2I) and image-to-image (I2I) generation.
This study focuses on reducing redundant computation in SDM and optimizing the model through both tuning and tuning-free methods.
arXiv Detail & Related papers (2024-05-31T21:47:05Z)
- Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel semantic scene completion (SSC) framework, the Adversarial Modality Modulation Network (AMMNet).
AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition.
Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
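The cross-modal modulation is not specified in this summary; a common FiLM-style realization, shown only to make the idea concrete, is:

```python
import torch.nn as nn

class CrossModalModulation(nn.Module):
    """Features from modality B predict a scale/shift applied to modality A,
    so the loss on A also back-propagates through B's branch."""
    def __init__(self, dim):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, feat_a, feat_b):
        scale, shift = self.to_scale_shift(feat_b).chunk(2, dim=-1)
        return feat_a * (1 + scale) + shift
```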
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
- When Parameter-efficient Tuning Meets General-purpose Vision-language Models [65.19127815275307]
PETAL revolutionizes the training process by requiring only 0.5% of the total parameters, achieved through a unique mode approximation technique.
Our experiments reveal that PETAL not only outperforms current state-of-the-art methods in most scenarios but also surpasses full fine-tuning models in effectiveness.
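The mode-approximation technique itself is not detailed here; as a generic stand-in, a low-rank adapter shows how training well under 1% of the weights can steer a frozen backbone:

```python
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Hypothetical adapter: the pretrained weight stays frozen and only a
    rank-r correction (a tiny fraction of the parameters) is trained."""
    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base.requires_grad_(False)   # frozen backbone layer
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)           # adapter starts as a no-op

    def forward(self, x):
        return self.base(x) + self.up(self.down(x))
```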
arXiv Detail & Related papers (2023-12-16T17:13:08Z)
- Full-Duplex Strategy for Video Object Segmentation [141.43983376262815]
The full-duplex strategy network (FSNet) is a novel framework for video object segmentation (VOS).
Our FSNet performs cross-modal feature passing (i.e., transmission and receiving) simultaneously before the fusion and decoding stage.
We show that our FSNet outperforms other state-of-the-art methods on both the VOS and video salient object detection tasks.
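A hedged sketch of a full-duplex exchange in which appearance and motion features attend to each other simultaneously before fusion (layer choices are illustrative, not FSNet's published architecture):

```python
import torch.nn as nn

class DuplexExchange(nn.Module):
    """Both cross-attention directions are computed from the same inputs
    ("simultaneous" transmission and receiving), then applied as residuals."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.rgb_from_flow = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.flow_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, rgb, flow):                # (B, N, dim) token sequences
        r2, _ = self.rgb_from_flow(rgb, flow, flow)
        f2, _ = self.flow_from_rgb(flow, rgb, rgb)
        return rgb + r2, flow + f2
```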
arXiv Detail & Related papers (2021-08-06T14:50:50Z)
- Parameter Sharing Exploration and Hetero-Center based Triplet Loss for Visible-Thermal Person Re-Identification [17.402673438396345]
This paper focuses on the visible-thermal cross-modality person re-identification (VT Re-ID) task.
Our proposed method distinctly outperforms the state-of-the-art methods by large margins.
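A sketch of a hetero-center triplet loss, enforcing the margin between per-identity, per-modality centers rather than individual samples (mining and margin details are assumptions):

```python
import torch
import torch.nn.functional as F

def hetero_center_triplet(vis_feats, thm_feats, labels, margin=0.3):
    """Assumes each identity appears in both modalities and the batch
    contains at least two identities."""
    ids = labels.unique()
    loss = 0.0
    for c in ids:
        cv = vis_feats[labels == c].mean(0)   # visible-modality center
        ct = thm_feats[labels == c].mean(0)   # thermal-modality center
        pos = (cv - ct).norm()                # pull cross-modality centers together
        negs = torch.stack([vis_feats[labels == o].mean(0) for o in ids if o != c])
        neg = torch.cdist(cv[None], negs).min()   # nearest other-identity center
        loss = loss + F.relu(pos + margin - neg)
    return loss / len(ids)
```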
arXiv Detail & Related papers (2020-08-14T07:40:35Z)