M3D-VTON: A Monocular-to-3D Virtual Try-On Network
- URL: http://arxiv.org/abs/2108.05126v1
- Date: Wed, 11 Aug 2021 10:05:17 GMT
- Title: M3D-VTON: A Monocular-to-3D Virtual Try-On Network
- Authors: Fuwei Zhao, Zhenyu Xie, Michael Kampffmeyer, Haoye Dong, Songfang Han,
Tianxiang Zheng, Tao Zhang, Xiaodan Liang
- Abstract summary: Existing 3D virtual try-on methods mainly rely on annotated 3D human shapes and garment templates.
We propose a novel Monocular-to-3D Virtual Try-On Network (M3D-VTON) that builds on the merits of both 2D and 3D approaches.
- Score: 62.77413639627565
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Virtual 3D try-on can provide an intuitive and realistic view for online
shopping and has huge potential commercial value. However, existing 3D
virtual try-on methods mainly rely on annotated 3D human shapes and garment
templates, which hinders their application in practical scenarios. 2D virtual
try-on approaches provide a faster alternative for manipulating clothed humans,
but lack a rich and realistic 3D representation. In this paper, we propose a
novel Monocular-to-3D Virtual Try-On Network (M3D-VTON) that builds on the
merits of both 2D and 3D approaches. By integrating 2D information efficiently
and learning a mapping that lifts the 2D representation to 3D, we make the
first attempt to reconstruct a 3D try-on mesh taking only the target clothing
and a person image as inputs. The proposed M3D-VTON includes three modules: 1)
The Monocular Prediction Module (MPM) that estimates an initial full-body depth
map and accomplishes 2D clothes-person alignment through a novel two-stage
warping procedure; 2) The Depth Refinement Module (DRM) that refines the
initial body depth to produce more detailed pleat and face characteristics; 3)
The Texture Fusion Module (TFM) that fuses the warped clothing with the
non-target body part to refine the results. We also construct a high-quality
synthesized Monocular-to-3D virtual try-on dataset, in which each person image
is associated with a front and a back depth map. Extensive experiments
demonstrate that the proposed M3D-VTON can manipulate and reconstruct the 3D
human body wearing the given clothing with compelling details and is more
efficient than other 3D approaches.
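The following PyTorch-style sketch is a rough, hypothetical illustration of the three-module pipeline described in the abstract; it is not the authors' implementation. The module internals (network depths, the exact two-stage warping, the fusion strategy) are stand-ins chosen for brevity, and only the data flow (target clothing + person image → initial depth and warped clothing → refined depth → fused try-on texture) follows the abstract.

```python
# Minimal, hypothetical sketch of the M3D-VTON data flow (not the authors' code).
import torch
import torch.nn as nn
import torch.nn.functional as F


def conv_block(in_ch, out_ch):
    """Small conv stack used as a stand-in for the real sub-networks."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
    )


class MonocularPredictionModule(nn.Module):
    """MPM stand-in: predicts an initial full-body (front/back) depth map and a
    coarse clothes-person alignment. The paper's two-stage warping is simplified
    here to a single predicted affine warp."""

    def __init__(self):
        super().__init__()
        self.depth_net = nn.Sequential(conv_block(6, 32), nn.Conv2d(32, 2, 1))
        self.theta_net = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(),
                                       nn.Linear(6 * 8 * 8, 6))

    def forward(self, clothing, person):
        x = torch.cat([clothing, person], dim=1)
        init_depth = self.depth_net(x)                        # (B, 2, H, W)
        theta = self.theta_net(x).view(-1, 2, 3)              # affine warp params
        grid = F.affine_grid(theta, clothing.shape, align_corners=False)
        warped_clothing = F.grid_sample(clothing, grid, align_corners=False)
        return init_depth, warped_clothing


class DepthRefinementModule(nn.Module):
    """DRM stand-in: adds a residual to the initial depth to model finer detail."""

    def __init__(self):
        super().__init__()
        self.refine = nn.Sequential(conv_block(2 + 3, 32), nn.Conv2d(32, 2, 1))

    def forward(self, init_depth, person):
        residual = self.refine(torch.cat([init_depth, person], dim=1))
        return init_depth + residual                           # refined depth


class TextureFusionModule(nn.Module):
    """TFM stand-in: fuses warped clothing with the non-target body parts
    via a predicted soft mask."""

    def __init__(self):
        super().__init__()
        self.fuse = nn.Sequential(conv_block(6, 32), nn.Conv2d(32, 1, 1), nn.Sigmoid())

    def forward(self, warped_clothing, person):
        mask = self.fuse(torch.cat([warped_clothing, person], dim=1))
        return mask * warped_clothing + (1 - mask) * person    # try-on texture


class M3DVTONSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.mpm = MonocularPredictionModule()
        self.drm = DepthRefinementModule()
        self.tfm = TextureFusionModule()

    def forward(self, clothing, person):
        init_depth, warped = self.mpm(clothing, person)
        refined_depth = self.drm(init_depth, person)
        texture = self.tfm(warped, person)
        return refined_depth, texture                          # inputs for meshing


if __name__ == "__main__":
    clothing = torch.rand(1, 3, 256, 192)   # target clothing image
    person = torch.rand(1, 3, 256, 192)     # monocular person image
    depth, texture = M3DVTONSketch()(clothing, person)
    print(depth.shape, texture.shape)       # (1, 2, 256, 192), (1, 3, 256, 192)
```

In the actual method, the refined front/back depth maps and the fused texture would then be combined into a textured 3D try-on mesh; that meshing step is outside the scope of this sketch.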
Related papers
- Any2Point: Empowering Any-modality Large Models for Efficient 3D Understanding [83.63231467746598]
We introduce Any2Point, a parameter-efficient method to empower any-modality large models (vision, language, audio) for 3D understanding.
We propose a 3D-to-any (1D or 2D) virtual projection strategy that correlates the input 3D points to the original 1D or 2D positions within the source modality.
arXiv Detail & Related papers (2024-04-11T17:59:45Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representation.
PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks for the first time, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- 3D-LLM: Injecting the 3D World into Large Language Models [60.43823088804661]
Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning.
We propose to inject the 3D world into large language models and introduce a new family of 3D-LLMs.
Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks.
arXiv Detail & Related papers (2023-07-24T17:59:02Z)
- TANDEM3D: Active Tactile Exploration for 3D Object Recognition [16.548376556543015]
We propose TANDEM3D, a method that applies a co-training framework for 3D object recognition with tactile signals.
TANDEM3D is based on a novel encoder that builds 3D object representation from contact positions and normals using PointNet++.
Our method is trained entirely in simulation and validated with real-world experiments.
arXiv Detail & Related papers (2022-09-19T05:54:26Z)
- Can We Solve 3D Vision Tasks Starting from A 2D Vision Transformer? [111.11502241431286]
Vision Transformers (ViTs) have proven to be effective in solving 2D image understanding tasks.
ViTs for 2D and 3D tasks have so far adopted vastly different architecture designs that are hardly transferable.
This paper demonstrates the appealing promise of understanding the 3D visual world using a standard 2D ViT architecture.
arXiv Detail & Related papers (2022-09-15T03:34:58Z)
- Learning 3D Object Shape and Layout without 3D Supervision [26.575177430506667]
A 3D scene consists of a set of objects, each with a shape and a layout giving its position in space.
We propose a method that learns to predict 3D shape and layout for objects without any ground truth shape or layout information.
Our approach outperforms supervised approaches trained on smaller and less diverse datasets.
arXiv Detail & Related papers (2022-06-14T17:49:44Z)
- 3D-to-2D Distillation for Indoor Scene Parsing [78.36781565047656]
We present a new approach that enables us to leverage 3D features extracted from large-scale 3D data repository to enhance 2D features extracted from RGB images.
First, we distill 3D knowledge from a pretrained 3D network to supervise a 2D network to learn simulated 3D features from 2D features during the training.
Second, we design a two-stage dimension normalization scheme to calibrate the 2D and 3D features for better integration.
Third, we design a semantic-aware adversarial training model to extend our framework for training with unpaired 3D data.
arXiv Detail & Related papers (2021-04-06T02:22:24Z)
This list is automatically generated from the titles and abstracts of the papers on this site.