Related papers: From 2D Images to 3D Model:Weakly Supervised Multi-View Face Reconstruction with Deep Fusion

From 2D Images to 3D Model:Weakly Supervised Multi-View Face Reconstruction with Deep Fusion

URL: http://arxiv.org/abs/2204.03842v5
Date: Mon, 23 Dec 2024 19:10:36 GMT
Title: From 2D Images to 3D Model:Weakly Supervised Multi-View Face Reconstruction with Deep Fusion
Authors: Weiguang Zhao, Chaolong Yang, Jianan Ye, Rui Zhang, Yuyao Yan, Xi Yang, Bin Dong, Amir Hussain, Kaizhu Huang,
Abstract summary: We propose a novel pipeline called Deep Fusion MVR to explore the feature correspondences between multi-view images and reconstruct high-precision 3D faces.<n>Specifically, we present a novel multi-view feature fusion backbone that utilizes face masks to align features from multiple encoders.<n>We develop one concise face mask mechanism that facilitates multi-view feature fusion and facial reconstruction.
Score: 25.068822438649928
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: While weakly supervised multi-view face reconstruction (MVR) is garnering increased attention, one critical issue still remains open: how to effectively interact and fuse multiple image information to reconstruct high-precision 3D models. In this regard, we propose a novel pipeline called Deep Fusion MVR (DF-MVR) to explore the feature correspondences between multi-view images and reconstruct high-precision 3D faces. Specifically, we present a novel multi-view feature fusion backbone that utilizes face masks to align features from multiple encoders and integrates one multi-layer attention mechanism to enhance feature interaction and fusion, resulting in one unified facial representation. Additionally, we develop one concise face mask mechanism that facilitates multi-view feature fusion and facial reconstruction by identifying common areas and guiding the network's focus on critical facial features (e.g., eyes, brows, nose, and mouth). Experiments on Pixel-Face and Bosphorus datasets indicate the superiority of our model. Without 3D annotation, DF-MVR achieves 5.2% and 3.0% RMSE improvement over the existing weakly supervised MVRs respectively on Pixel-Face and Bosphorus dataset. Code will be available publicly at https://github.com/weiguangzhao/DF_MVR.

Related papers

CDI3D: Cross-guided Dense-view Interpolation for 3D Reconstruction [25.468907201804093]
Large Reconstruction Models (LRMs) have shown great promise in leveraging multi-view images generated by 2D diffusion models to extract 3D content. However, 2D diffusion models often struggle to produce dense images with strong multi-view consistency. We present CDI3D, a feed-forward framework designed for efficient, high-quality image-to-3D generation with view.
arXiv Detail & Related papers (2025-03-11T03:08:43Z)
Refine3DNet: Scaling Precision in 3D Object Reconstruction from Multi-View RGB Images using Attention [2.037112541541094]
We introduce a hybrid strategy featuring a visual auto-encoder with self-attention mechanisms and a 3D refiner network. Our approach, combined with JTSO, outperforms state-of-the-art techniques in single and multi-view 3D reconstruction.
arXiv Detail & Related papers (2024-12-01T08:53:39Z)
Fancy123: One Image to High-Quality 3D Mesh Generation via Plug-and-Play Deformation [22.5996658181606]
We propose Fancy123, featuring two enhancement modules and an unprojection operation to address the above three issues. The appearance enhancement module deforms the 2D multiview images to realign pixels for better multiview consistency. The fidelity enhancement module deforms the 3D mesh to match the input image. The unprojection of the input image and deformed multiview images onto LRM's generated mesh ensures high clarity.
arXiv Detail & Related papers (2024-11-25T08:31:55Z)
Flex3D: Feed-Forward 3D Generation With Flexible Reconstruction Model And Input View Curation [61.040832373015014]
We propose Flex3D, a novel framework for generating high-quality 3D content from text, single images, or sparse view images. We employ a fine-tuned multi-view image diffusion model and a video diffusion model to generate a pool of candidate views, enabling a rich representation of the target 3D object. In the second stage, the curated views are fed into a Flexible Reconstruction Model (FlexRM), built upon a transformer architecture that can effectively process an arbitrary number of inputs.
arXiv Detail & Related papers (2024-10-01T17:29:43Z)
Pixel-Aligned Multi-View Generation with Depth Guided Decoder [86.1813201212539]
We propose a novel method for pixel-level image-to-multi-view generation. Unlike prior work, we incorporate attention layers across multi-view images in the VAE decoder of a latent video diffusion model. Our model enables better pixel alignment across multi-view images.
arXiv Detail & Related papers (2024-08-26T04:56:41Z)
MVGamba: Unify 3D Content Generation as State Space Sequence Modeling [150.80564081817786]
We introduce MVGamba, a general and lightweight Gaussian reconstruction model featuring a multi-view Gaussian reconstructor. With off-the-detail multi-view diffusion models integrated, MVGamba unifies 3D generation tasks from a single image, sparse images, or text prompts. Experiments demonstrate that MVGamba outperforms state-of-the-art baselines in all 3D content generation scenarios with approximately only $0.1times$ of the model size.
arXiv Detail & Related papers (2024-06-10T15:26:48Z)
Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data [80.92268916571712]
A critical bottleneck is the scarcity of high-quality 3D objects with detailed captions. We propose Bootstrap3D, a novel framework that automatically generates an arbitrary quantity of multi-view images. We have generated 1 million high-quality synthetic multi-view images with dense descriptive captions.
arXiv Detail & Related papers (2024-05-31T17:59:56Z)
Envision3D: One Image to 3D with Anchor Views Interpolation [18.31796952040799]
We present Envision3D, a novel method for efficiently generating high-quality 3D content from a single image. It is capable of generating high-quality 3D content in terms of texture and geometry, surpassing previous image-to-3D baseline methods.
arXiv Detail & Related papers (2024-03-13T18:46:33Z)
2L3: Lifting Imperfect Generated 2D Images into Accurate 3D [16.66666619143761]
Multi-view (MV) 3D reconstruction is a promising solution to fuse generated MV images into consistent 3D objects. However, the generated images usually suffer from inconsistent lighting, misaligned geometry, and sparse views, leading to poor reconstruction quality. We present a novel 3D reconstruction framework that leverages intrinsic decomposition guidance, transient-mono prior guidance, and view augmentation to cope with the three issues.
arXiv Detail & Related papers (2024-01-29T02:30:31Z)
PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection [26.03582038710992]
Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world. We propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects.
arXiv Detail & Related papers (2023-03-14T17:58:03Z)
High-fidelity 3D GAN Inversion by Pseudo-multi-view Optimization [51.878078860524795]
We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views. Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content.
arXiv Detail & Related papers (2022-11-28T18:59:52Z)
Vision Transformer for NeRF-Based View Synthesis from a Single Input Image [49.956005709863355]
We propose to leverage both the global and local features to form an expressive 3D representation. To synthesize a novel view, we train a multilayer perceptron (MLP) network conditioned on the learned 3D representation to perform volume rendering. Our method can render novel views from only a single input image and generalize across multiple object categories using a single model.
arXiv Detail & Related papers (2022-07-12T17:52:04Z)
PointMCD: Boosting Deep Point Cloud Encoders via Multi-view Cross-modal Distillation for 3D Shape Recognition [55.38462937452363]
We propose a unified multi-view cross-modal distillation architecture, including a pretrained deep image encoder as the teacher and a deep point encoder as the student. By pair-wise aligning multi-view visual and geometric descriptors, we can obtain more powerful deep point encoders without exhausting and complicated network modification.
arXiv Detail & Related papers (2022-07-07T07:23:20Z)
AFNet-M: Adaptive Fusion Network with Masks for 2D+3D Facial Expression Recognition [1.8604727699812171]
2D+3D facial expression recognition (FER) can effectively cope with illumination changes and pose variations. Most deep learning-based approaches employ the simple fusion strategy. We propose the adaptive fusion network with masks (AFNet-M) for 2D+3D FER.
arXiv Detail & Related papers (2022-05-24T04:56:55Z)
VPFusion: Joint 3D Volume and Pixel-Aligned Feature Fusion for Single and Multi-view 3D Reconstruction [23.21446438011893]
VPFusionattains high-quality reconstruction using both - 3D feature volume to capture 3D-structure-aware context. Existing approaches use RNN, feature pooling, or attention computed independently in each view for multi-view fusion. We show improved multi-view feature fusion by establishing transformer-based pairwise view association.
arXiv Detail & Related papers (2022-03-14T23:30:58Z)
VoRTX: Volumetric 3D Reconstruction With Transformers for Voxelwise View Selection and Fusion [68.68537312256144]
VoRTX is an end-to-end volumetric 3D reconstruction network using transformers for wide-baseline, multi-view feature fusion. We train our model on ScanNet and show that it produces better reconstructions than state-of-the-art methods.
arXiv Detail & Related papers (2021-12-01T02:18:11Z)
Direct Multi-view Multi-person 3D Pose Estimation [138.48139701871213]
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. We show experimentally that our MvP model outperforms the state-of-the-art methods on several benchmarks while being much more efficient.
arXiv Detail & Related papers (2021-11-07T13:09:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.