GRAPE: Generalizable and Robust Multi-view Facial Capture
- URL: http://arxiv.org/abs/2407.10193v1
- Date: Sun, 14 Jul 2024 13:24:17 GMT
- Title: GRAPE: Generalizable and Robust Multi-view Facial Capture
- Authors: Jing Li, Di Kang, Zhenyu He
- Abstract summary: Deep learning-based multi-view facial capture methods have shown impressive accuracy while being several orders of magnitude faster than a traditional mesh registration pipeline.
In this study, we aim to improve the generalization ability so that a trained model can be readily used for inference (i.e. capture new data) on a different camera array.
Experiments on the FaMoS and FaceScape datasets demonstrate the effectiveness of the proposed method.
- Score: 12.255610707737548
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Deep learning-based multi-view facial capture methods have shown impressive accuracy while being several orders of magnitude faster than a traditional mesh registration pipeline. However, existing systems (e.g. TEMPEH) are strictly restricted to inference on data captured by the same camera array used to capture their training data. In this study, we aim to improve the generalization ability so that a trained model can be readily used for inference (i.e. capture new data) on a different camera array. To this end, we propose a more generalizable initialization module to extract a camera array-agnostic 3D feature, including a visual hull-based head localization and a visibility-aware 3D feature aggregation module enabled by the visual hull. In addition, we propose an "update-by-disagreement" learning strategy to better handle data noise (e.g. inaccurate registration, scan noise) by discarding potentially inaccurate supervision signals during training. The resulting generalizable and robust topologically consistent multi-view facial capture system (GRAPE) can readily be used to capture data on a different camera array, greatly reducing the effort of data collection and processing. Experiments on the FaMoS and FaceScape datasets demonstrate the effectiveness of the proposed method.
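The abstract leaves the mechanics of "update-by-disagreement" open, so the following is a minimal sketch of one plausible reading, assuming two co-trained prediction heads whose per-vertex losses are back-propagated only where the heads disagree; vertices that both heads already fit (possibly by memorizing noisy registrations) stop driving updates. The function name, the threshold `tau`, and the two-head setup are illustrative assumptions, not the paper's implementation.

```python
import torch

def update_by_disagreement(net_a, net_b, opt_a, opt_b, feats, target, tau=1e-3):
    # Two co-trained heads predict the same registered mesh vertices (B, V, 3).
    pred_a, pred_b = net_a(feats), net_b(feats)
    # Keep supervision only where the heads disagree; the comparison op
    # detaches the mask from the graph, so no gradient flows through it.
    keep = ((pred_a - pred_b).norm(dim=-1) > tau).float().unsqueeze(-1)
    denom = keep.sum().clamp(min=1.0)
    loss_a = ((pred_a - target) ** 2 * keep).sum() / denom
    loss_b = ((pred_b - target) ** 2 * keep).sum() / denom
    opt_a.zero_grad(); opt_b.zero_grad()
    (loss_a + loss_b).backward()
    opt_a.step(); opt_b.step()
    return loss_a.item(), loss_b.item()
```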
Related papers
- DiffCalib: Reformulating Monocular Camera Calibration as Diffusion-Based Dense Incident Map Generation [13.772897737616649]
We leverage the comprehensive visual knowledge embedded in pre-trained diffusion models to enable more robust and accurate monocular camera intrinsic estimation.
Our model achieves state-of-the-art performance, gaining up to a 40% reduction in prediction errors.
arXiv Detail & Related papers (2024-05-24T15:05:04Z)
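A note on the incident-map formulation above: under a pinhole model each pixel's incident ray satisfies d ∝ K⁻¹(u, v, 1)ᵀ, so once a dense ray map is predicted, the intrinsics follow from linear least squares. The sketch below shows only that standard readout under an assumed undistorted pinhole camera, not DiffCalib's diffusion-based generation of the map itself; the function name is hypothetical.

```python
import numpy as np

def intrinsics_from_incident_map(rays):
    """rays: (H, W, 3) unit ray directions in camera coordinates, z > 0.
    Fits fx, fy, cx, cy of a pinhole model via u = fx * (x/z) + cx and
    v = fy * (y/z) + cy, each a linear least-squares problem."""
    H, W, _ = rays.shape
    v, u = np.mgrid[0:H, 0:W].astype(np.float64)
    x, y, z = rays[..., 0], rays[..., 1], rays[..., 2]
    ax = np.stack([(x / z).ravel(), np.ones(H * W)], axis=1)
    fx, cx = np.linalg.lstsq(ax, u.ravel(), rcond=None)[0]
    ay = np.stack([(y / z).ravel(), np.ones(H * W)], axis=1)
    fy, cy = np.linalg.lstsq(ay, v.ravel(), rcond=None)[0]
    return fx, fy, cx, cy
```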
- Reconstructing Close Human Interactions from Multiple Views [38.924950289788804]
This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras.
We introduce a novel system to address these challenges.
Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies.
arXiv Detail & Related papers (2024-01-29T14:08:02Z)
- FoVA-Depth: Field-of-View Agnostic Depth Estimation for Cross-Dataset Generalization [57.98448472585241]
We propose a method to train a stereo depth estimation model on widely available pinhole data.
We show strong generalization ability of our approach on both indoor and outdoor datasets.
arXiv Detail & Related papers (2024-01-24T20:07:59Z)
- Instant Multi-View Head Capture through Learnable Registration [62.70443641907766]
Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow.
We introduce TEMPEH to directly infer 3D heads in dense correspondence from calibrated multi-view images.
Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art.
arXiv Detail & Related papers (2023-06-12T21:45:18Z)
- Multi-View Object Pose Refinement With Differentiable Renderer [22.040014384283378]
This paper introduces a novel multi-view 6 DoF object pose refinement approach focusing on improving methods trained on synthetic data.
It is based on the DPOD detector, which produces dense 2D-3D correspondences between the model vertices and the image pixels in each frame.
We report excellent performance compared to state-of-the-art methods trained on synthetic and real data.
arXiv Detail & Related papers (2022-07-06T17:02:22Z)
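Dense 2D-3D correspondences of the kind DPOD produces are conventionally turned into an initial 6-DoF pose with RANSAC-PnP before any refinement. Below is a minimal sketch of that standard step with OpenCV; the paper's actual contribution, refinement through a differentiable renderer, is not shown, and the function name is illustrative.

```python
import numpy as np
import cv2

def pose_from_dense_correspondences(obj_pts, img_pts, K):
    """obj_pts: (N, 3) model vertices; img_pts: (N, 2) matched pixels;
    K: (3, 3) camera intrinsics. Returns a robust initial 6-DoF pose."""
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        obj_pts.astype(np.float64), img_pts.astype(np.float64),
        K.astype(np.float64), distCoeffs=None,
        reprojectionError=3.0, flags=cv2.SOLVEPNP_EPNP)
    if not ok:
        raise RuntimeError("PnP did not converge")
    R, _ = cv2.Rodrigues(rvec)  # axis-angle -> rotation matrix
    return R, tvec.reshape(3), inliers
```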
- A High-Accuracy Unsupervised Person Re-identification Method Using Auxiliary Information Mined from Datasets [53.047542904329866]
We make use of auxiliary information mined from datasets for multi-modal feature learning.
This paper proposes three effective training tricks: Restricted Label Smoothing Cross Entropy Loss (RLSCE), Weight Adaptive Triplet Loss (WATL), and Dynamic Training Iterations (DTI).
arXiv Detail & Related papers (2022-05-06T10:16:18Z)
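Of the three tricks above, the abstract gives only the names; the "restricted" modification in RLSCE is not specified, so the sketch below is the standard label-smoothing cross entropy that such a variant would presumably build on. `eps` and the function name are assumptions.

```python
import torch.nn.functional as F

def label_smoothing_ce(logits, target, eps=0.1):
    """Cross entropy against the smoothed target distribution
    q = (1 - eps) * one_hot(target) + eps / num_classes."""
    logp = F.log_softmax(logits, dim=-1)
    nll = -logp.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return ((1 - eps) * nll - eps * logp.mean(dim=-1)).mean()
```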
- DeepMultiCap: Performance Capture of Multiple Characters Using Sparse Multiview Cameras [63.186486240525554]
DeepMultiCap is a novel method for multi-person performance capture using sparse multi-view cameras.
Our method can capture time-varying surface details without the need for pre-scanned template models.
arXiv Detail & Related papers (2021-05-01T14:32:13Z)
- Learning Monocular Dense Depth from Events [53.078665310545745]
Event cameras produce a stream of asynchronous events encoding brightness changes, instead of intensity frames.
Learning-based approaches have recently been applied to event-based data for tasks such as monocular depth prediction.
We propose a recurrent architecture to solve this task and show significant improvement over standard feed-forward methods.
arXiv Detail & Related papers (2020-10-16T12:36:23Z)
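The recurrence matters in the paper above because a single event window is too sparse to constrain depth on its own; a hidden state integrates evidence across windows. Below is a deliberately small convolutional-GRU sketch of that idea, assuming event streams pre-binned into voxel grids with `bins` temporal channels; it is not the paper's architecture, and all layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    """Minimal convolutional GRU cell over (B, C, H, W) feature maps."""
    def __init__(self, ch_in, ch_hid):
        super().__init__()
        self.zr = nn.Conv2d(ch_in + ch_hid, 2 * ch_hid, 3, padding=1)
        self.hc = nn.Conv2d(ch_in + ch_hid, ch_hid, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.zr(torch.cat([x, h], 1))).chunk(2, 1)
        cand = torch.tanh(self.hc(torch.cat([x, r * h], 1)))
        return (1 - z) * h + z * cand

class RecurrentDepth(nn.Module):
    """Feed one event-voxel window at a time; the hidden state carries
    scene information between windows."""
    def __init__(self, bins=5, hidden=32):
        super().__init__()
        self.hidden = hidden
        self.cell = ConvGRUCell(bins, hidden)
        self.head = nn.Conv2d(hidden, 1, 1)  # per-pixel (log-)depth

    def forward(self, voxels, h=None):  # voxels: (B, bins, H, W)
        b, _, H, W = voxels.shape
        if h is None:
            h = voxels.new_zeros(b, self.hidden, H, W)
        h = self.cell(voxels, h)
        return self.head(h), h
```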
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2D detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
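For contrast with the learned, camera-disentangled fusion in the paper above, the classical way to lift calibrated per-view 2D detections to 3D is direct linear triangulation. A minimal sketch of that baseline for a single joint (not the paper's method):

```python
import numpy as np

def triangulate_dlt(points_2d, proj_mats):
    """points_2d: (C, 2) pixel detections of one joint across C views;
    proj_mats: (C, 3, 4) camera projection matrices. Solves the DLT
    system A X = 0 for the homogeneous 3D point via SVD."""
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    _, _, vt = np.linalg.svd(np.stack(rows))
    X = vt[-1]
    return X[:3] / X[3]  # homogeneous -> Euclidean
```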
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the list (including all information) and is not responsible for any consequences of its use.