KeypointNeRF: Generalizing Image-based Volumetric Avatars using Relative Spatial Encoding of Keypoints
- URL: http://arxiv.org/abs/2205.04992v1
- Date: Tue, 10 May 2022 15:57:03 GMT
- Title: KeypointNeRF: Generalizing Image-based Volumetric Avatars using Relative Spatial Encoding of Keypoints
- Authors: Marko Mihajlovic, Aayush Bansal, Michael Zollhoefer, Siyu Tang, Shunsuke Saito
- Abstract summary: We propose a highly effective approach to modeling high-fidelity volumetric avatars from sparse views.
One of the key ideas is to encode relative spatial 3D information via sparse 3D keypoints.
Our experiments show that a majority of errors in prior work stem from an inappropriate choice of spatial encoding.
- Score: 28.234772596912165
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Image-based volumetric avatars using pixel-aligned features promise
generalization to unseen poses and identities. Prior work leverages global
spatial encodings and multi-view geometric consistency to reduce spatial
ambiguity. However, global encodings often suffer from overfitting to the
distribution of the training data, and it is difficult to learn multi-view
consistent reconstruction from sparse views. In this work, we investigate
common issues with existing spatial encodings and propose a simple yet highly
effective approach to modeling high-fidelity volumetric avatars from sparse
views. One of the key ideas is to encode relative spatial 3D information via
sparse 3D keypoints. This approach is robust to the sparsity of viewpoints and
cross-dataset domain gap. Our approach outperforms state-of-the-art methods for
head reconstruction. On human body reconstruction for unseen subjects, we also
achieve performance comparable to prior work that uses a parametric human body
model and temporal feature aggregation. Our experiments show that a majority of
errors in prior work stem from an inappropriate choice of spatial encoding and
thus we suggest a new direction for high-fidelity image-based avatar modeling.
https://markomih.github.io/KeypointNeRF
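As a rough illustration of the relative spatial encoding idea, the sketch below describes a 3D query point by its depth differences to sparse keypoints along a camera axis, expanded with a sinusoidal encoding; the function names, keypoint count, and exact encoding are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Sinusoidal encoding of scalars: (K,) -> (K, 2 * num_freqs)."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi
    angles = x[:, None] * freqs[None, :]
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def relative_keypoint_encoding(query, keypoints, cam_axis, num_freqs=4):
    """Describe a 3D query point by its depth difference to each sparse
    keypoint along the camera viewing axis, rather than by its global
    coordinates (which tend to overfit the training distribution)."""
    rel_depth = (query[None, :] - keypoints) @ cam_axis   # (K,)
    return positional_encoding(rel_depth, num_freqs).reshape(-1)

kps = np.random.randn(13, 3)              # e.g., 13 sparse 3D keypoints
query = np.array([0.1, 0.2, 0.5])         # point sampled along a camera ray
cam_axis = np.array([0.0, 0.0, 1.0])      # unit viewing direction
feat = relative_keypoint_encoding(query, kps, cam_axis)
print(feat.shape)                          # (104,) = 13 keypoints * 8
```

Because these features depend only on offsets to the keypoints, they are invariant to the global placement of the subject, which is the property the abstract credits for robustness to sparse viewpoints and cross-dataset domain gaps.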
Related papers
- StackFLOW: Monocular Human-Object Reconstruction by Stacked Normalizing Flow with Offset [56.71580976007712]
We propose to represent the human-object spatial relation with Human-Object Offsets between anchor points densely sampled from the surfaces of the human mesh and the object mesh.
Based on this representation, we propose Stacked Normalizing Flow (StackFLOW) to infer the posterior distribution of human-object spatial relations from the image.
During the optimization stage, we finetune the human body pose and object 6D pose by maximizing the likelihood of samples.
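A minimal sketch of the offset representation described above, assuming anchors are approximated by uniformly sampled mesh vertices; the names and anchor count are hypothetical, and the stacked normalizing-flow posterior is omitted.

```python
import numpy as np

def sample_anchors(vertices, num_anchors, rng):
    """Pick anchor points by uniformly sampling mesh vertices (a stand-in
    for dense surface sampling)."""
    idx = rng.choice(len(vertices), size=num_anchors, replace=False)
    return vertices[idx]

def human_object_offsets(human_verts, object_verts, num_anchors=8, rng=None):
    """Flattened offset vectors between all human/object anchor pairs,
    a simple stand-in for the paper's human-object offset representation."""
    rng = rng or np.random.default_rng(0)
    h = sample_anchors(human_verts, num_anchors, rng)
    o = sample_anchors(object_verts, num_anchors, rng)
    return (o[None, :, :] - h[:, None, :]).reshape(-1)  # (num_anchors^2 * 3,)

human = np.random.randn(6890, 3)   # e.g., an SMPL-sized vertex set
obj = np.random.randn(500, 3)
print(human_object_offsets(human, obj).shape)   # (192,)
```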
arXiv Detail & Related papers (2024-07-30T04:57:21Z)
- Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks [53.67497327319569]
We introduce a novel neural rendering technique to solve image-to-3D from a single view.
Our approach employs the signed distance function as the surface representation and incorporates generalizable priors through geometry-encoding volumes and HyperNetworks.
Our experiments show the advantages of our proposed approach with consistent results and rapid generation.
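The sketch below shows one common way a hypernetwork can emit the weights of a small signed-distance MLP from a conditioning feature; the layer sizes and names are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class HyperSDF(nn.Module):
    """A conditioning feature is mapped by a hypernetwork to the weights
    of a small SDF MLP, so the surface network adapts per input."""
    def __init__(self, cond_dim=128, hidden=64):
        super().__init__()
        self.hidden = hidden
        # Hypernetwork emits weights + biases for a two-layer SDF head.
        n_params = (3 * hidden + hidden) + (hidden * 1 + 1)
        self.hyper = nn.Sequential(nn.Linear(cond_dim, 256), nn.ReLU(),
                                   nn.Linear(256, n_params))

    def forward(self, points, cond):
        # points: (N, 3), cond: (cond_dim,)
        p = self.hyper(cond)
        h = self.hidden
        w1, p = p[:3 * h].view(h, 3), p[3 * h:]
        b1, p = p[:h], p[h:]
        w2, b2 = p[:h].view(1, h), p[h:]
        x = torch.relu(points @ w1.t() + b1)
        return x @ w2.t() + b2          # (N, 1) signed distances

model = HyperSDF()
sdf = model(torch.randn(1000, 3), torch.randn(128))
print(sdf.shape)   # torch.Size([1000, 1])
```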
arXiv Detail & Related papers (2023-12-24T08:42:37Z)
- InvertAvatar: Incremental GAN Inversion for Generalized Head Avatars [40.10906393484584]
We propose a novel framework that improves avatar reconstruction fidelity by incrementally incorporating information from multiple frames.
Our architecture emphasizes pixel-aligned image-to-image translation, avoiding the need to learn correspondences between observation and canonical spaces.
The proposed paradigm demonstrates state-of-the-art performance on one-shot and few-shot avatar animation tasks.
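As a generic stand-in for incremental inversion, the sketch below refines a single latent code frame by frame with an optimization-based reconstruction loss; the toy generator, dimensions, and step counts are assumptions, and the paper's pixel-aligned image-to-image architecture is not modeled.

```python
import torch
import torch.nn as nn

def incremental_inversion(frames, generator, z_dim=64, steps=20, lr=1e-2):
    """Refine one latent code over a sequence of frames: each frame adds
    a reconstruction term, so later frames incrementally improve fidelity.
    (A generic optimization-based stand-in, not the paper's method.)"""
    z = torch.zeros(z_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for frame in frames:                  # frames arrive one by one
        for _ in range(steps):            # a few refinement steps per frame
            opt.zero_grad()
            loss = torch.mean((generator(z) - frame) ** 2)
            loss.backward()
            opt.step()
    return z.detach()

# Toy generator standing in for a pretrained avatar GAN.
gen = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 3 * 16 * 16))
frames = [torch.randn(3 * 16 * 16) for _ in range(4)]
z_star = incremental_inversion(frames, gen)
print(z_star.shape)   # torch.Size([64])
```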
arXiv Detail & Related papers (2023-12-03T18:59:15Z) - HQ3DAvatar: High Quality Controllable 3D Head Avatar [65.70885416855782]
This paper presents a novel approach to building highly photorealistic digital head avatars.
Our method learns a canonical space via an implicit function parameterized by a neural network.
At test time, our method is driven by a monocular RGB video.
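A minimal sketch of a canonical implicit field: an MLP maps a canonical 3D point plus a per-frame driving code (standing in for features extracted from the monocular RGB video) to color and density. All sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CanonicalField(nn.Module):
    """Implicit canonical head representation: an MLP maps a canonical
    3D point plus a per-frame driving code to color and density."""
    def __init__(self, code_dim=32):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + code_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 4))           # RGB + density

    def forward(self, x, code):
        # x: (N, 3) canonical points, code: (code_dim,)
        inp = torch.cat([x, code.expand(x.shape[0], -1)], dim=-1)
        out = self.mlp(inp)
        rgb, sigma = torch.sigmoid(out[:, :3]), torch.relu(out[:, 3:])
        return rgb, sigma

field = CanonicalField()
rgb, sigma = field(torch.rand(512, 3), torch.randn(32))
print(rgb.shape, sigma.shape)   # (512, 3) (512, 1)
```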
arXiv Detail & Related papers (2023-03-25T13:56:33Z)
- LiP-Flow: Learning Inference-time Priors for Codec Avatars via Normalizing Flows in Latent Space [90.74976459491303]
We introduce a prior model that is conditioned on the runtime inputs and tie this prior space to the 3D face model via a normalizing flow in the latent space.
A normalizing flow bridges the two representation spaces and transforms latent samples from one domain to another, allowing us to define a latent likelihood objective.
We show that our approach leads to an expressive and effective prior, capturing facial dynamics and subtle expressions better.
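The sketch below shows the change-of-variables mechanics behind such a latent likelihood objective, using a single affine flow step for brevity; real models stack coupling layers, and the variable names are assumptions.

```python
import torch
import torch.nn as nn

class AffineFlow(nn.Module):
    """A single invertible affine transform z' = z * exp(s) + t, the
    simplest normalizing-flow step; real models stack coupling layers."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def forward(self, z):
        return z * torch.exp(self.s) + self.t, self.s.sum()  # z', log|det J|

def latent_log_likelihood(z_posterior, flow, prior):
    """Map a posterior latent into the prior space with the flow and
    score it under the prior via the change-of-variables formula."""
    z_prior, log_det = flow(z_posterior)
    return prior.log_prob(z_prior).sum(-1) + log_det

flow = AffineFlow(dim=16)
prior = torch.distributions.Normal(0.0, 1.0)
z = torch.randn(8, 16)                      # batch of posterior samples
print(latent_log_likelihood(z, flow, prior).shape)   # torch.Size([8])
```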
arXiv Detail & Related papers (2022-03-15T13:22:57Z)
- DECA: Deep viewpoint-Equivariant human pose estimation using Capsule Autoencoders [3.2826250607043796]
We show that current 3D Human Pose Estimation methods tend to fail when dealing with viewpoints unseen at training time.
We propose a novel capsule autoencoder network with fast Variational Bayes capsule routing, named DECA.
In the experimental validation, we outperform other methods on depth images from both seen and unseen viewpoints, including top and front views.
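For orientation, the sketch below implements classic routing-by-agreement between capsule layers (Sabour et al.); note this is the simpler dynamic-routing scheme, not the fast Variational Bayes routing that DECA proposes.

```python
import torch
import torch.nn.functional as F

def squash(v, dim=-1):
    """Capsule non-linearity: keeps orientation, squashes length to [0,1)."""
    n2 = (v ** 2).sum(dim, keepdim=True)
    return (n2 / (1 + n2)) * v / torch.sqrt(n2 + 1e-8)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement between capsule layers (Sabour et al.).
    u_hat: (B, num_in, num_out, dim) prediction vectors."""
    b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits
    for _ in range(iters):
        c = F.softmax(b, dim=2)                              # coupling coeffs
        s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # (B, num_out, dim)
        v = squash(s)
        b = b + (u_hat * v.unsqueeze(1)).sum(-1)             # agreement update
    return v

u_hat = torch.randn(2, 32, 10, 16)   # 32 input capsules -> 10 output capsules
print(dynamic_routing(u_hat).shape)  # torch.Size([2, 10, 16])
```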
arXiv Detail & Related papers (2021-08-19T08:46:15Z)
- PVA: Pixel-aligned Volumetric Avatars [34.929560973779466]
We devise a novel approach for predicting volumetric avatars of the human head given just a small number of inputs.
Our approach is trained in an end-to-end manner solely based on a photometric re-rendering loss without requiring explicit 3D supervision.
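A minimal sketch of pixel-aligned feature sampling, the mechanism named in the title: 3D points are projected into an input view and its feature map is sampled bilinearly. The camera parameters and shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_aligned_features(points, feat_map, K, Rt):
    """Project 3D points into an input view and bilinearly sample its
    feature map, giving each point a pixel-aligned feature.
    points: (N, 3)  feat_map: (C, H, W)  K: (3, 3) intrinsics
    Rt: (3, 4) world-to-camera extrinsics."""
    homo = torch.cat([points, torch.ones_like(points[:, :1])], dim=-1)
    cam = homo @ Rt.t()                        # (N, 3) camera-space points
    uv = cam @ K.t()                           # perspective projection
    uv = uv[:, :2] / uv[:, 2:3]                # pixel coordinates
    H, W = feat_map.shape[1:]
    grid = torch.stack([2 * uv[:, 0] / (W - 1) - 1,     # map to [-1, 1]
                        2 * uv[:, 1] / (H - 1) - 1], dim=-1)
    sampled = F.grid_sample(feat_map[None], grid[None, None],
                            align_corners=True)          # (1, C, 1, N)
    return sampled[0, :, 0].t()                          # (N, C)

feat = torch.randn(32, 64, 64)
K = torch.tensor([[60., 0., 32.], [0., 60., 32.], [0., 0., 1.]])
Rt = torch.cat([torch.eye(3), torch.tensor([[0.], [0.], [2.]])], dim=1)
pts = torch.randn(100, 3) * 0.2
print(pixel_aligned_features(pts, feat, K, Rt).shape)   # (100, 32)
```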
arXiv Detail & Related papers (2021-01-07T18:58:46Z)
- Coherent Reconstruction of Multiple Humans from a Single Image [68.3319089392548]
In this work, we address the problem of multi-person 3D pose estimation from a single image.
A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently, which can yield an incoherent scene, e.g., people who overlap in 3D.
Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene.
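As a toy illustration of such a coherence constraint, the sketch below penalizes overlap between people approximated as bounding spheres; the actual paper operates on meshes (e.g., interpenetration and depth-ordering terms), so this is only a schematic stand-in.

```python
import torch

def interpenetration_penalty(centers, radii):
    """Penalize overlap between people approximated as bounding spheres.
    centers: (P, 3), radii: (P,)"""
    d = torch.cdist(centers, centers)                   # pairwise distances
    min_d = radii[:, None] + radii[None, :]             # allowed separation
    overlap = torch.relu(min_d - d)                     # >0 when colliding
    overlap.fill_diagonal_(0)                           # ignore self-pairs
    return overlap.sum() / 2                            # count each pair once

centers = torch.tensor([[0., 0., 2.], [0.3, 0., 2.], [2., 0., 3.]])
radii = torch.tensor([0.4, 0.4, 0.4])
print(interpenetration_penalty(centers, radii))   # >0: first two overlap
```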
arXiv Detail & Related papers (2020-06-15T17:51:45Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently-used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
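A short sketch of the receptive-field idea behind dense dilated combinations: parallel 3x3 convolutions with growing dilation rates are fused so one block aggregates context at several scales. Channel counts and dilation rates are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn

class DilatedFusionBlock(nn.Module):
    """Combine parallel dilated convolutions so one block sees several
    receptive-field sizes, then fuse the branches; a simplified take on
    dense combinations of dilated convolutions."""
    def __init__(self, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, 3, padding=d, dilation=d)
            for d in dilations)
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        feats = [torch.relu(b(x)) for b in self.branches]
        return self.fuse(torch.cat(feats, dim=1)) + x   # residual fusion

block = DilatedFusionBlock()
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)   # torch.Size([1, 64, 32, 32])
```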
arXiv Detail & Related papers (2020-02-07T03:45:25Z)