BANMo: Building Animatable 3D Neural Models from Many Casual Videos
- URL: http://arxiv.org/abs/2112.12761v3
- Date: Mon, 3 Apr 2023 13:57:31 GMT
- Title: BANMo: Building Animatable 3D Neural Models from Many Casual Videos
- Authors: Gengshan Yang, Minh Vo, Natalia Neverova, Deva Ramanan, Andrea
Vedaldi, Hanbyul Joo
- Abstract summary: We present BANMo, a method that requires neither a specialized sensor nor a pre-defined template shape.
BANMo builds high-fidelity, articulated 3D models from many monocular casual videos in a differentiable rendering framework.
On real and synthetic datasets, BANMo shows higher-fidelity 3D reconstructions than prior works for humans and animals.
- Score: 135.64291166057373
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prior work for articulated 3D shape reconstruction often relies on
specialized sensors (e.g., synchronized multi-camera systems), or pre-built 3D
deformable models (e.g., SMAL or SMPL). Such methods are not able to scale to
diverse sets of objects in the wild. We present BANMo, a method that requires
neither a specialized sensor nor a pre-defined template shape. BANMo builds
high-fidelity, articulated 3D models (including shape and animatable skinning
weights) from many monocular casual videos in a differentiable rendering
framework. While the use of many videos provides more coverage of camera views
and object articulations, it also introduces significant challenges in establishing
correspondence across scenes with different backgrounds, illumination
conditions, etc. Our key insight is to merge three schools of thought: (1)
classic deformable shape models that make use of articulated bones and blend
skinning, (2) volumetric neural radiance fields (NeRFs) that are amenable to
gradient-based optimization, and (3) canonical embeddings that generate
correspondences between pixels and an articulated model. We introduce neural
blend skinning models that allow for differentiable and invertible articulated
deformations. When combined with canonical embeddings, such models allow us to
establish dense correspondences across videos that can be self-supervised with
cycle consistency. On real and synthetic datasets, BANMo shows higher-fidelity
3D reconstructions than prior works for humans and animals, with the ability to
render realistic images from novel viewpoints and poses. Project webpage:
banmo-www.github.io .
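The abstract's "neural blend skinning" can be pictured as linear blend skinning in which the per-point skinning weights come from a small network defined over the canonical shape. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, not BANMo's actual model: the bone parametrization, the weight MLP, and the forward-only warp are assumptions made for clarity.

```python
import torch
import torch.nn as nn

class NeuralBlendSkinningSketch(nn.Module):
    """Illustrative linear blend skinning with MLP-predicted weights.

    Assumption: bones are given as per-bone rotations/translations and
    skinning weights are predicted from canonical 3D coordinates.
    This is not BANMo's exact formulation.
    """

    def __init__(self, num_bones: int = 25, hidden: int = 64):
        super().__init__()
        self.weight_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, num_bones),
        )

    def forward(self, x_canonical, rot, trans):
        """Warp canonical points into a posed frame.

        x_canonical: (N, 3) points in the canonical space
        rot:         (B, 3, 3) per-bone rotation matrices
        trans:       (B, 3) per-bone translations
        """
        # Soft skinning weights: one distribution over bones per point.
        w = torch.softmax(self.weight_mlp(x_canonical), dim=-1)        # (N, B)
        # Apply every bone transform to every point.
        x_per_bone = torch.einsum('bij,nj->nbi', rot, x_canonical) + trans  # (N, B, 3)
        # Blend the per-bone results with the skinning weights.
        return (w.unsqueeze(-1) * x_per_bone).sum(dim=1)                # (N, 3)
```

The abstract also stresses that the deformations are invertible; that would additionally require a matching backward (posed-to-canonical) warp, which this forward-only sketch omits.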
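Similarly, the self-supervised dense correspondences can be read as a 2D→3D→2D cycle: a pixel's feature is softly matched to canonical 3D points through the canonical embedding, and the matched point, once articulated into the frame and projected, should land back on the query pixel. The function below is only a sketch under that reading; its names, inputs, and soft-argmax matching are illustrative assumptions rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def cycle_consistency_loss(pixel_feat, pixel_xy, canon_pts, canon_emb,
                           warp_fwd, project, temperature=0.1):
    """Hypothetical 2D -> 3D -> 2D cycle-consistency loss (not BANMo's API).

    pixel_feat: (P, C) image features at query pixels
    pixel_xy:   (P, 2) query pixel coordinates
    canon_pts:  (K, 3) 3D points sampled in the canonical space
    canon_emb:  (K, C) canonical embeddings of those points
    warp_fwd:   callable mapping canonical 3D points into the posed frame
    project:    callable mapping posed 3D points to 2D pixel coordinates
    """
    # Soft 2D->3D match: each pixel attends to canonical points by
    # embedding similarity.
    sim = pixel_feat @ canon_emb.t() / temperature          # (P, K)
    attn = torch.softmax(sim, dim=-1)
    matched_canon = attn @ canon_pts                        # (P, 3) expected match

    # 3D->2D: articulate the matched canonical point into this frame
    # and project it back to the image plane.
    reprojected = project(warp_fwd(matched_canon))          # (P, 2)

    # Cycle consistency: the reprojection should land on the query pixel.
    return F.mse_loss(reprojected, pixel_xy)
```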
Related papers
- DistillNeRF: Perceiving 3D Scenes from Single-Glance Images by Distilling Neural Fields and Foundation Model Features [65.8738034806085]
DistillNeRF is a self-supervised learning framework for understanding 3D environments in autonomous driving scenes.
Our method is a generalizable feedforward model that predicts a rich neural scene representation from sparse, single-frame multi-view camera inputs.
arXiv Detail & Related papers (2024-06-17T21:15:13Z)
- PonderV2: Pave the Way for 3D Foundation Model with A Universal Pre-training Paradigm [114.47216525866435]
We introduce a novel universal 3D pre-training framework designed to facilitate the acquisition of efficient 3D representations.
For the first time, PonderV2 achieves state-of-the-art performance on 11 indoor and outdoor benchmarks, demonstrating its effectiveness.
arXiv Detail & Related papers (2023-10-12T17:59:57Z)
- Reconstructing Animatable Categories from Videos [65.14948977749269]
Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging.
We present RAC, which builds category-level 3D models from monocular videos while disentangling variation across instances from motion over time.
We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
arXiv Detail & Related papers (2023-05-10T17:56:21Z)
- MoDA: Modeling Deformable 3D Objects from Casual Videos [84.29654142118018]
We propose neural dual quaternion blend skinning (NeuDBS) to achieve 3D point deformation without skin-collapsing artifacts.
To register 2D pixels across different frames, we establish correspondences via canonical feature embeddings that encode 3D points in the canonical space.
Our approach can reconstruct 3D models for humans and animals with better qualitative and quantitative performance than state-of-the-art methods.
arXiv Detail & Related papers (2023-04-17T13:49:04Z)
- Structured 3D Features for Reconstructing Controllable Avatars [43.36074729431982]
We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface.
We show that our S3F model surpasses the previous state-of-the-art on various tasks, including monocular 3D reconstruction, as well as albedo and shading estimation.
arXiv Detail & Related papers (2022-12-13T18:57:33Z)
- Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images [94.49117671450531]
State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis.
In this paper, we design a 3D GAN which can learn a disentangled model of objects, just from monocular observations.
arXiv Detail & Related papers (2022-03-29T22:03:18Z)