DECA: Deep viewpoint-Equivariant human pose estimation using Capsule
Autoencoders
- URL: http://arxiv.org/abs/2108.08557v1
- Date: Thu, 19 Aug 2021 08:46:15 GMT
- Title: DECA: Deep viewpoint-Equivariant human pose estimation using Capsule
Autoencoders
- Authors: Nicola Garau, Niccol\`o Bisagno, Piotr Br\'odka, Nicola Conci
- Abstract summary: We show that current 3D Human Pose Estimation methods tend to fail when dealing with viewpoints unseen at training time.
We propose a novel capsule autoencoder network with fast Variational Bayes capsule routing, named DECA.
In the experimental validation, we outperform other methods on depth images from both seen and unseen viewpoints, both top-view, and front-view.
- Score: 3.2826250607043796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human Pose Estimation (HPE) aims at retrieving the 3D position of human
joints from images or videos. We show that current 3D HPE methods suffer a lack
of viewpoint equivariance, namely they tend to fail or perform poorly when
dealing with viewpoints unseen at training time. Deep learning methods often
rely on either scale-invariant, translation-invariant, or rotation-invariant
operations, such as max-pooling. However, the adoption of such procedures does
not necessarily improve viewpoint generalization, rather leading to more
data-dependent methods. To tackle this issue, we propose a novel capsule
autoencoder network with fast Variational Bayes capsule routing, named DECA. By
modeling each joint as a capsule entity, combined with the routing algorithm,
our approach can preserve the joints' hierarchical and geometrical structure in
the feature space, independently from the viewpoint. By achieving viewpoint
equivariance, we drastically reduce the network data dependency at training
time, resulting in an improved ability to generalize for unseen viewpoints. In
the experimental validation, we outperform other methods on depth images from
both seen and unseen viewpoints, both top-view, and front-view. In the RGB
domain, the same network gives state-of-the-art results on the challenging
viewpoint transfer task, also establishing a new framework for top-view HPE.
The code can be found at https://github.com/mmlab-cv/DECA.
Related papers
- GLACE: Global Local Accelerated Coordinate Encoding [66.87005863868181]
Scene coordinate regression methods are effective in small-scale scenes but face significant challenges in large-scale scenes.
We propose GLACE, which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network.
Our method achieves state-of-the-art results on large-scale scenes with a low-map-size model.
arXiv Detail & Related papers (2024-06-06T17:59:50Z) - IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images [50.4538089115248]
Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task.
We propose a novel approach, IPoD, which harmonizes implicit field learning with point diffusion.
Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD, achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods.
arXiv Detail & Related papers (2024-03-30T07:17:37Z) - Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z) - Capsules as viewpoint learners for human pose estimation [4.246061945756033]
We show how most neural networks are not able to generalize well when the camera is subject to significant viewpoint changes.
We propose a novel end-to-end viewpoint-equivariant capsule autoencoder that employs a fast Variational Bayes routing and matrix capsules.
We achieve state-of-the-art results for multiple tasks and datasets while retaining other desirable properties.
arXiv Detail & Related papers (2023-02-13T09:01:46Z) - KeypointNeRF: Generalizing Image-based Volumetric Avatars using Relative
Spatial Encoding of Keypoints [28.234772596912165]
We propose a highly effective approach to modeling high-fidelity volumetric avatars from sparse views.
One of the key ideas is to encode relative spatial 3D information via sparse 3D keypoints.
Our experiments show that a majority of errors in prior work stem from an inappropriate choice of spatial encoding.
arXiv Detail & Related papers (2022-05-10T15:57:03Z) - ConDor: Self-Supervised Canonicalization of 3D Pose for Partial Shapes [55.689763519293464]
ConDor is a self-supervised method that learns to canonicalize the 3D orientation and position for full and partial 3D point clouds.
During inference, our method takes an unseen full or partial 3D point cloud at an arbitrary pose and outputs an equivariant canonical pose.
arXiv Detail & Related papers (2022-01-19T18:57:21Z) - Locally Aware Piecewise Transformation Fields for 3D Human Mesh
Registration [67.69257782645789]
We propose piecewise transformation fields that learn 3D translation vectors to map any query point in posed space to its correspond position in rest-pose space.
We show that fitting parametric models with poses by our network results in much better registration quality, especially for extreme poses.
arXiv Detail & Related papers (2021-04-16T15:16:09Z) - Monocular 3D Detection with Geometric Constraints Embedding and
Semi-supervised Training [3.8073142980733]
We propose a novel framework for monocular 3D objects detection using only RGB images, called KM3D-Net.
We design a fully convolutional model to predict object keypoints, dimension, and orientation, and then combine these estimations with perspective geometry constraints to compute position attribute.
arXiv Detail & Related papers (2020-09-02T00:51:51Z) - Coherent Reconstruction of Multiple Humans from a Single Image [68.3319089392548]
In this work, we address the problem of multi-person 3D pose estimation from a single image.
A typical regression approach in the top-down setting of this problem would first detect all humans and then reconstruct each one of them independently.
Our goal is to train a single network that learns to avoid these problems and generate a coherent 3D reconstruction of all the humans in the scene.
arXiv Detail & Related papers (2020-06-15T17:51:45Z) - Towards Better Generalization: Joint Depth-Pose Learning without PoseNet [36.414471128890284]
We tackle the essential problem of scale inconsistency for self-supervised joint depth-pose learning.
Most existing methods assume that a consistent scale of depth and pose can be learned across all input samples.
We propose a novel system that explicitly disentangles scale from the network estimation.
arXiv Detail & Related papers (2020-04-03T00:28:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.