PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound
- URL: http://arxiv.org/abs/2112.00216v2
- Date: Fri, 3 Dec 2021 00:26:50 GMT
- Title: PoseKernelLifter: Metric Lifting of 3D Human Pose using Sound
- Authors: Zhijian Yang, Xiaoran Fan, Volkan Isler, Hyun Soo Park
- Abstract summary: Reconstructing the 3D pose of a person in metric scale from a single view image is a geometrically ill-posed problem.
We show that audio signals recorded along with an image provide complementary information to reconstruct the metric 3D pose of the person.
We design a multi-stage 3D CNN that fuses audio and visual signals and learns to reconstruct 3D pose in a metric scale.
- Score: 34.814669331418884
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Reconstructing the 3D pose of a person in metric scale from a single view
image is a geometrically ill-posed problem. For example, we cannot measure the
exact distance of a person to the camera from a single view image without
additional scene assumptions (e.g., known height). Existing learning based
approaches circumvent this issue by reconstructing the 3D pose up to scale.
However, there are many applications such as virtual telepresence, robotics,
and augmented reality that require metric scale reconstruction. In this paper,
we show that audio signals recorded along with an image provide complementary
information to reconstruct the metric 3D pose of the person.
The key insight is that as the audio signals traverse the 3D space,
their interactions with the body provide metric information about the body's
pose. Based on this insight, we introduce a time-invariant transfer function
called pose kernel -- the impulse response of audio signals induced by the body
pose. The main properties of the pose kernel are that (1) its envelope highly
correlates with 3D pose, (2) the time response corresponds to arrival time,
indicating the metric distance to the microphone, and (3) it is invariant to
changes in the scene geometry configurations. Therefore, it is readily
generalizable to unseen scenes. We design a multi-stage 3D CNN that fuses audio
and visual signals and learns to reconstruct 3D pose in a metric scale. We show
that our multi-modal method produces accurate metric reconstruction in real
world scenes, which is not possible with state-of-the-art lifting approaches
including parametric mesh regression and depth regression.
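The metric cue described above is classical acoustics: a peak at delay tau in an impulse response implies roughly c * tau metres of acoustic travel, with c ≈ 343 m/s in air. A minimal sketch of this arrival-time-to-distance step (illustrative only, not the paper's pose-kernel pipeline; sample rate and signals are hypothetical):

```python
import numpy as np

FS = 48_000   # sample rate in Hz (hypothetical)
C = 343.0     # speed of sound in air, m/s

def metric_distance(tx, rx, fs=FS):
    """Estimate the metric travel distance between an emitted signal `tx`
    and its recorded arrival in `rx` via cross-correlation."""
    corr = np.correlate(rx, tx, mode="full")
    lag = int(np.argmax(corr)) - (len(tx) - 1)  # delay in samples
    return C * lag / fs                         # metres

# Toy check: a noise burst delayed by 280 samples ~ 2.0 m of travel.
rng = np.random.default_rng(0)
tx = rng.standard_normal(2048)
rx = np.concatenate([np.zeros(280), 0.5 * tx])
print(round(metric_distance(tx, rx), 2))  # → 2.0
```

This is exactly why the pose kernel's time response carries scale information that a single image cannot: sample delays convert directly into metres.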
Related papers
- Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos [15.532504015622159]
Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics.
We tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos.
arXiv Detail & Related papers (2024-07-05T09:43:05Z)
- Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences [21.057940424318314]
Given two images, we can estimate the relative camera pose between them by establishing image-to-image correspondences.
We present MicKey, a keypoint matching pipeline that is able to predict metric correspondences in 3D camera space.
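Once correspondences are metric and live in 3D camera space, the relative pose follows in closed form. A hedged sketch using the standard Kabsch/Procrustes solution (not MicKey's learned, correspondence-weighted solver; the point sets are synthetic):

```python
import numpy as np

def kabsch(P, Q):
    """Least-squares rigid transform (R, t) with Q ≈ P @ R.T + t,
    given corresponded 3D point sets P and Q (rows are points)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)                # cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cq - R @ cp
    return R, t

# Demo: recover a synthetic 30-degree rotation about z and a known translation.
rng = np.random.default_rng(1)
P = rng.normal(size=(10, 3))
th = np.pi / 6
R_true = np.array([[np.cos(th), -np.sin(th), 0.0],
                   [np.sin(th),  np.cos(th), 0.0],
                   [0.0,         0.0,        1.0]])
Q = P @ R_true.T + np.array([1.0, 2.0, 3.0])
R_est, t_est = kabsch(P, Q)
print(np.allclose(R_est, R_true) and np.allclose(t_est, [1.0, 2.0, 3.0]))  # → True
```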
arXiv Detail & Related papers (2024-04-09T14:22:50Z)
- DUSt3R: Geometric 3D Vision Made Easy [8.471330244002564]
We introduce DUSt3R, a novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections.
We show that this formulation smoothly unifies the monocular and binocular reconstruction cases.
Our formulation directly provides a 3D model of the scene as well as depth information and, interestingly, we can seamlessly recover from it pixel matches and relative and absolute camera poses.
arXiv Detail & Related papers (2023-12-21T18:52:14Z)
- Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image [85.91935485902708]
We show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models.
We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models.
Our method enables the accurate recovery of metric 3D structures on randomly collected internet images.
arXiv Detail & Related papers (2023-07-20T16:14:23Z)
- IVT: An End-to-End Instance-guided Video Transformer for 3D Pose Estimation [6.270047084514142]
Video 3D human pose estimation aims to localize the 3D coordinates of human joints from videos.
IVT enables learning temporal contextual depth information from visual features and 3D poses directly from video frames.
Experiments on three widely-used 3D pose estimation benchmarks show that the proposed IVT achieves state-of-the-art performances.
arXiv Detail & Related papers (2022-08-06T02:36:33Z)
- Disentangled3D: Learning a 3D Generative Model with Disentangled Geometry and Appearance from Monocular Images [94.49117671450531]
State-of-the-art 3D generative models are GANs which use neural 3D volumetric representations for synthesis.
In this paper, we design a 3D GAN which can learn a disentangled model of objects, just from monocular observations.
arXiv Detail & Related papers (2022-03-29T22:03:18Z)
- VoxelTrack: Multi-Person 3D Human Pose Estimation and Tracking in the Wild [98.69191256693703]
We present VoxelTrack for multi-person 3D pose estimation and tracking from a few cameras which are separated by wide baselines.
It employs a multi-branch network to jointly estimate 3D poses and re-identification (Re-ID) features for all people in the environment.
It outperforms the state-of-the-art methods by a large margin on three public datasets including Shelf, Campus and CMU Panoptic.
arXiv Detail & Related papers (2021-08-05T08:35:44Z)
- SMAP: Single-Shot Multi-Person Absolute 3D Pose Estimation [46.85865451812981]
We propose a novel system that first regresses a set of 2.5D representations of body parts and then reconstructs the 3D absolute poses based on these 2.5D representations with a depth-aware part association algorithm.
Such a single-shot bottom-up scheme allows the system to better learn and reason about the inter-person depth relationship, improving both 3D and 2D pose estimation.
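The final lifting step in such 2.5D pipelines is plain pinhole back-projection: once a joint's absolute depth is resolved, its pixel coordinates map to metric 3D camera coordinates. A minimal sketch under assumed intrinsics (the camera values are hypothetical, not SMAP's):

```python
import numpy as np

# Hypothetical pinhole intrinsics: fx, fy on the diagonal, principal
# point (cx, cy) in the last column.
K = np.array([[1000.0,    0.0, 640.0],
              [   0.0, 1000.0, 360.0],
              [   0.0,    0.0,   1.0]])

def backproject(u, v, z, K):
    """Lift pixel (u, v) with metric depth z to 3D camera coordinates."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * z / fx, (v - cy) * z / fy, z])

# A joint observed at the principal point lifts to (0, 0, z).
print(backproject(640.0, 360.0, 2.5, K))
```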
arXiv Detail & Related papers (2020-08-26T09:56:07Z)
- Lightweight Multi-View 3D Pose Estimation through Camera-Disentangled Representation [57.11299763566534]
We present a solution to recover 3D pose from multi-view images captured with spatially calibrated cameras.
We exploit 3D geometry to fuse input images into a unified latent representation of pose, which is disentangled from camera view-points.
Our architecture then conditions the learned representation on camera projection operators to produce accurate per-view 2d detections.
arXiv Detail & Related papers (2020-04-05T12:52:29Z)
- Fusing Wearable IMUs with Multi-View Images for Human Pose Estimation: A Geometric Approach [76.10879433430466]
We propose to estimate 3D human pose from multi-view images and a few IMUs attached to the person's limbs.
It operates by first detecting 2D poses from the two signals and then lifting them to 3D space.
The simple two-step approach reduces the error of the state-of-the-art by a large margin on a public dataset.
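The lifting step in such detect-then-lift pipelines is typically multi-view triangulation. An illustrative linear (DLT) triangulation sketch, not the paper's exact geometric fusion; the two cameras and the observed joint below are synthetic:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """DLT triangulation of one 3D point from two 3x4 projection
    matrices and the corresponding pixel observations x1, x2."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)   # null vector of A is the homogeneous point
    X = Vt[-1]
    return X[:3] / X[3]

# Two synthetic cameras one metre apart observing a joint at (0.5, -0.2, 5).
K = np.array([[100.0, 0.0, 0.0], [0.0, 100.0, 0.0], [0.0, 0.0, 1.0]])
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, -0.2, 5.0])
x1 = P1 @ np.append(X_true, 1.0); x1 = x1[:2] / x1[2]
x2 = P2 @ np.append(X_true, 1.0); x2 = x2[:2] / x2[2]
print(np.allclose(triangulate(P1, P2, x1, x2), X_true))  # → True
```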
arXiv Detail & Related papers (2020-03-25T00:26:54Z) - Chained Representation Cycling: Learning to Estimate 3D Human Pose and
Shape by Cycling Between Representations [73.11883464562895]
We propose a new architecture that facilitates unsupervised, or lightly supervised, learning.
We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images.
While we present results for modeling humans, our formulation is general and can be applied to other vision problems.
arXiv Detail & Related papers (2020-01-06T14:54:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences.