Reconstructing Close Human Interactions from Multiple Views
- URL: http://arxiv.org/abs/2401.16173v1
- Date: Mon, 29 Jan 2024 14:08:02 GMT
- Title: Reconstructing Close Human Interactions from Multiple Views
- Authors: Qing Shuai, Zhiyuan Yu, Zhize Zhou, Lixin Fan, Haijun Yang, Can Yang,
Xiaowei Zhou
- Abstract summary: This paper addresses the challenging task of reconstructing the poses of multiple individuals engaged in close interactions, captured by multiple calibrated cameras.
We introduce a novel system to address these challenges.
Our system integrates a learning-based pose estimation component and its corresponding training and inference strategies.
- Score: 38.924950289788804
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper addresses the challenging task of reconstructing the poses of
multiple individuals engaged in close interactions, captured by multiple
calibrated cameras. The difficulty arises from the noisy or false 2D keypoint
detections due to inter-person occlusion, the heavy ambiguity in associating
keypoints to individuals due to the close interactions, and the scarcity of
training data as collecting and annotating motion data in crowded scenes is
resource-intensive. We introduce a novel system to address these challenges.
Our system integrates a learning-based pose estimation component and its
corresponding training and inference strategies. The pose estimation component
takes multi-view 2D keypoint heatmaps as input and reconstructs the pose of
each individual using a 3D conditional volumetric network. As the network
doesn't need images as input, we can leverage known camera parameters from test
scenes and a large quantity of existing motion capture data to synthesize
massive training data that mimics the real data distribution in test scenes.
Extensive experiments demonstrate that our approach significantly surpasses
previous approaches in terms of pose accuracy and is generalizable across
various camera setups and population sizes. The code is available on our
project page: https://github.com/zju3dv/CloseMoCap.
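The abstract's core idea, feeding multi-view 2D keypoint heatmaps into a 3D volumetric network with known camera parameters, rests on unprojecting each view's heatmap into a shared voxel grid. A minimal NumPy sketch of that aggregation step (not the paper's actual network; the function name, nearest-neighbour sampling, and mean fusion are illustrative assumptions):

```python
import numpy as np

def unproject_heatmaps(heatmaps, proj_mats, grid):
    """Aggregate multi-view 2D keypoint heatmaps into per-voxel scores.

    heatmaps : (V, H, W) array, one keypoint heatmap per view
    proj_mats: (V, 3, 4) camera projection matrices K [R|t]
    grid     : (N, 3) world-space voxel centres
    Returns a (N,) score per voxel, averaged over views.
    """
    V, H, W = heatmaps.shape
    homog = np.concatenate([grid, np.ones((len(grid), 1))], axis=1)  # (N, 4)
    scores = np.zeros(len(grid))
    for v in range(V):
        uvw = homog @ proj_mats[v].T           # project voxels into view v
        uv = uvw[:, :2] / uvw[:, 2:3]          # perspective divide -> pixels
        u = np.clip(np.round(uv[:, 0]).astype(int), 0, W - 1)
        r = np.clip(np.round(uv[:, 1]).astype(int), 0, H - 1)
        scores += heatmaps[v, r, u]            # nearest-neighbour sampling
    return scores / V
```

Because this step consumes only heatmaps and camera matrices, never raw images, one can render synthetic heatmaps from existing motion-capture data with the test scene's cameras, which is what makes the paper's large-scale training-data synthesis possible.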
Related papers
- Multi-person 3D pose estimation from unlabelled data [2.54990557236581]
We present a model based on Graph Neural Networks capable of predicting the cross-view correspondence of the people in the scene.
We also present a Multilayer Perceptron that takes the 2D points to yield the 3D poses of each person.
arXiv Detail & Related papers (2022-12-16T22:03:37Z)
- MetaPose: Fast 3D Pose from Multiple Views without 3D Supervision [72.5863451123577]
We show how to train a neural model that can perform accurate 3D pose and camera estimation.
Our method outperforms both classical bundle adjustment and weakly-supervised monocular 3D baselines.
arXiv Detail & Related papers (2021-08-10T18:39:56Z)
- Multi-View Multi-Person 3D Pose Estimation with Plane Sweep Stereo [71.59494156155309]
Existing approaches for multi-view 3D pose estimation explicitly establish cross-view correspondences to group 2D pose detections from multiple camera views.
We present our multi-view 3D pose estimation approach based on plane sweep stereo to jointly address the cross-view fusion and 3D pose reconstruction in a single shot.
arXiv Detail & Related papers (2021-04-06T03:49:35Z)
- View-Invariant, Occlusion-Robust Probabilistic Embedding for Human Pose [36.384824115033304]
We propose an approach to learning a compact view-invariant embedding space from 2D body joint keypoints, without explicitly predicting 3D poses.
Experimental results show that our embedding model achieves higher accuracy when retrieving similar poses across different camera views.
arXiv Detail & Related papers (2020-10-23T17:58:35Z)
- Toward Accurate Person-level Action Recognition in Videos of Crowded Scenes [131.9067467127761]
We focus on improving action recognition by fully utilizing scene information and collecting new data.
Specifically, we adopt a strong human detector to detect the spatial location of each person in each frame.
We then apply action recognition models to learn the temporal information from video frames on both the HIE dataset and new data with diverse scenes from the internet.
arXiv Detail & Related papers (2020-10-16T13:08:50Z)
- Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
Many existing methods degrade when the target person is occluded by other objects, or when the motion is too fast or slow relative to the scale and speed of the training data.
We introduce a spatio-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z)
- Self-Supervised Multi-View Synchronization Learning for 3D Pose Estimation [39.334995719523]
Current methods cast monocular 3D human pose estimation as a learning problem by training neural networks on large data sets of images and corresponding skeleton poses.
We propose an approach that can exploit small annotated data sets by fine-tuning networks pre-trained via self-supervised learning on (large) unlabeled data sets.
We demonstrate the effectiveness of the synchronization task on the Human3.6M data set and achieve state-of-the-art results in 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T08:01:24Z)
- Leveraging Photometric Consistency over Time for Sparsely Supervised Hand-Object Reconstruction [118.21363599332493]
We present a method to leverage photometric consistency across time when annotations are only available for a sparse subset of frames in a video.
Our model is trained end-to-end on color images to jointly reconstruct hands and objects in 3D by inferring their poses.
We achieve state-of-the-art results on 3D hand-object reconstruction benchmarks and demonstrate that our approach allows us to improve the pose estimation accuracy.
arXiv Detail & Related papers (2020-04-28T12:03:14Z)
- Self-supervised Keypoint Correspondences for Multi-Person Pose Estimation and Tracking in Videos [32.43899916477434]
We propose an approach that relies on keypoint correspondences for associating persons in videos.
Instead of training the network to estimate keypoint correspondences on video data, we train it on large-scale image datasets for human pose estimation.
Our approach achieves state-of-the-art results for multi-frame pose estimation and multi-person pose tracking on the PoseTrack 2017 and PoseTrack 2018 datasets.
arXiv Detail & Related papers (2020-04-27T09:02:24Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
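Several entries above (the GNN cross-view association model, MetaPose's comparison against classical bundle adjustment, and the plane sweep stereo approach) implicitly contrast learned multi-view fusion with classical triangulation of matched 2D detections. As background, a minimal direct linear transform (DLT) triangulation sketch; the function name and the toy camera matrices in the usage test are illustrative assumptions, not from any of the papers:

```python
import numpy as np

def triangulate_dlt(proj_mats, points_2d):
    """Classical DLT triangulation: recover one 3D point from its
    2D observations in several calibrated views.

    proj_mats : (V, 3, 4) camera projection matrices
    points_2d : (V, 2) pixel observations, one per view
    Returns the 3D point as a (3,) array.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, points_2d):
        # Each view adds two linear constraints on homogeneous X = (x, y, z, 1)
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    # The homogeneous solution is the right singular vector of the
    # smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]
```

With noisy detections and close interactions, the weak point of this pipeline is not the triangulation itself but the cross-view association of keypoints to individuals, which is exactly what the learning-based approaches above try to sidestep.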
This list is automatically generated from the titles and abstracts of the papers in this site.