Self-Supervised Multi-View Synchronization Learning for 3D Pose
Estimation
- URL: http://arxiv.org/abs/2010.06218v1
- Date: Tue, 13 Oct 2020 08:01:24 GMT
- Title: Self-Supervised Multi-View Synchronization Learning for 3D Pose
Estimation
- Authors: Simon Jenni, Paolo Favaro
- Abstract summary: Current methods cast monocular 3D human pose estimation as a learning problem by training neural networks on large data sets of images and corresponding skeleton poses.
We propose an approach that can exploit small annotated data sets by fine-tuning networks pre-trained via self-supervised learning on (large) unlabeled data sets.
We demonstrate the effectiveness of the synchronization task on the Human3.6M data set and achieve state-of-the-art results in 3D human pose estimation.
- Score: 39.334995719523
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Current state-of-the-art methods cast monocular 3D human pose estimation as a
learning problem by training neural networks on large data sets of images and
corresponding skeleton poses. In contrast, we propose an approach that can
exploit small annotated data sets by fine-tuning networks pre-trained via
self-supervised learning on (large) unlabeled data sets. To drive such networks
towards supporting 3D pose estimation during the pre-training step, we
introduce a novel self-supervised feature learning task designed to focus on
the 3D structure in an image. We exploit images extracted from videos captured
with a multi-view camera system. The task is to classify whether two images
depict two views of the same scene up to a rigid transformation. In a
multi-view data set, where objects deform in a non-rigid manner, a rigid
transformation occurs only between two views taken at the exact same time,
i.e., when they are synchronized. We demonstrate the effectiveness of the
synchronization task on the Human3.6M data set and achieve state-of-the-art
results in 3D human pose estimation.
Related papers
- Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - Enhancing Egocentric 3D Pose Estimation with Third Person Views [37.9683439632693]
We propose a novel approach to enhance the 3D body pose estimation of a person computed from videos captured from a single wearable camera.
We introduce First2Third-Pose, a new paired synchronized dataset of nearly 2,000 videos depicting human activities captured from both first- and third-view perspectives.
Experimental results demonstrate that the joint multi-view embedded space learned with our dataset is useful to extract discriminatory features from arbitrary single-view egocentric videos.
arXiv Detail & Related papers (2022-01-06T11:42:01Z) - Graph-Based 3D Multi-Person Pose Estimation Using Multi-View Images [79.70127290464514]
We decompose the task into two stages, i.e. person localization and pose estimation.
And we propose three task-specific graph neural networks for effective message passing.
Our approach achieves state-of-the-art performance on CMU Panoptic and Shelf datasets.
arXiv Detail & Related papers (2021-09-13T11:44:07Z) - Multi-Scale Networks for 3D Human Pose Estimation with Inference Stage
Optimization [33.02708860641971]
Estimating 3D human poses from a monocular video is still a challenging task.
Many existing methods drop when the target person is cluded by other objects, or the motion is too fast/slow relative to the scale and speed of the training data.
We introduce atemporal-temporal network for robust 3D human pose estimation.
arXiv Detail & Related papers (2020-10-13T15:24:28Z) - Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z) - Self-supervised Feature Learning by Cross-modality and Cross-view
Correspondences [32.01548991331616]
This paper presents a novel self-supervised learning approach to learn both 2D image features and 3D point cloud features.
It exploits cross-modality and cross-view correspondences without using any annotated human labels.
The effectiveness of the learned 2D and 3D features is evaluated by transferring them on five different tasks.
arXiv Detail & Related papers (2020-04-13T02:57:25Z) - Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image
Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z) - Weakly-Supervised 3D Human Pose Learning via Multi-view Images in the
Wild [101.70320427145388]
We propose a weakly-supervised approach that does not require 3D annotations and learns to estimate 3D poses from unlabeled multi-view data.
We evaluate our proposed approach on two large scale datasets.
arXiv Detail & Related papers (2020-03-17T08:47:16Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.