Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation
- URL: http://arxiv.org/abs/2012.01511v2
- Date: Thu, 25 Mar 2021 18:17:03 GMT
- Title: Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation
- Authors: Sina Honari, Victor Constantin, Helge Rhodin, Mathieu Salzmann, Pascal Fua
- Abstract summary: We use contrastive self-supervised learning to extract rich latent vectors from single-view videos.
We show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and away features, yields a rich latent space.
Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.
- Score: 121.5383855764944
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: In the presence of annotated data, deep human pose estimation networks yield
impressive performance. Nevertheless, annotating new data is extremely
time-consuming, particularly in real-world conditions. Here, we address this by
leveraging contrastive self-supervised (CSS) learning to extract rich latent
vectors from single-view videos. Instead of simply treating the latent features
of nearby frames as positive pairs and those of temporally-distant ones as
negative pairs as in other CSS approaches, we explicitly disentangle each
latent vector into a time-variant component and a time-invariant one. We then
show that applying CSS only to the time-variant features, while also
reconstructing the input and encouraging a gradual transition between nearby
and away features, yields a rich latent space, well-suited for human pose
estimation. Our approach outperforms other unsupervised single-view methods and
matches the performance of multi-view techniques.
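Illustrative sketch: the PyTorch snippet below is a minimal, simplified rendering of the objective described in the abstract, not the authors' released code. All names (encoder, decoder, split_latent, n_var, tau) are hypothetical, and the loss that encourages a gradual transition between nearby and away features is omitted for brevity. It shows the core idea: split each latent vector into a time-variant and a time-invariant part, apply an InfoNCE-style contrastive loss only to the time-variant part (nearby frames as positives, temporally-distant frames as negatives), and reconstruct the input from the full latent.

```python
# Minimal sketch of the disentangled contrastive + reconstruction objective.
# All module and variable names are illustrative assumptions.
import torch
import torch.nn.functional as F

def split_latent(z, n_var):
    """Split a latent vector into time-variant (first n_var dims) and time-invariant parts."""
    return z[:, :n_var], z[:, n_var:]

def info_nce(anchor, positive, negatives, tau=0.1):
    """InfoNCE loss: pull the positive towards the anchor, push all negatives away."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    pos_sim = (anchor * positive).sum(-1, keepdim=True) / tau   # (B, 1)
    neg_sim = anchor @ negatives.t() / tau                      # (B, N) similarities to negatives
    logits = torch.cat([pos_sim, neg_sim], dim=1)
    targets = torch.zeros(len(anchor), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, targets)                     # positive sits at index 0

def training_step(encoder, decoder, frame_t, frame_near, frame_far, n_var=64):
    z_t = encoder(frame_t)        # latent of the current frame
    z_near = encoder(frame_near)  # temporally nearby frame -> positive
    z_far = encoder(frame_far)    # temporally distant frame -> negative

    # Contrastive loss is applied only to the time-variant component.
    v_t, i_t = split_latent(z_t, n_var)
    v_near, _ = split_latent(z_near, n_var)
    v_far, _ = split_latent(z_far, n_var)
    loss_css = info_nce(v_t, v_near, v_far)

    # Reconstruction keeps the full (variant + invariant) latent informative about the input.
    loss_rec = F.mse_loss(decoder(torch.cat([v_t, i_t], dim=1)), frame_t)
    return loss_css + loss_rec
```

In the paper this contrastive term is further combined with a loss encouraging a gradual transition between nearby and far-away features; the sketch above only covers the disentangled contrastive and reconstruction parts.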
Related papers
- A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation on the TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z)
- Spatio-Temporal Self-Attention Network for Video Saliency Prediction [13.873682190242365]
3D convolutional neural networks have achieved promising results for video tasks in computer vision.
We propose a novel Spatio-Temporal Self-Attention 3D Network (STSANet) for video saliency prediction.
arXiv Detail & Related papers (2021-08-24T12:52:47Z)
- Revisiting Contrastive Methods for Unsupervised Learning of Visual Representations [78.12377360145078]
Contrastive self-supervised learning has outperformed supervised pretraining on many downstream tasks like segmentation and object detection.
In this paper, we first study how biases in the dataset affect existing methods.
We show that current contrastive approaches work surprisingly well across: (i) object- versus scene-centric, (ii) uniform versus long-tailed and (iii) general versus domain-specific datasets.
arXiv Detail & Related papers (2021-06-10T17:59:13Z)
- A Large-Scale Study on Unsupervised Spatiotemporal Representation Learning [60.720251418816815]
We present a large-scale study on unsupervised representation learning from videos.
Our objective encourages temporally-persistent features in the same video.
We find that encouraging long-spanned persistency can be effective even if the timespan is 60 seconds.
arXiv Detail & Related papers (2021-04-29T17:59:53Z)
- Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z)
- Unsupervised Cross-Modal Alignment for Multi-Person 3D Pose Estimation [52.94078950641959]
We present a deployment-friendly, fast bottom-up framework for multi-person 3D human pose estimation.
We adopt a novel neural representation of multi-person 3D pose which unifies the position of person instances with their corresponding 3D pose representation.
We propose a practical deployment paradigm where paired 2D or 3D pose annotations are unavailable.
arXiv Detail & Related papers (2020-08-04T07:54:25Z)
- 3D Human Pose Estimation using Spatio-Temporal Networks with Explicit Occlusion Training [40.933783830017035]
Estimating 3D poses from a monocular video is still a challenging task, despite the significant progress made in recent years.
We introduce a spatio-temporal video network for robust 3D human pose estimation.
We apply multi-scale spatial features to predict 2D joints or keypoints in each individual frame, and multi-stride temporal convolutional networks (TCNs) to estimate 3D joints or keypoints (see the TCN sketch after this list).
arXiv Detail & Related papers (2020-04-07T09:12:12Z)
- Deep Reinforcement Learning for Active Human Pose Estimation [35.229529080763925]
We introduce Pose-DRL, a fully trainable deep reinforcement learning-based active pose estimation architecture.
We show that our model learns to select viewpoints that yield significantly more accurate pose estimates compared to strong multi-view baselines.
arXiv Detail & Related papers (2020-01-07T13:35:41Z)
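As referenced in the entry on spatio-temporal networks with explicit occlusion training above, the sketch below illustrates the general idea of lifting a window of per-frame 2D keypoints to a 3D pose with a temporal convolutional network. It is a simplified single-branch variant with assumed layer sizes, strides, and names, not the cited paper's multi-stride architecture.

```python
# Minimal sketch of a temporal convolutional network (TCN) that lifts a window
# of 2D keypoints to a single 3D pose. Layer sizes, strides, and names are
# illustrative assumptions, not the architecture of the cited paper.
import torch
import torch.nn as nn

class Lifting2Dto3DTCN(nn.Module):
    def __init__(self, n_joints=17, channels=256):
        super().__init__()
        in_ch = n_joints * 2   # (x, y) per joint, per frame
        out_ch = n_joints * 3  # (x, y, z) per joint for the centre frame
        self.net = nn.Sequential(
            # Strided temporal convolutions progressively shrink the window.
            nn.Conv1d(in_ch, channels, kernel_size=3, stride=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, stride=3), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=3, stride=3), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),  # collapse the remaining temporal dimension
            nn.Conv1d(channels, out_ch, kernel_size=1),
        )

    def forward(self, keypoints_2d):
        # keypoints_2d: (batch, frames, joints, 2)
        b, t, j, _ = keypoints_2d.shape
        x = keypoints_2d.reshape(b, t, j * 2).transpose(1, 2)  # (batch, 2*joints, frames)
        return self.net(x).reshape(b, j, 3)                    # 3D pose for the centre frame

# Usage: lift a 27-frame window of 17 2D joints to a single 3D pose.
model = Lifting2Dto3DTCN()
pose_3d = model(torch.randn(4, 27, 17, 2))  # -> (4, 17, 3)
```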
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.