Deep Reinforcement Learning for Active Human Pose Estimation
- URL: http://arxiv.org/abs/2001.02024v2
- Date: Wed, 16 Dec 2020 10:23:42 GMT
- Title: Deep Reinforcement Learning for Active Human Pose Estimation
- Authors: Erik G\"artner, Aleksis Pirinen, Cristian Sminchisescu
- Abstract summary: We introduce Pose-DRL, a fully trainable deep reinforcement learning-based active pose estimation architecture.
We show that our model learns to select viewpoints that yield significantly more accurate pose estimates compared to strong multi-view baselines.
- Score: 35.229529080763925
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Most 3d human pose estimation methods assume that input -- be it images of a
scene collected from one or several viewpoints, or from a video -- is given.
Consequently, they focus on estimates leveraging prior knowledge and
measurement by fusing information spatially and/or temporally, whenever
available. In this paper we address the problem of an active observer with
freedom to move and explore the scene spatially -- in `time-freeze' mode --
and/or temporally, by selecting informative viewpoints that improve its
estimation accuracy. Towards this end, we introduce Pose-DRL, a fully trainable
deep reinforcement learning-based active pose estimation architecture which
learns to select appropriate views, in space and time, to feed an underlying
monocular pose estimator. We evaluate our model using single- and multi-target
estimators with strong result in both settings. Our system further learns
automatic stopping conditions in time and transition functions to the next
temporal processing step in videos. In extensive experiments with the Panoptic
multi-view setup, and for complex scenes containing multiple people, we show
that our model learns to select viewpoints that yield significantly more
accurate pose estimates compared to strong multi-view baselines.
Related papers
- TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting [27.3359362364858]
We present an efficient multi-view pose estimation model that learns a robust temporal representation.
Our model is able to generalize across datasets without fine-tuning.
arXiv Detail & Related papers (2023-09-14T17:56:30Z) - Mutual Information-Based Temporal Difference Learning for Human Pose
Estimation in Video [16.32910684198013]
We present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts.
To be specific, we design a multi-stage entangled learning sequences conditioned on multi-stage differences to derive informative motion representation sequences.
These place us to rank No.1 in the Crowd Pose Estimation in Complex Events Challenge on benchmark HiEve.
arXiv Detail & Related papers (2023-03-15T09:29:03Z) - Live Stream Temporally Embedded 3D Human Body Pose and Shape Estimation [13.40702053084305]
We present a temporally embedded 3D human body pose and shape estimation (TePose) method to improve the accuracy and temporal consistency pose in live stream videos.
A multi-scale convolutional network is presented as the motion discriminator for adversarial training using datasets without any 3D labeling.
arXiv Detail & Related papers (2022-07-25T21:21:59Z) - Snipper: A Spatiotemporal Transformer for Simultaneous Multi-Person 3D
Pose Estimation Tracking and Forecasting on a Video Snippet [24.852728097115744]
Multi-person pose understanding from RGB involves three complex tasks: pose estimation, tracking and motion forecasting.
Most existing works either focus on a single task or employ multi-stage approaches to solving multiple tasks separately.
We propose Snipper, a unified framework to perform multi-person 3D pose estimation, tracking, and motion forecasting simultaneously in a single stage.
arXiv Detail & Related papers (2022-07-09T18:42:14Z) - A Spatio-Temporal Multilayer Perceptron for Gesture Recognition [70.34489104710366]
We propose a multilayer state-weighted perceptron for gesture recognition in the context of autonomous vehicles.
An evaluation of TCG and Drive&Act datasets is provided to showcase the promising performance of our approach.
We deploy our model to our autonomous vehicle to show its real-time capability and stable execution.
arXiv Detail & Related papers (2022-04-25T08:42:47Z) - SurroundDepth: Entangling Surrounding Views for Self-Supervised
Multi-Camera Depth Estimation [101.55622133406446]
We propose a SurroundDepth method to incorporate the information from multiple surrounding views to predict depth maps across cameras.
Specifically, we employ a joint network to process all the surrounding views and propose a cross-view transformer to effectively fuse the information from multiple views.
In experiments, our method achieves the state-of-the-art performance on the challenging multi-camera depth estimation datasets.
arXiv Detail & Related papers (2022-04-07T17:58:47Z) - Learning Dynamics via Graph Neural Networks for Human Pose Estimation
and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of pose detections in current fame.
Specifically, we derive this prediction of dynamics through a graph neural network(GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z) - Self-Attentive 3D Human Pose and Shape Estimation from Videos [82.63503361008607]
We present a video-based learning algorithm for 3D human pose and shape estimation.
We exploit temporal information in videos and propose a self-attention module.
We evaluate our method on the 3DPW, MPI-INF-3DHP, and Human3.6M datasets.
arXiv Detail & Related papers (2021-03-26T00:02:19Z) - Unsupervised Learning on Monocular Videos for 3D Human Pose Estimation [121.5383855764944]
We use contrastive self-supervised learning to extract rich latent vectors from single-view videos.
We show that applying CSS only to the time-variant features, while also reconstructing the input and encouraging a gradual transition between nearby and away features, yields a rich latent space.
Our approach outperforms other unsupervised single-view methods and matches the performance of multi-view techniques.
arXiv Detail & Related papers (2020-12-02T20:27:35Z) - Towards Accurate Human Pose Estimation in Videos of Crowded Scenes [134.60638597115872]
We focus on improving human pose estimation in videos of crowded scenes from the perspectives of exploiting temporal context and collecting new data.
For one frame, we forward the historical poses from the previous frames and backward the future poses from the subsequent frames to current frame, leading to stable and accurate human pose estimation in videos.
In this way, our model achieves best performance on 7 out of 13 videos and 56.33 average w_AP on test dataset of HIE challenge.
arXiv Detail & Related papers (2020-10-16T13:19:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.