Selective Spatio-Temporal Aggregation Based Pose Refinement System:
Towards Understanding Human Activities in Real-World Videos
- URL: http://arxiv.org/abs/2011.05358v1
- Date: Tue, 10 Nov 2020 19:19:51 GMT
- Title: Selective Spatio-Temporal Aggregation Based Pose Refinement System:
Towards Understanding Human Activities in Real-World Videos
- Authors: Di Yang, Rui Dai, Yaohui Wang, Rupayan Mallick, Luca Minciullo,
Gianpiero Francesca, Francois Bremond
- Abstract summary: State-of-the-art pose estimators struggle in obtaining high-quality 2D or 3D pose data due to truncation and low-resolution in real-world un-annotated videos.
We propose a Selective Spatio-Temporal Aggregation mechanism, named SST-A, that refines and smooths the keypoint locations extracted by multiple expert pose estimators.
We demonstrate that the skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at boosting various existing action recognition models.
- Score: 8.571131862820833
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Taking advantage of human pose data for understanding human activities has
attracted much attention these days. However, state-of-the-art pose estimators
struggle in obtaining high-quality 2D or 3D pose data due to occlusion,
truncation and low-resolution in real-world un-annotated videos. Hence, in this
work, we propose 1) a Selective Spatio-Temporal Aggregation mechanism, named
SST-A, that refines and smooths the keypoint locations extracted by multiple
expert pose estimators, 2) an effective weakly-supervised self-training
framework which leverages the aggregated poses as pseudo ground-truth instead
of handcrafted annotations for real-world pose estimation. Extensive
experiments are conducted for evaluating not only the upstream pose refinement
but also the downstream action recognition performance on four datasets, Toyota
Smarthome, NTU-RGB+D, Charades, and Kinetics-50. We demonstrate that the
skeleton data refined by our Pose-Refinement system (SSTA-PRS) is effective at
boosting various existing action recognition models, which achieves competitive
or state-of-the-art performance.
Related papers
- In My Perspective, In My Hands: Accurate Egocentric 2D Hand Pose and Action Recognition [1.4732811715354455]
Action recognition is essential for egocentric video understanding, allowing automatic and continuous monitoring of Activities of Daily Living (ADLs) without user effort.
Existing literature focuses on 3D hand pose input, which requires computationally intensive depth estimation networks or wearing an uncomfortable depth sensor.
We introduce two novel approaches for 2D hand pose estimation, namely EffHandNet for single-hand estimation and EffHandEgoNet, tailored for an egocentric perspective.
arXiv Detail & Related papers (2024-04-14T17:33:33Z)
- Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling [13.284947022380404]
We propose a two-stage framework that can obtain accurate and smooth full-body motions with three tracking signals of head and hands only.
Our framework explicitly models the joint-level features in the first stage and utilizes them as temporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage.
With extensive experiments on the AMASS motion dataset and real-captured data, we show our proposed method can achieve more accurate and smooth motion compared to existing approaches.
arXiv Detail & Related papers (2023-08-17T08:27:55Z)
- Domain Adaptive 3D Pose Augmentation for In-the-wild Human Mesh Recovery [32.73513554145019]
Domain Adaptive 3D Pose Augmentation (DAPA) is a data augmentation method that enhances the model's generalization ability in in-the-wild scenarios.
We show quantitatively that finetuning with DAPA effectively improves results on benchmarks 3DPW and AGORA.
arXiv Detail & Related papers (2022-06-21T15:02:31Z)
- PoseTriplet: Co-evolving 3D Human Pose Estimation, Imitation, and Hallucination under Self-supervision [102.48681650013698]
Existing self-supervised 3D human pose estimation schemes have largely relied on weak supervisions to guide the learning.
We propose a novel self-supervised approach that allows us to explicitly generate 2D-3D pose pairs for augmenting supervision.
This is made possible via introducing a reinforcement-learning-based imitator, which is learned jointly with a pose estimator alongside a pose hallucinator.
arXiv Detail & Related papers (2022-03-29T14:45:53Z)
- Uncertainty-Aware Adaptation for Self-Supervised 3D Human Pose Estimation [70.32536356351706]
We introduce MRP-Net that constitutes a common deep network backbone with two output heads subscribing to two diverse configurations.
We derive suitable measures to quantify prediction uncertainty at both pose and joint level.
We present a comprehensive evaluation of the proposed approach and demonstrate state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2022-03-29T07:14:58Z)
- Occlusion-Robust Object Pose Estimation with Holistic Representation [42.27081423489484]
State-of-the-art (SOTA) object pose estimators take a two-stage approach.
We develop a novel occlude-and-blackout batch augmentation technique.
We also develop a multi-precision supervision architecture to encourage holistic pose representation learning.
arXiv Detail & Related papers (2021-10-22T08:00:26Z)
- Adversarial Motion Modelling helps Semi-supervised Hand Pose Estimation [116.07661813869196]
We propose to combine ideas from adversarial training and motion modelling to tap into unlabeled videos.
We show that adversarial training leads to better properties of the hand pose estimator via semi-supervised training on unlabeled video sequences.
The main advantage of our approach is that we can make use of unpaired videos and joint sequence data both of which are much easier to attain than paired training data.
arXiv Detail & Related papers (2021-06-10T17:50:19Z)
- TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild [77.59069361196404]
TRiPOD is a novel method for predicting body dynamics based on graph attentional networks.
To incorporate a real-world challenge, we learn an indicator representing whether an estimated body joint is visible/invisible at each frame.
Our evaluation shows that TRiPOD outperforms all prior work and state-of-the-art specifically designed for each of the trajectory and pose forecasting tasks.
arXiv Detail & Related papers (2021-04-08T20:01:00Z)
- Kinematic-Structure-Preserved Representation for Unsupervised 3D Human Pose Estimation [58.72192168935338]
Generalizability of human pose estimation models developed using supervision on large-scale in-studio datasets remains questionable.
We propose a novel kinematic-structure-preserved unsupervised 3D pose estimation framework, which is not restrained by any paired or unpaired weak supervisions.
Our proposed model employs three consecutive differentiable transformations named as forward-kinematics, camera-projection and spatial-map transformation.
arXiv Detail & Related papers (2020-06-24T23:56:33Z)
- Self-Supervised 3D Human Pose Estimation via Part Guided Novel Image Synthesis [72.34794624243281]
We propose a self-supervised learning framework to disentangle variations from unlabeled video frames.
Our differentiable formalization, bridging the representation gap between the 3D pose and spatial part maps, allows us to operate on videos with diverse camera movements.
arXiv Detail & Related papers (2020-04-09T07:55:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.