Kinematics Modeling Network for Video-based Human Pose Estimation
- URL: http://arxiv.org/abs/2207.10971v2
- Date: Wed, 17 Apr 2024 02:39:19 GMT
- Title: Kinematics Modeling Network for Video-based Human Pose Estimation
- Authors: Yonghao Dang, Jianqin Yin, Shaojie Zhang, Jiping Liu, Yanzhu Hu
- Abstract summary: Estimating human poses from videos is critical in human-computer interaction.
Joints cooperate rather than move independently during human movement.
We propose a plug-and-play kinematics modeling module (KMM) to explicitly model temporal correlations between joints.
- Score: 9.506011491028891
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Estimating human poses from videos is critical in human-computer interaction. Joints cooperate rather than move independently during human movement. There are both spatial and temporal correlations between joints. Despite the positive results of previous approaches, most focus on modeling the spatial correlation between joints while only straightforwardly integrating features along the temporal dimension, ignoring the temporal correlation between joints. In this work, we propose a plug-and-play kinematics modeling module (KMM) to explicitly model temporal correlations between joints across different frames by calculating their temporal similarity. In this way, KMM can capture motion cues of the current joint relative to all joints at different times. Besides, we formulate video-based human pose estimation as a Markov Decision Process and design a novel kinematics modeling network (KIMNet) to simulate the Markov Chain, allowing KIMNet to locate joints recursively. Our approach achieves state-of-the-art results on two challenging benchmarks. In particular, KIMNet shows robustness to occlusion. The code will be released at https://github.com/YHDang/KIMNet.
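The abstract describes the KMM as computing temporal similarity between joint features across frames to extract motion cues. The paper's exact formulation is not given here, so the following is only a minimal sketch under assumptions: cosine similarity as the temporal affinity, softmax-weighted aggregation as the motion cue, and the function name `temporal_joint_similarity` is illustrative, not from the paper.

```python
import numpy as np

def temporal_joint_similarity(feat_t, feat_prev):
    """Sketch of temporal correlation between joints across frames.

    feat_t:    (J, C) features of the J joints in the current frame
    feat_prev: (J, C) features of the J joints in an earlier frame
    Returns a (J, J) affinity whose entry (i, j) scores how similar
    joint i at time t is to joint j at the earlier time, plus a
    per-joint motion cue aggregated from the earlier frame.
    """
    # Normalize so the dot product becomes cosine similarity.
    a = feat_t / (np.linalg.norm(feat_t, axis=1, keepdims=True) + 1e-8)
    b = feat_prev / (np.linalg.norm(feat_prev, axis=1, keepdims=True) + 1e-8)
    sim = a @ b.T                        # (J, J) temporal affinity
    # Softmax over the earlier frame's joints, then mix their
    # features as motion cues for each current joint.
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    motion_cues = w @ feat_prev          # (J, C)
    return sim, motion_cues
```

Because the affinity is computed between the current joint and *all* joints of the other frame, each joint can draw motion evidence from co-moving joints, which is the property the abstract highlights.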
Related papers
- Video-Based Human Pose Regression via Decoupled Space-Time Aggregation [0.5524804393257919]
We develop an efficient and effective video-based human pose regression method, which bypasses intermediate representations such as heatmaps and instead directly maps the input to the joint coordinates.
Our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself.
Our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods.
arXiv Detail & Related papers (2024-03-29T02:26:22Z) - A Decoupled Spatio-Temporal Framework for Skeleton-based Action
Segmentation [89.86345494602642]
Existing methods are limited by weak temporal modeling capability.
We propose a Decoupled Spatial-Temporal Framework (DeST) to address these issues.
DeST significantly outperforms current state-of-the-art methods with less computational complexity.
arXiv Detail & Related papers (2023-12-10T09:11:39Z) - Multiscale Residual Learning of Graph Convolutional Sequence Chunks for
Human Motion Prediction [23.212848643552395]
A new method is proposed for human motion prediction by learning temporal and spatial dependencies.
Our proposed method is able to effectively model the sequence information for motion prediction and outperform other techniques to set a new state-of-the-art.
arXiv Detail & Related papers (2023-08-31T15:23:33Z) - Shuffled Autoregression For Motion Interpolation [53.61556200049156]
This work aims to provide a deep-learning solution for the motion task.
We propose a novel framework, referred to as Shuffled AutoRegression, which expands the autoregression to generate in arbitrary (shuffled) order.
We also propose an approach to constructing a particular kind of dependency graph, with three stages assembled into an end-to-end spatial-temporal motion Transformer.
arXiv Detail & Related papers (2023-06-10T07:14:59Z) - Video Frame Interpolation with Densely Queried Bilateral Correlation [52.823751291070906]
Video Frame Interpolation (VFI) aims to synthesize non-existent intermediate frames between existent frames.
Flow-based VFI algorithms estimate intermediate motion fields to warp the existent frames.
We propose Densely Queried Bilateral Correlation (DQBC) that gets rid of the receptive field dependency problem.
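The DQBC summary assumes familiarity with flow-based VFI: an intermediate motion field is estimated, then the existing frames are warped along it to synthesize the in-between frame. A minimal sketch of the warping step, assuming a backward (target-to-source) flow convention and nearest-neighbor sampling for brevity; real VFI pipelines use bilinear sampling and the function name `backward_warp` is illustrative.

```python
import numpy as np

def backward_warp(frame, flow):
    """Sketch of the warping step in flow-based VFI.

    frame: (H, W) grayscale image
    flow:  (H, W, 2) motion field; flow[y, x] = (dx, dy) points from
           the target (intermediate) frame into `frame`.
    Each target pixel samples frame at (x + dx, y + dy), clipped to
    the image bounds, with nearest-neighbor lookup.
    """
    H, W = frame.shape
    ys, xs = np.mgrid[0:H, 0:W]
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, W - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, H - 1)
    return frame[src_y, src_x]
```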
arXiv Detail & Related papers (2023-04-26T14:45:09Z) - MixSTE: Seq2seq Mixed Spatio-Temporal Encoder for 3D Human Pose
Estimation in Video [75.23812405203778]
Recent solutions have been introduced to estimate 3D human pose from a 2D keypoint sequence by considering body joints among all frames globally to learn spatio-temporal correlation.
We propose MixSTE, which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to model inter-joint spatial correlation.
In addition, the network output is extended from the central frame to the entire frames of the input video, improving the coherence between the input and output sequences.
arXiv Detail & Related papers (2022-03-02T04:20:59Z) - Motion Prediction via Joint Dependency Modeling in Phase Space [40.54430409142653]
We introduce a novel convolutional neural model to leverage explicit prior knowledge of motion anatomy.
We then propose a global optimization module that learns the implicit relationships between individual joint features.
Our method is evaluated on large-scale 3D human motion benchmark datasets.
arXiv Detail & Related papers (2022-01-07T08:30:01Z) - Relation-Based Associative Joint Location for Human Pose Estimation in
Videos [5.237054164442403]
We design a lightweight and plug-and-play joint relation extractor (JRE) to model the associative relationship between joints explicitly and automatically.
The JRE flexibly learns the relationship between any two joints, allowing it to learn the rich spatial configuration of human poses.
Then, combined with temporal semantic continuity modeling, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) for video-based human pose estimation.
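The JRE is described as learning a relation between any two joints. The module's internals are not given in this summary, so the following is only a sketch under assumptions: relations are represented as a (J, J) weight matrix (here supplied rather than learned) and applied as a row-normalized mixture over per-joint heatmaps; the name `relate_joints` is illustrative.

```python
import numpy as np

def relate_joints(heatmaps, relation):
    """Sketch of a pairwise joint-relation step in the spirit of the JRE.

    heatmaps: (J, H, W) per-joint heatmaps
    relation: (J, J) pairwise relation weights (learned in practice)
    Each refined heatmap is a relation-weighted mixture of all joints'
    heatmaps, letting spatially coupled joints (e.g. elbow and wrist)
    support each other's localization.
    """
    J, H, W = heatmaps.shape
    # Normalize each row so refined maps stay on the heatmap scale.
    w = relation / (relation.sum(axis=1, keepdims=True) + 1e-8)
    refined = (w @ heatmaps.reshape(J, -1)).reshape(J, H, W)
    return refined
```

With an identity relation this reduces to a no-op; off-diagonal weights let evidence for one joint flow into another, which is the "relationship between any two joints" the summary refers to.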
arXiv Detail & Related papers (2021-07-08T04:05:23Z) - Pose And Joint-Aware Action Recognition [87.4780883700755]
We present a new model for joint-based action recognition, which first extracts motion features from each joint separately through a shared motion encoder.
Our joint selector module re-weights the joint information to select the most discriminative joints for the task.
We show large improvements over the current state-of-the-art joint-based approaches on JHMDB, HMDB, Charades, AVA action recognition datasets.
arXiv Detail & Related papers (2020-10-16T04:43:34Z) - MotioNet: 3D Human Motion Reconstruction from Monocular Video with
Skeleton Consistency [72.82534577726334]
We introduce MotioNet, a deep neural network that directly reconstructs the motion of a 3D human skeleton from monocular video.
Our method is the first data-driven approach that directly outputs a kinematic skeleton, which is a complete, commonly used, motion representation.
arXiv Detail & Related papers (2020-06-22T08:50:09Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.