Optimizing Human Pose Estimation Through Focused Human and Joint Regions
- URL: http://arxiv.org/abs/2501.14439v1
- Date: Fri, 24 Jan 2025 12:17:47 GMT
- Title: Optimizing Human Pose Estimation Through Focused Human and Joint Regions
- Authors: Yingying Jiao, Zhigang Wang, Zhenguang Liu, Shaojing Fan, Sifan Wu, Zheqi Wu, Zhuoyue Xu,
- Abstract summary: Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, as well as surveillance.
Previous methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people.
We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions.
Our method achieves state-of-the-art performance on three large-scale benchmark
- Score: 18.37601213802529
- License:
- Abstract: Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, as well as surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while the current Transformer-based pose estimation methods has demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining that the model focuses solely on the regions centered at the target person body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.
Related papers
- HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes [10.237077867790612]
We present HOIMotion, a novel approach for human motion forecasting during human-object interactions.
Our method integrates information about past body poses and egocentric 3D object bounding boxes.
We show that HOIMotion consistently outperforms state-of-the-art methods by a large margin.
arXiv Detail & Related papers (2024-07-02T19:58:35Z) - View Consistent Purification for Accurate Cross-View Localization [59.48131378244399]
This paper proposes a fine-grained self-localization method for outdoor robotics.
The proposed method addresses limitations in existing cross-view localization methods.
It is the first sparse visual-only method that enhances perception in dynamic environments.
arXiv Detail & Related papers (2023-08-16T02:51:52Z) - DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose
Estimation [16.32910684198013]
We present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem.
We show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model.
arXiv Detail & Related papers (2023-07-31T14:00:23Z) - Weakly-supervised 3D Pose Transfer with Keypoints [57.66991032263699]
Main challenges of 3D pose transfer are: 1) Lack of paired training data with different characters performing the same pose; 2) Disentangling pose and shape information from the target mesh; 3) Difficulty in applying to meshes with different topologies.
We propose a novel weakly-supervised keypoint-based framework to overcome these difficulties.
arXiv Detail & Related papers (2023-07-25T12:40:24Z) - A survey of top-down approaches for human pose estimation [0.0]
State-of-the-art methods implemented with Deep Learning have brought remarkable results in the field of human pose estimation.
This paper aims to provide newcomers with an extensive review of deep learning methods-based 2D images for recognizing the pose of people.
arXiv Detail & Related papers (2022-02-05T23:27:46Z) - Improving Robustness and Accuracy via Relative Information Encoding in
3D Human Pose Estimation [59.94032196768748]
We propose a relative information encoding method that yields positional and temporal enhanced representations.
Our method outperforms state-of-the-art methods on two public datasets.
arXiv Detail & Related papers (2021-07-29T14:12:19Z) - Learning Dynamics via Graph Neural Networks for Human Pose Estimation
and Tracking [98.91894395941766]
We propose a novel online approach to learning the pose dynamics, which are independent of pose detections in current fame.
Specifically, we derive this prediction of dynamics through a graph neural network(GNN) that explicitly accounts for both spatial-temporal and visual information.
Experiments on PoseTrack 2017 and PoseTrack 2018 datasets demonstrate that the proposed method achieves results superior to the state of the art on both human pose estimation and tracking tasks.
arXiv Detail & Related papers (2021-06-07T16:36:50Z) - Regional Attention Network (RAN) for Head Pose and Fine-grained Gesture
Recognition [9.131161856493486]
We propose a novel end-to-end textbfRegional Attention Network (RAN), which is a fully Convolutional Neural Network (CNN)
Our regions consist of one or more consecutive cells and are adapted from the strategies used in computing HOG (Histogram of Oriented Gradient) descriptor.
The proposed approach outperforms the state-of-the-art by a considerable margin in different metrics.
arXiv Detail & Related papers (2021-01-17T10:14:28Z) - A review of 3D human pose estimation algorithms for markerless motion
capture [0.0]
We review the leading human pose estimation methods of the past five years, focusing on metrics, benchmarks and method structures.
We propose a taxonomy based on accuracy, speed and robustness that we use to classify de methods and derive directions for future research.
arXiv Detail & Related papers (2020-10-13T15:07:01Z) - Multi-person 3D Pose Estimation in Crowded Scenes Based on Multi-View
Geometry [62.29762409558553]
Epipolar constraints are at the core of feature matching and depth estimation in multi-person 3D human pose estimation methods.
Despite the satisfactory performance of this formulation in sparser crowd scenes, its effectiveness is frequently challenged under denser crowd circumstances.
In this paper, we depart from the multi-person 3D pose estimation formulation, and instead reformulate it as crowd pose estimation.
arXiv Detail & Related papers (2020-07-21T17:59:36Z) - Towards High Performance Human Keypoint Detection [87.1034745775229]
We find that context information plays an important role in reasoning human body configuration and invisible keypoints.
Inspired by this, we propose a cascaded context mixer ( CCM) which efficiently integrates spatial and channel context information.
To maximize CCM's representation capability, we develop a hard-negative person detection mining strategy and a joint-training strategy.
We present several sub-pixel refinement techniques for postprocessing keypoint predictions to improve detection accuracy.
arXiv Detail & Related papers (2020-02-03T02:24:51Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.