RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
- URL: http://arxiv.org/abs/2312.07526v2
- Date: Mon, 8 Apr 2024 13:40:43 GMT
- Title: RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation
- Authors: Peng Lu, Tao Jiang, Yining Li, Xiangtai Li, Kai Chen, Wenming Yang
- Abstract summary: RTMO is a one-stage pose estimation framework that seamlessly integrates coordinate classification.
It achieves accuracy comparable to top-down methods while maintaining high speed.
Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU.
- Score: 46.659592045271125
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases, existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance. This paper introduces RTMO, a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture, achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning, specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators, achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model, RTMO-l, attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU, demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo.
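To make the idea of "dual 1-D heatmaps" in the abstract concrete, the following is a minimal NumPy sketch of how a keypoint can be decoded from two separate 1-D classification outputs, one over horizontal bins and one over vertical bins. It is an illustration under stated assumptions, not the RTMO implementation: the function name `decode_dual_1d_heatmaps`, the uniform placement of bins across a bounding box, and the use of the smaller of the two peak probabilities as a confidence score are all hypothetical choices made for this example. The paper's dynamic coordinate classifier and its loss are not reproduced here.

```python
import numpy as np


def softmax(logits, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def decode_dual_1d_heatmaps(x_logits, y_logits, box):
    """Decode K keypoints from two 1-D heatmaps per keypoint.

    x_logits: array of shape (K, Bx), scores over horizontal bins.
    y_logits: array of shape (K, By), scores over vertical bins.
    box: (x0, y0, x1, y1) region that the bins span. Spacing the bins
         uniformly over this box is an assumption of this sketch.
    Returns (keypoints of shape (K, 2), confidence of shape (K,)).
    """
    x0, y0, x1, y1 = box
    # Bin centres along each axis (hypothetical uniform layout).
    x_bins = np.linspace(x0, x1, x_logits.shape[-1])
    y_bins = np.linspace(y0, y1, y_logits.shape[-1])

    # Each axis is treated as an independent classification problem.
    px = softmax(x_logits)
    py = softmax(y_logits)

    # Pick the most likely bin on each axis and read off its coordinate.
    xs = x_bins[px.argmax(axis=-1)]
    ys = y_bins[py.argmax(axis=-1)]

    # Illustrative confidence: the weaker of the two axis peaks.
    scores = np.minimum(px.max(axis=-1), py.max(axis=-1))
    return np.stack([xs, ys], axis=-1), scores


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # 17 keypoints (as in COCO), 64 bins per axis, random logits.
    kpts, conf = decode_dual_1d_heatmaps(
        rng.normal(size=(17, 64)),
        rng.normal(size=(17, 64)),
        box=(100.0, 50.0, 300.0, 450.0),
    )
    print(kpts.shape, conf.shape)
```

Representing each keypoint with two vectors of length Bx and By, rather than one 2-D map of size Bx × By, keeps the per-keypoint output small, which is one reason this kind of representation is attractive for dense one-stage predictors.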
Related papers
- Joint Coordinate Regression and Association For Multi-Person Pose Estimation, A Pure Neural Network Approach [3.7878984912613256]
We introduce a novel one-stage end-to-end multi-person 2D pose estimation algorithm, known as Joint Coordinate Regression and Association (JCRA).
The proposed algorithm is fast, accurate, effective, and simple. The one-stage end-to-end network architecture significantly improves the inference speed of JCRA.
Extensive experiments on the MS COCO and CrowdPose benchmarks demonstrate that JCRA outperforms state-of-the-art approaches in both accuracy and efficiency.
arXiv Detail & Related papers (2023-07-03T13:40:20Z)
- TAPIR: Tracking Any Point with per-frame Initialization and temporal Refinement [64.11385310305612]
We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence.
Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations.
The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS.
arXiv Detail & Related papers (2023-06-14T17:07:51Z)
- Efficient Adaptive Ensembling for Image Classification [3.7241274058257092]
We propose a novel method to boost image classification performance without increasing complexity.
We trained two EfficientNet-b0 end-to-end models on disjoint subsets of data.
We outperformed the state of the art by an average of 0.5% in accuracy.
arXiv Detail & Related papers (2022-06-15T08:55:47Z)
- Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation [79.78017059539526]
We propose a new heatmap-free keypoint estimation method in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework.
In experiments, we observe that KAPAO is significantly faster and more accurate than previous methods, which suffer greatly from heatmap post-processing.
Our large model, KAPAO-L, achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without test-time augmentation.
arXiv Detail & Related papers (2021-11-16T15:36:44Z)
- ZARTS: On Zero-order Optimization for Neural Architecture Search [94.41017048659664]
Differentiable architecture search (DARTS) has been a popular one-shot paradigm for NAS due to its high efficiency.
This work turns to zero-order optimization and proposes a novel NAS scheme, called ZARTS, to search without enforcing the above approximation.
In particular, results on 12 benchmarks verify the outstanding robustness of ZARTS, where the performance of DARTS collapses due to its known instability issue.
arXiv Detail & Related papers (2021-10-10T09:35:15Z)
- FasterPose: A Faster Simple Baseline for Human Pose Estimation [65.8413964785972]
We propose FasterPose, a design paradigm for cost-effective networks that use a low-resolution (LR) representation for efficient pose estimation.
We study the training behavior of FasterPose, and formulate a novel regressive cross-entropy (RCE) loss function for accelerating the convergence.
Compared with the previously dominant network for pose estimation, our method reduces FLOPs by 58% while improving accuracy by 1.3%.
arXiv Detail & Related papers (2021-07-07T13:39:08Z)
- SIMPLE: SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation [81.03485688525133]
We propose a novel multi-person pose estimation framework, SIngle-network with Mimicking and Point Learning for Bottom-up Human Pose Estimation (SIMPLE).
Specifically, in the training process, we enable SIMPLE to mimic the pose knowledge from the high-performance top-down pipeline.
In addition, SIMPLE formulates human detection and pose estimation as a unified point learning framework, so that the two tasks complement each other in a single network.
arXiv Detail & Related papers (2021-04-06T13:12:51Z)
- JGR-P2O: Joint Graph Reasoning based Pixel-to-Offset Prediction Network for 3D Hand Pose Estimation from a Single Depth Image [28.753759115780515]
State-of-the-art single depth image-based 3D hand pose estimation methods are based on dense predictions.
A novel pixel-wise prediction-based method is proposed to address the above issues.
The proposed model is implemented with an efficient 2D fully convolutional network backbone and has only about 1.4M parameters.
arXiv Detail & Related papers (2020-07-09T08:57:19Z)
- Single upper limb pose estimation method based on improved stacked hourglass network [5.342260499725028]
It is difficult to achieve both high accuracy and real-time performance in single-person pose estimation.
This paper proposes a single-person upper limb pose estimation method based on an end-to-end approach.
arXiv Detail & Related papers (2020-04-16T04:48:40Z)
- Compression of descriptor models for mobile applications [26.498907514590165]
We evaluate the computational cost, model size, and matching accuracy tradeoffs for deep neural networks.
We observe a significant redundancy in the learned weights, which we exploit through the use of depthwise separable layers.
We propose the Convolution-Depthwise-Pointwise (CDP) layer, which provides a means of interpolating between the standard and depthwise separable convolutions.
arXiv Detail & Related papers (2020-01-09T17:00:21Z)
This list is automatically generated from the titles and abstracts of the papers on this site.