Group Pose: A Simple Baseline for End-to-End Multi-person Pose
Estimation
- URL: http://arxiv.org/abs/2308.07313v1
- Date: Mon, 14 Aug 2023 17:58:04 GMT
- Authors: Huan Liu, Qiang Chen, Zichang Tan, Jiang-Jiang Liu, Jian Wang, Xiangbo
Su, Xiaolong Li, Kun Yao, Junyu Han, Errui Ding, Yao Zhao, Jingdong Wang
- Abstract summary: We present a simple yet effective transformer approach, named Group Pose.
We replace single self-attention over all the $N\times(K+1)$ queries with two subsequent group self-attentions.
Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods.
- Score: 102.02917299051757
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we study the problem of end-to-end multi-person pose
estimation. State-of-the-art solutions adopt a DETR-like framework and
mainly develop complex decoders, e.g., treating pose estimation as keypoint
box detection combined with human detection in ED-Pose, or hierarchically
predicting with a pose decoder and a joint (keypoint) decoder in PETR. We present a
simple yet effective transformer approach, named Group Pose. We simply regard
$K$-keypoint pose estimation as predicting a set of $N\times K$ keypoint
positions, each from a keypoint query, as well as representing each pose with
an instance query for scoring $N$ pose predictions. Motivated by the intuition
that interaction among across-instance queries of different types is not
directly helpful, we make a simple modification to decoder self-attention. We
replace single self-attention over all the $N\times(K+1)$ queries with two
subsequent group self-attentions: (i) $N$ within-instance self-attention, with
each over $K$ keypoint queries and one instance query, and (ii) $(K+1)$
same-type across-instance self-attention, each over $N$ queries of the same
type. The resulting decoder removes the interaction among across-instance
type-different queries, easing the optimization and thus improving the
performance. Experimental results on MS COCO and CrowdPose show that our
approach without human box supervision is superior to previous methods with
complex decoders, and is even slightly better than ED-Pose, which uses human box
supervision. Paddle (https://github.com/Michel-liu/GroupPose-Paddle) and
PyTorch (https://github.com/Michel-liu/GroupPose) code are available.
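The two group self-attentions described in the abstract can be sketched as follows. This is only an illustrative NumPy sketch, not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, heads, residual connections, or normalization, and the function names (`self_attention`, `group_self_attention`, `group_attention_masks`) are hypothetical. The queries are laid out as an array of shape (N, K+1, D), with one instance query and K keypoint queries per instance.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(q):
    # Scaled dot-product self-attention over the second-to-last axis.
    # q: (..., L, D). Assumption: no learned Q/K/V projections.
    d = q.shape[-1]
    scores = q @ np.swapaxes(q, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ q

def group_self_attention(queries):
    # queries: (N, K+1, D).
    # (i) N within-instance attentions, each over the K+1 queries of one instance.
    within = self_attention(queries)                    # (N, K+1, D)
    # (ii) K+1 same-type across-instance attentions, each over N queries.
    across = self_attention(np.swapaxes(within, 0, 1))  # (K+1, N, D)
    return np.swapaxes(across, 0, 1)                    # (N, K+1, D)

def group_attention_masks(N, K):
    # Boolean masks over the flattened N*(K+1) queries showing which
    # query pairs interact directly in each of the two steps.
    inst = np.repeat(np.arange(N), K + 1)  # instance index of each query
    typ = np.tile(np.arange(K + 1), N)     # type index of each query
    within = inst[:, None] == inst[None, :]
    across = typ[:, None] == typ[None, :]
    return within, across

# Example sizes (illustrative values, not taken from the paper).
N, K, D = 3, 17, 8
rng = np.random.default_rng(0)
queries = rng.normal(size=(N, K + 1, D))
out = group_self_attention(queries)
within_mask, across_mask = group_attention_masks(N, K)
# Pairs that never attend to each other directly: different instance AND different type.
removed_pairs = int((~within_mask & ~across_mask).sum())
```

Under these assumptions, the masks make the abstract's claim concrete: pairs of queries that differ in both instance and type are excluded from both steps, so the direct across-instance, type-different interaction is removed.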
Related papers
- Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation [24.973118696495977]
This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose.
It unifies the contextual learning between human-level (global) and keypoint-level (local) information.
For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone.
arXiv Detail & Related papers (2023-02-03T08:18:34Z)
- AdaptivePose++: A Powerful Single-Stage Network for Multi-Person Pose Regression [66.39539141222524]
We propose to represent the human parts as adaptive points and introduce a fine-grained body representation method.
With the proposed body representation, we deliver a compact single-stage multi-person pose regression network, termed AdaptivePose.
We employ AdaptivePose for both 2D and 3D multi-person pose estimation tasks to verify its effectiveness.
arXiv Detail & Related papers (2022-10-08T12:54:20Z)
- Pose for Everything: Towards Category-Agnostic Pose Estimation [93.07415325374761]
Category-Agnostic Pose Estimation (CAPE) aims to create a pose estimation model capable of detecting the pose of any class of object given only a few samples with keypoint definition.
A transformer-based Keypoint Interaction Module (KIM) is proposed to capture both the interactions among different keypoints and the relationship between the support and query images.
We also introduce the Multi-category Pose (MP-100) dataset, a 2D pose dataset of 100 object categories containing over 20K instances, which is designed for developing CAPE algorithms.
arXiv Detail & Related papers (2022-07-21T09:40:54Z)
- Progressive End-to-End Object Detection in Crowded Scenes [96.92416613336096]
Previous query-based detectors suffer from two drawbacks: first, multiple predictions will be inferred for a single object, typically in crowded scenes; second, the performance saturates as the depth of the decoding stage increases.
We propose a progressive prediction method to address the above issues. Specifically, we first select accepted queries to generate true positive predictions, then refine the remaining noisy queries according to the previously accepted predictions.
Experiments show that our method can significantly boost the performance of query-based detectors in crowded scenes.
arXiv Detail & Related papers (2022-03-15T06:12:00Z)
- Attend to Who You Are: Supervising Self-Attention for Keypoint Detection and Instance-Aware Association [40.78849763751773]
This paper presents a new method for keypoint detection and instance association using a Transformer.
We propose a novel approach of supervising self-attention for multi-person keypoint detection and instance association.
arXiv Detail & Related papers (2021-11-25T03:41:41Z)
- Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation [79.78017059539526]
We propose a new heatmap-free keypoint estimation method in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework.
In experiments, we observe that KAPAO is significantly faster and more accurate than previous methods, which suffer greatly from heatmap post-processing.
Our large model, KAPAO-L, achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without test-time augmentation.
arXiv Detail & Related papers (2021-11-16T15:36:44Z)
- Inconsistent Few-Shot Relation Classification via Cross-Attentional Prototype Networks with Contrastive Learning [16.128652726698522]
We propose Prototype Network-based cross-attention contrastive learning (ProtoCACL) to capture the rich mutual interactions between the support set and query set.
Experimental results demonstrate that our ProtoCACL can outperform the state-of-the-art baseline model under both inconsistent $K$ and inconsistent $N$ settings.
arXiv Detail & Related papers (2021-10-13T07:47:13Z)
- Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation [31.468003041368814]
We employ an Hourglass Network to infer all the keypoints from different persons indiscriminately.
We greedily group the candidate keypoints into multiple human poses, utilizing the predicted guiding offsets.
Our approach is comparable to the state of the art on the challenging COCO dataset under fair conditions.
arXiv Detail & Related papers (2021-07-07T09:32:01Z)
- Improving Robustness and Generality of NLP Models Using Disentangled Representations [62.08794500431367]
Supervised neural networks first map an input $x$ to a single representation $z$, and then map $z$ to the output label $y$.
We present methods to improve robustness and generality of NLP models from the standpoint of disentangled representation learning.
We show that models trained with the proposed criteria provide better robustness and domain adaptation ability in a wide range of supervised learning tasks.
arXiv Detail & Related papers (2020-09-21T02:48:46Z)