Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation
- URL: http://arxiv.org/abs/2302.01593v1
- Date: Fri, 3 Feb 2023 08:18:34 GMT
- Title: Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation
- Authors: Jie Yang, Ailing Zeng, Shilong Liu, Feng Li, Ruimao Zhang, Lei Zhang
- Abstract summary: This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose.
It unifies the contextual learning between human-level (global) and keypoint-level (local) information.
For the first time, as a fully end-to-end framework with an L1 regression loss, ED-Pose surpasses heatmap-based top-down methods under the same backbone.
- Score: 24.973118696495977
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a novel end-to-end framework with Explicit box Detection
for multi-person Pose estimation, called ED-Pose, which unifies the
contextual learning between human-level (global) and keypoint-level (local)
information. Different from previous one-stage methods, ED-Pose re-considers
this task as two explicit box detection processes with a unified representation
and regression supervision. First, we introduce a human detection decoder from
encoded tokens to extract global features. It can provide a good initialization
for the subsequent keypoint detection, making the training process converge fast.
Second, to bring in contextual information near keypoints, we regard pose
estimation as a keypoint box detection problem to learn both box positions and
contents for each keypoint. A human-to-keypoint detection decoder adopts an
interactive learning strategy between human and keypoint features to further
enhance global and local feature aggregation. In general, ED-Pose is
conceptually simple without post-processing and dense heatmap supervision. It
demonstrates its effectiveness and efficiency compared with both two-stage and
one-stage methods. Notably, explicit box detection boosts the pose estimation
performance by 4.5 AP on COCO and 9.9 AP on CrowdPose. For the first time, as a
fully end-to-end framework with an L1 regression loss, ED-Pose surpasses
heatmap-based top-down methods under the same backbone by 1.2 AP on COCO and
achieves the state-of-the-art with 76.6 AP on CrowdPose without bells and
whistles. Code is available at https://github.com/IDEA-Research/ED-Pose.
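
The abstract's idea of treating each keypoint as a small box and supervising it with an L1 regression loss can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the fixed `box_size`, the normalized `(cx, cy, w, h)` layout, and the visibility mask are all assumptions made for illustration.

```python
import numpy as np

def keypoints_to_boxes(keypoints, box_size=0.1):
    """Turn (K, 2) normalized keypoint coordinates into (K, 4) boxes.

    Each box is (cx, cy, w, h), centered on the keypoint.
    box_size is a hypothetical fixed side length, not a value from the paper.
    """
    keypoints = np.asarray(keypoints, dtype=np.float64)
    sizes = np.full_like(keypoints, box_size)
    return np.concatenate([keypoints, sizes], axis=1)

def l1_regression_loss(pred_boxes, target_boxes, visible=None):
    """Mean L1 distance between predicted and target keypoint boxes.

    visible is an optional (K,) boolean mask (an assumption of this sketch);
    keypoints marked False are excluded from the average.
    """
    pred = np.asarray(pred_boxes, dtype=np.float64)
    target = np.asarray(target_boxes, dtype=np.float64)
    per_keypoint = np.abs(pred - target).sum(axis=1)
    if visible is None:
        return float(per_keypoint.mean())
    visible = np.asarray(visible, dtype=bool)
    if not visible.any():
        return 0.0
    return float(per_keypoint[visible].mean())

# Two target keypoints and a prediction that is off by 0.1 in x for the first.
target = keypoints_to_boxes([[0.4, 0.5], [0.7, 0.2]])
pred = keypoints_to_boxes([[0.5, 0.5], [0.7, 0.2]])
loss = l1_regression_loss(pred, target)
```

Because the loss is computed directly on box coordinates, no heatmap needs to be rendered or decoded, which is the property the abstract highlights.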
Related papers
- Disentangled Pre-training for Human-Object Interaction Detection [22.653500926559833]
We propose an efficient disentangled pre-training method for HOI detection (DP-HOI).
DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers.
It significantly enhances the performance of existing HOI detection models on a broad range of rare categories.
arXiv Detail & Related papers (2024-04-02T08:21:16Z)
- Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision [81.60564776995682]
We present Point2RBox, an end-to-end solution for point-supervised object detection.
Our method uses a lightweight paradigm, yet it achieves a competitive performance among point-supervised alternatives.
arXiv Detail & Related papers (2023-11-23T15:57:41Z)
- Rethinking Keypoint Representations: Modeling Keypoints and Poses as Objects for Multi-Person Human Pose Estimation [79.78017059539526]
We propose a new heatmap-free keypoint estimation method in which individual keypoints and sets of spatially related keypoints (i.e., poses) are modeled as objects within a dense single-stage anchor-based detection framework.
In experiments, we observe that KAPAO is significantly faster and more accurate than previous methods, which suffer greatly from heatmap post-processing.
Our large model, KAPAO-L, achieves an AP of 70.6 on the Microsoft COCO Keypoints validation set without test-time augmentation.
arXiv Detail & Related papers (2021-11-16T15:36:44Z)
- 6D Object Pose Estimation using Keypoints and Part Affinity Fields [24.126513851779936]
The task of 6D object pose estimation from RGB images is an important requirement for autonomous service robots to be able to interact with the real world.
We present a two-step pipeline for estimating the 6 DoF translation and orientation of known objects.
arXiv Detail & Related papers (2021-07-05T14:41:19Z)
- A Global to Local Double Embedding Method for Multi-person Pose Estimation [10.05687757555923]
We present a novel method to simplify the pipeline by implementing person detection and joints detection simultaneously.
We propose a Double Embedding (DE) method to complete the multi-person pose estimation task in a global-to-local way.
We achieve competitive results on the MSCOCO, MPII and CrowdPose benchmarks, demonstrating the effectiveness and generalization ability of our method.
arXiv Detail & Related papers (2021-02-15T03:13:38Z)
- Structure-Consistent Weakly Supervised Salient Object Detection with Local Saliency Coherence [14.79639149658596]
We propose a one-round end-to-end training approach for weakly supervised salient object detection via scribble annotations.
Our method achieves a new state-of-the-art performance on six benchmarks.
arXiv Detail & Related papers (2020-12-08T12:49:40Z)
- Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation [85.96410825961966]
We argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries.
To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions.
We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation.
arXiv Detail & Related papers (2020-07-06T15:59:56Z)
- Detection in Crowded Scenes: One Proposal, Multiple Predictions [79.28850977968833]
We propose a proposal-based object detector, aiming at detecting highly-overlapped instances in crowded scenes.
The key to our approach is to let each proposal predict a set of correlated instances rather than a single one, as in previous proposal-based frameworks.
Our detector obtains 4.9% AP gains on the challenging CrowdHuman dataset and 1.0% $\text{MR}^{-2}$ improvements on the CityPersons dataset.
arXiv Detail & Related papers (2020-03-20T09:48:53Z)
- EHSOD: CAM-Guided End-to-end Hybrid-Supervised Object Detection with Cascade Refinement [53.69674636044927]
We present EHSOD, an end-to-end hybrid-supervised object detection system.
It can be trained in one shot on both fully and weakly-annotated data.
It achieves comparable results on multiple object detection benchmarks with only 30% fully-annotated data.
arXiv Detail & Related papers (2020-02-18T08:04:58Z)
- PPDM: Parallel Point Detection and Matching for Real-time Human-Object Interaction Detection [85.75935399090379]
We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset at 37 fps.
It is the first real-time HOI detection method.
arXiv Detail & Related papers (2019-12-30T12:00:55Z)