Learning Quality-aware Representation for Multi-person Pose Regression
- URL: http://arxiv.org/abs/2201.01087v1
- Date: Tue, 4 Jan 2022 11:10:28 GMT
- Title: Learning Quality-aware Representation for Multi-person Pose Regression
- Authors: Yabo Xiao, Dongdong Yu, Xiaojuan Wang, Lei Jin, Guoli Wang, Qian Zhang
- Abstract summary: We learn the pose regression quality-aware representation.
Our method achieves the state-of-the-art result of 71.7 AP on the MS COCO test-dev set.
- Score: 8.83185608408674
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Off-the-shelf single-stage multi-person pose regression methods generally
leverage the instance score (i.e., confidence of the instance localization) to
indicate the pose quality for selecting the pose candidates. We consider that
there are two gaps in the existing paradigm: 1) The instance score is not
well correlated with the pose regression quality. 2) The instance feature
representation, which is used for predicting the instance score, does not
explicitly encode the structural pose information to predict a reasonable
score that represents pose regression quality. To address the aforementioned
issues, we propose to learn the pose regression quality-aware representation.
Concretely, for the first gap, instead of using the previous instance
confidence label (e.g., discrete {1,0} or Gaussian representation) to denote
the position and confidence of a person instance, we first introduce the
Consistent Instance Representation (CIR), which unifies the pose regression
quality score of the instance and the confidence of the background into a
pixel-wise score map to calibrate the inconsistency between instance score and
pose regression quality. To fill the second gap, we further present the Query
Encoding Module (QEM) including the Keypoint Query Encoding (KQE) to encode the
positional and semantic information for each keypoint and the Pose Query
Encoding (PQE) which explicitly encodes the predicted structural pose
information to better fit the Consistent Instance Representation (CIR). By
using the proposed components, we significantly alleviate the above gaps. Our
method outperforms previous single-stage regression-based methods, and even
bottom-up methods, and achieves the state-of-the-art result of 71.7 AP on the
MS COCO test-dev set.
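The idea behind CIR, a pixel-wise target that carries a pose quality score at instance locations and zero on the background, can be illustrated with a minimal sketch. The abstract does not say how the quality score is measured or where it is written into the map, so everything here is an assumption: the score is taken to be the COCO-style Object Keypoint Similarity (OKS) between the regressed pose and the ground truth, it is written at a single center pixel per person, and the names `oks` and `quality_score_map` are hypothetical.

```python
import math


def oks(pred, gt, visible, area, sigmas):
    """COCO-style Object Keypoint Similarity between two poses.

    pred, gt: lists of (x, y) keypoints; visible: per-keypoint flags (> 0 = labeled);
    area: ground-truth instance area (assumed positive); sigmas: per-keypoint constants.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, s in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue  # unlabeled keypoints do not contribute
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * (2.0 * s) ** 2))
        count += 1
    return total / count if count else 0.0


def quality_score_map(height, width, centers, qualities):
    """Assumed CIR-like target: quality score at instance pixels, 0 on background."""
    target = [[0.0] * width for _ in range(height)]
    for (x, y), q in zip(centers, qualities):
        # If two instances share a pixel, keep the higher quality score.
        target[y][x] = max(target[y][x], q)
    return target


# Usage: one person whose first regressed keypoint is one pixel off.
gt_pose = [(10.0, 10.0), (20.0, 20.0)]
pred_pose = [(11.0, 10.0), (20.0, 20.0)]
q = oks(pred_pose, gt_pose, visible=[1, 1], area=100.0, sigmas=[0.05, 0.05])
target = quality_score_map(3, 4, centers=[(1, 2)], qualities=[q])
```

Under these assumptions, a score head trained against `target` would learn a score that tracks how well the pose was regressed, rather than a fixed {1,0} or Gaussian confidence label.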
Related papers
- Regression-free Blind Image Quality Assessment with Content-Distortion Consistency [42.683300312253884]
We propose a regression-free framework for image quality evaluation.
It is based upon retrieving locally similar instances by incorporating semantic and distortion feature spaces.
The proposed method achieves competitive, even superior performance compared to state-of-the-art regression-based methods.
arXiv Detail & Related papers (2023-07-18T14:19:28Z)
- DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding [34.078590816368056]
We study the problem of visual grounding by considering both phrase extraction and grounding (PEG).
PEG requires a model to extract phrases from text and locate objects from images simultaneously.
We propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text.
arXiv Detail & Related papers (2022-11-28T16:30:46Z)
- SSP-Pose: Symmetry-Aware Shape Prior Deformation for Direct Category-Level Object Pose Estimation [77.88624073105768]
Category-level pose estimation is a challenging problem due to intra-class shape variations.
We propose an end-to-end trainable network SSP-Pose for category-level pose estimation.
SSP-Pose produces superior performance compared with competitors with a real-time inference speed at about 25Hz.
arXiv Detail & Related papers (2022-08-13T14:37:31Z)
- Action Quality Assessment with Temporal Parsing Transformer [84.1272079121699]
Action Quality Assessment (AQA) is important for action understanding, and resolving the task poses unique challenges due to subtle visual differences.
We propose a temporal parsing transformer to decompose the holistic feature into temporal part-level representations.
Our proposed method outperforms prior work on three public AQA benchmarks by a considerable margin.
arXiv Detail & Related papers (2022-07-19T13:29:05Z)
- Poseur: Direct Human Pose Regression with Transformers [119.79232258661995]
We propose a direct, regression-based approach to 2D human pose estimation from single images.
Our framework is end-to-end differentiable, and naturally learns to exploit the dependencies between keypoints.
Ours is the first regression-based approach to perform favorably compared to the best heatmap-based pose estimation methods.
arXiv Detail & Related papers (2022-01-19T04:31:57Z)
- Bottom-Up Human Pose Estimation Via Disentangled Keypoint Regression [81.05772887221333]
We study the dense keypoint regression framework, which was previously inferior to the keypoint detection and grouping framework.
We present a simple yet effective approach, named disentangled keypoint regression (DEKR).
We empirically show that the proposed direct regression method outperforms keypoint detection and grouping methods.
arXiv Detail & Related papers (2021-04-06T05:54:46Z)
- Point-Set Anchors for Object Detection, Instance Segmentation and Pose Estimation [85.96410825961966]
We argue that the image features extracted at a central point contain limited information for predicting distant keypoints or bounding box boundaries.
To facilitate inference, we propose to instead perform regression from a set of points placed at more advantageous positions.
We apply this proposed framework, called Point-Set Anchors, to object detection, instance segmentation, and human pose estimation.
arXiv Detail & Related papers (2020-07-06T15:59:56Z)
- Bottom-Up Human Pose Estimation by Ranking Heatmap-Guided Adaptive Keypoint Estimates [76.51095823248104]
We present several schemes that have been rarely or insufficiently studied for improving keypoint detection and grouping (keypoint regression) performance.
First, we exploit the keypoint heatmaps for pixel-wise keypoint regression instead of separating them for improving keypoint regression.
Second, we adopt a pixel-wise spatial transformer network to learn adaptive representations for handling the scale and orientation variance.
Third, we present a joint shape and heatvalue scoring scheme to promote the estimated poses that are more likely to be true poses.
arXiv Detail & Related papers (2020-06-28T01:14:59Z)