DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation
- URL: http://arxiv.org/abs/2209.02431v1
- Date: Fri, 2 Sep 2022 10:18:26 GMT
- Title: DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation
- Authors: Shuaitao Zhao, Kun Liu, Yuhang Huang, Qian Bao, Dan Zeng, and Wu Liu
- Abstract summary: We propose a novel Dual-Pipeline Integrated Transformer (DPIT) for human pose estimation.
DPIT consists of two branches: the bottom-up branch processes the whole image to capture global visual information, while the top-down branch extracts local feature representations from single-human bounding boxes.
The feature representations extracted by the two branches are fed into a transformer encoder to fuse global and local knowledge interactively.
- Score: 24.082220581799156
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human pose estimation aims to localize the keypoints of all people in
different scenes. Despite promising results, current approaches still face some
challenges. Existing top-down methods process each person individually, without
modeling the interaction between different people or between people and the
scene they are situated in. Consequently, human detection performance degrades
when severe occlusion occurs. On the other hand, existing bottom-up methods
consider all people at the same time and capture the global knowledge of the
entire image. However, they are less accurate than top-down methods due to
scale variation. To address these problems, we propose a novel
Dual-Pipeline Integrated Transformer (DPIT) by integrating top-down and
bottom-up pipelines to explore the visual clues of different receptive fields
and achieve their complementarity. Specifically, DPIT consists of two branches:
the bottom-up branch processes the whole image to capture global visual
information, while the top-down branch extracts local visual feature
representations from the single-human bounding box. The extracted feature
representations from the bottom-up and top-down branches are then fed into the
transformer encoder to fuse the global and local knowledge interactively.
Moreover, we define keypoint queries to explore both full-scene and
single-human posture visual clues, realizing the mutual complementarity of the
two pipelines. To the best of our knowledge, this is one of the first works to
integrate the bottom-up and top-down pipelines with transformers for human pose
estimation. Extensive experiments on COCO and MPII datasets demonstrate that
our DPIT achieves comparable performance to the state-of-the-art methods.
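To make the architecture concrete, below is a minimal PyTorch sketch of the dual-pipeline idea: global tokens from a bottom-up branch over the whole image, local tokens from a top-down branch over a person crop, and learnable keypoint queries fused in one transformer encoder. This is not the authors' implementation; the convolutional stand-ins for the backbones, the feature dimension, encoder depth, and the coordinate-regression head are all illustrative assumptions.

import torch
import torch.nn as nn

class DualPipelineFusion(nn.Module):
    """A sketch of DPIT-style fusion; all hyperparameters are assumptions."""
    def __init__(self, dim=256, num_keypoints=17, num_layers=4, num_heads=8):
        super().__init__()
        # Stand-ins for the two backbones: any CNN yielding dim-channel maps works.
        self.bottom_up = nn.Conv2d(3, dim, kernel_size=16, stride=16)  # whole image -> global tokens
        self.top_down = nn.Conv2d(3, dim, kernel_size=16, stride=16)   # person crop -> local tokens
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # One learnable query per keypoint, per the abstract's "keypoint queries".
        self.keypoint_queries = nn.Parameter(torch.randn(num_keypoints, dim))
        self.head = nn.Linear(dim, 2)  # regress (x, y) per keypoint query

    def forward(self, image, person_crop):
        g = self.bottom_up(image).flatten(2).transpose(1, 2)       # (B, Ng, dim) global tokens
        l = self.top_down(person_crop).flatten(2).transpose(1, 2)  # (B, Nl, dim) local tokens
        q = self.keypoint_queries.unsqueeze(0).expand(g.size(0), -1, -1)
        # Joint self-attention lets the queries mix full-scene and single-human clues.
        fused = self.encoder(torch.cat([q, g, l], dim=1))
        return self.head(fused[:, : q.size(1)])                    # (B, num_keypoints, 2)

For example, DualPipelineFusion()(torch.randn(1, 3, 256, 256), torch.randn(1, 3, 256, 192)) returns a (1, 17, 2) tensor of keypoint coordinates for the cropped person.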
Related papers
- AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [55.179287851188036]
We introduce a novel all-in-one-stage framework, AiOS, for expressive human pose and shape recovery without an additional human detection step.
We first employ a human token to probe a human location in the image and encode global features for each instance.
Then, we introduce a joint-related token to probe the human joints in the image and encode fine-grained local features.
arXiv Detail & Related papers (2024-03-26T17:59:23Z) - Disentangled Interaction Representation for One-Stage Human-Object
Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z) - Learning Feature Recovery Transformer for Occluded Person
Re-identification [71.18476220969647]
We propose a new approach called Feature Recovery Transformer (FRT) to address two challenges in occluded person re-identification simultaneously.
To reduce the interference of the noise during feature matching, we mainly focus on visible regions that appear in both images and develop a visibility graph to calculate the similarity.
In terms of the second challenge, based on the developed graph similarity, for each query image, we propose a recovery transformer that exploits the feature sets of its $k$-nearest neighbors in the gallery to recover the complete features.
arXiv Detail & Related papers (2023-01-05T02:36:16Z) - Sequential Transformer for End-to-End Person Search [4.920657401819193]
Person search aims to simultaneously localize and recognize a target person from realistic and uncropped gallery images.
In this paper, we propose a novel Sequential Transformer (SeqTR) for end-to-end person search to deal with this challenge.
Our SeqTR contains a detection transformer and a novel re-ID transformer that sequentially addresses detection and re-ID tasks.
arXiv Detail & Related papers (2022-11-06T09:32:30Z) - Cascade Transformers for End-to-End Person Search [18.806369852341334]
We propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search.
COAT focuses on detecting people in the first stage, while later stages simultaneously and progressively refine the representation for person detection and re-identification.
We demonstrate the benefits of our method by achieving state-of-the-art performance on two benchmark datasets.
arXiv Detail & Related papers (2022-03-17T22:42:12Z) - Monocular 3D Multi-Person Pose Estimation by Integrating Top-Down and
Bottom-Up Networks [33.974241749058585]
In multi-person pose estimation, human detection can be erroneous and human-joint grouping can be unreliable.
Existing top-down methods rely on human detection and thus suffer from these problems.
We propose the integration of top-down and bottom-up approaches to exploit their strengths.
arXiv Detail & Related papers (2021-04-05T07:05:21Z) - End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image (a minimal sketch of such matching appears after this list).
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z) - A Global to Local Double Embedding Method for Multi-person Pose
Estimation [10.05687757555923]
We present a novel method that simplifies the pipeline by performing person detection and joint detection simultaneously.
We propose a Double Embedding (DE) method to complete the multi-person pose estimation task in a global-to-local way.
We achieve competitive results on the MSCOCO, MPII, and CrowdPose benchmarks, demonstrating the effectiveness and generalization ability of our method.
arXiv Detail & Related papers (2021-02-15T03:13:38Z) - AdaFuse: Adaptive Multiview Fusion for Accurate Human Pose Estimation in
the Wild [77.43884383743872]
We present AdaFuse, an adaptive multiview fusion method to enhance the features in occluded views.
We extensively evaluate the approach on three public datasets including Human3.6M, Total Capture and CMU Panoptic.
We also create a large scale synthetic dataset Occlusion-Person, which allows us to perform numerical evaluation on the occluded joints.
arXiv Detail & Related papers (2020-10-26T03:19:46Z) - Gradient-Induced Co-Saliency Detection [81.54194063218216]
Co-saliency detection (Co-SOD) aims to segment the common salient foreground in a group of relevant images.
In this paper, inspired by human behavior, we propose a gradient-induced co-saliency detection method.
arXiv Detail & Related papers (2020-04-28T08:40:55Z)