Cascade Transformers for End-to-End Person Search
- URL: http://arxiv.org/abs/2203.09642v1
- Date: Thu, 17 Mar 2022 22:42:12 GMT
- Title: Cascade Transformers for End-to-End Person Search
- Authors: Rui Yu, Dawei Du, Rodney LaLonde, Daniel Davila, Christopher Funk,
Anthony Hoogs, Brian Clipp
- Abstract summary: We propose the Cascade Occluded Attention Transformer (COAT) for end-to-end person search.
COAT focuses on detecting people in the first stage, while later stages simultaneously and progressively refine the representation for person detection and re-identification.
We demonstrate the benefits of our method by achieving state-of-the-art performance on two benchmark datasets.
- Score: 18.806369852341334
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The goal of person search is to localize a target person from a gallery set
of scene images, which is extremely challenging due to large scale variations,
pose/viewpoint changes, and occlusions. In this paper, we propose the Cascade
Occluded Attention Transformer (COAT) for end-to-end person search. Our
three-stage cascade design focuses on detecting people in the first stage,
while later stages simultaneously and progressively refine the representation
for person detection and re-identification. At each stage the occluded
attention transformer applies tighter intersection over union thresholds,
forcing the network to learn coarse-to-fine pose/scale invariant features.
Meanwhile, we calculate each detection's occluded attention to differentiate a
person's tokens from other people or the background. In this way, we simulate
the effect of other objects occluding a person of interest at the token-level.
Through comprehensive experiments, we demonstrate the benefits of our method by
achieving state-of-the-art performance on two benchmark datasets.
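The cascade's coarse-to-fine training signal can be illustrated with a minimal sketch (hypothetical thresholds and helper names, not the paper's actual implementation): each stage labels a proposal as positive only if its IoU with a ground-truth box exceeds a progressively tighter threshold, so later stages are supervised on increasingly well-localized boxes.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical per-stage IoU thresholds: each cascade stage demands a
# tighter overlap with the ground truth, forcing coarse-to-fine
# refinement of the proposals.
STAGE_THRESHOLDS = [0.5, 0.6, 0.7]

def assign_positives(proposals, gt_boxes, stage):
    """Keep only proposals that overlap some ground-truth box above
    the given stage's threshold; these supervise that stage's heads."""
    thr = STAGE_THRESHOLDS[stage]
    return [p for p in proposals
            if any(iou(p, g) >= thr for g in gt_boxes)]
```

In a cascade detector of this kind, the refined outputs of stage t are re-assigned at stage t+1 under the next, stricter threshold, so a box that barely qualified early on must be tightened to survive.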
Related papers
- AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [55.179287851188036]

We introduce a novel all-in-one-stage framework, AiOS, for expressive human pose and shape recovery without an additional human detection step.
We first employ a human token to probe a human location in the image and encode global features for each instance.
Then, we introduce a joint-related token to probe the human joint in the image and encode a fine-grained local feature.
arXiv Detail & Related papers (2024-03-26T17:59:23Z)
- Learning Feature Recovery Transformer for Occluded Person Re-identification [71.18476220969647]
We propose a new approach called Feature Recovery Transformer (FRT) to address the two challenges simultaneously.
To reduce the interference of the noise during feature matching, we mainly focus on visible regions that appear in both images and develop a visibility graph to calculate the similarity.
In terms of the second challenge, based on the developed graph similarity, for each query image, we propose a recovery transformer that exploits the feature sets of its $k$-nearest neighbors in the gallery to recover the complete features.
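The recovery idea can be sketched as follows (a simplified illustration with made-up names; FRT itself learns the aggregation with a transformer and a graph-based similarity): an occluded query feature is completed by aggregating the features of its k nearest neighbors in the gallery.

```python
import numpy as np

def knn_recover(query, gallery, k=3):
    """Complete an occluded query feature by averaging its k nearest
    gallery features (nearness measured by Euclidean distance here).
    Plain averaging stands in for FRT's learned recovery transformer,
    purely to illustrate the idea."""
    dists = np.linalg.norm(gallery - query, axis=1)
    nearest = np.argsort(dists)[:k]
    recovered = gallery[nearest].mean(axis=0)
    # Blend the original (possibly occluded) query with the estimate
    # recovered from its neighbors.
    return 0.5 * (query + recovered)
```
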
arXiv Detail & Related papers (2023-01-05T02:36:16Z)
- Sequential Transformer for End-to-End Person Search [4.920657401819193]
Person search aims to simultaneously localize and recognize a target person from realistic and uncropped gallery images.
In this paper, we propose a novel Sequential Transformer (SeqTR) for end-to-end person search to deal with this challenge.
Our SeqTR contains a detection transformer and a novel re-ID transformer that sequentially addresses detection and re-ID tasks.
arXiv Detail & Related papers (2022-11-06T09:32:30Z)
- DPIT: Dual-Pipeline Integrated Transformer for Human Pose Estimation [24.082220581799156]
We propose a novel Dual-Pipeline Integrated Transformer (DPIT) for human pose estimation.
DPIT consists of two branches; the bottom-up branch processes the whole image to capture global visual information.
The extracted feature representations from bottom-up and top-down branches are fed into the transformer encoder to fuse the global and local knowledge interactively.
arXiv Detail & Related papers (2022-09-02T10:18:26Z)
- Active Gaze Control for Foveal Scene Exploration [124.11737060344052]
We propose a methodology to emulate how humans and robots with foveal cameras would explore a scene.
The proposed method achieves an increase in detection F1-score of 2-3 percentage points for the same number of gaze shifts.
arXiv Detail & Related papers (2022-08-24T14:59:28Z)
- Human-Object Interaction Detection via Disentangled Transformer [63.46358684341105]
We present Disentangled Transformer, where both encoder and decoder are disentangled to facilitate learning of two sub-tasks.
Our method outperforms prior work on two public HOI benchmarks by a sizeable margin.
arXiv Detail & Related papers (2022-04-20T08:15:04Z)
- Motion-Aware Transformer For Occluded Person Re-identification [1.9899263094148867]
We propose a self-supervised deep learning method to improve the localization of human body parts for occluded person Re-ID.
Unlike previous works, we find that motion information derived from the photos of various human postures can help identify major human body components.
arXiv Detail & Related papers (2022-02-09T02:53:10Z)
- End-to-End Trainable Multi-Instance Pose Estimation with Transformers [68.93512627479197]
We propose a new end-to-end trainable approach for multi-instance pose estimation by combining a convolutional neural network with a transformer.
Inspired by recent work on end-to-end trainable object detection with transformers, we use a transformer encoder-decoder architecture together with a bipartite matching scheme to directly regress the pose of all individuals in a given image.
Our model, called POse Estimation Transformer (POET), is trained using a novel set-based global loss that consists of a keypoint loss, a keypoint visibility loss, a center loss and a class loss.
arXiv Detail & Related papers (2021-03-22T18:19:22Z)
- Diverse Knowledge Distillation for End-to-End Person Search [81.4926655119318]
Person search aims to localize and identify a specific person from a gallery of images.
Recent methods can be categorized into two groups, i.e., two-step and end-to-end approaches.
We propose a simple yet strong end-to-end network with diverse knowledge distillation to break the bottleneck.
arXiv Detail & Related papers (2020-12-21T09:04:27Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.