OH-Former: Omni-Relational High-Order Transformer for Person
Re-Identification
- URL: http://arxiv.org/abs/2109.11159v1
- Date: Thu, 23 Sep 2021 06:11:38 GMT
- Title: OH-Former: Omni-Relational High-Order Transformer for Person
Re-Identification
- Authors: Xianing Chen, Jialang Xu, Jiale Xu, Shenghua Gao
- Abstract summary: We propose an Omni-Relational High-Order Transformer (OH-Former) to model omni-relational features for person re-identification (ReID).
Experimental results show that our model achieves state-of-the-art performance on the Market-1501, DukeMTMC, MSMT17 and Occluded-Duke datasets.
- Score: 30.023365814501137
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformers have shown preferable performance on many vision tasks. However,
for the task of person re-identification (ReID), vanilla transformers leave the
rich contexts on high-order feature relations under-exploited and deteriorate
local feature details, which are insufficient due to the dramatic variations of
pedestrians. In this work, we propose an Omni-Relational High-Order Transformer
(OH-Former) to model omni-relational features for ReID. First, to strengthen
the capacity of visual representation, instead of obtaining the attention
matrix based on pairs of queries and isolated keys at each spatial location, we
take a step further to model high-order statistics information for the
non-local mechanism. We share the attention weights in the corresponding layer
of each order with a prior mixing mechanism to reduce the computation cost.
Then, a convolution-based local relation perception module is proposed to
extract the local relations and 2D position information. Experimental results
show that our model achieves state-of-the-art performance on the Market-1501,
DukeMTMC, MSMT17 and Occluded-Duke datasets.
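As a concrete illustration of the two mechanisms described in the abstract, the following is a minimal PyTorch sketch of (i) attention that aggregates higher-order statistics of the values while reusing a single attention map across orders, blended by learnable mixing weights, and (ii) a convolution-based branch that injects local relations and 2D position cues. The module names HighOrderAttention and LocalRelationPerception, the element-wise powers of the values used as a stand-in for "high-order statistics", and the learnable order-mixing weights are assumptions made for illustration, not the paper's exact OH-Former design.

```python
# Hedged sketch of the ideas in the abstract, not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HighOrderAttention(nn.Module):
    def __init__(self, dim: int, num_orders: int = 2):
        super().__init__()
        self.scale = dim ** -0.5
        self.num_orders = num_orders
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Learnable weights that blend the per-order outputs (an assumed
        # stand-in for the paper's prior mixing mechanism).
        self.order_mix = nn.Parameter(torch.ones(num_orders) / num_orders)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        # The same attention map is shared by every order, so higher orders
        # add no extra attention computation.
        outs, moment = [], v
        for _ in range(self.num_orders):
            outs.append(attn @ moment)
            moment = moment * v  # next-order statistics of the values
        mix = F.softmax(self.order_mix, dim=0)
        out = sum(w * o for w, o in zip(mix, outs))
        return self.proj(out)


class LocalRelationPerception(nn.Module):
    """Depthwise-conv branch that adds local relations / 2D position cues."""

    def __init__(self, dim: int):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, tokens, dim) with tokens == h * w
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)
        return x + self.dwconv(grid).flatten(2).transpose(1, 2)


if __name__ == "__main__":
    feats = torch.randn(2, 16 * 8, 256)            # e.g. a 16x8 patch grid
    feats = HighOrderAttention(256)(feats)
    feats = LocalRelationPerception(256)(feats, h=16, w=8)
    print(feats.shape)                             # torch.Size([2, 128, 256])
```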
Related papers
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR [50.435592120607815]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z) - Transferring Modality-Aware Pedestrian Attentive Learning for
Visible-Infrared Person Re-identification [43.05147831905626]
We propose a novel Transferring Modality-Aware Pedestrian Attentive Learning (TMPA) model.
TMPA focuses on the pedestrian regions to efficiently compensate for missing modality-specific features.
Experiments conducted on the benchmark SYSU-MM01 and RegDB datasets demonstrate the effectiveness of our proposed TMPA model.
arXiv Detail & Related papers (2023-12-12T07:15:17Z) - DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - Attention Map Guided Transformer Pruning for Edge Device [98.42178656762114]
Vision transformer (ViT) has achieved promising success in both holistic and occluded person re-identification (Re-ID) tasks.
We propose a novel attention map guided (AMG) transformer pruning method, which removes both redundant tokens and heads.
Comprehensive experiments on Occluded DukeMTMC and Market-1501 demonstrate the effectiveness of our proposals.
arXiv Detail & Related papers (2023-04-04T01:51:53Z) - Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way (a minimal sketch of this idea appears after this list).
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
arXiv Detail & Related papers (2022-01-03T08:29:01Z) - Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
We propose Combiner, which provides full attention capability in each attention head while maintaining low computation complexity.
We show that most sparse attention patterns used in existing sparse transformers are able to inspire the design of such factorization for full attention.
An experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
arXiv Detail & Related papers (2021-07-12T22:43:11Z) - Robust Person Re-Identification through Contextual Mutual Boosting [77.1976737965566]
We propose the Contextual Mutual Boosting Network (CMBN), which localizes pedestrians and recalibrates features by effectively exploiting contextual information and statistical inference.
Experiments on the benchmarks demonstrate the superiority of the architecture compared with the state-of-the-art.
arXiv Detail & Related papers (2020-09-16T06:33:35Z) - Hierarchical Bi-Directional Feature Perception Network for Person
Re-Identification [12.259747100939078]
Previous Person Re-Identification (Re-ID) models aim to focus on the most discriminative region of an image.
We propose a novel model named Hierarchical Bi-directional Feature Perception Network (HBFP-Net) to correlate multi-level information so that different levels reinforce each other.
Experiments on mainstream benchmarks, including the Market-1501, CUHK03 and DukeMTMC-ReID datasets, show that our method outperforms recent SOTA Re-ID models.
arXiv Detail & Related papers (2020-08-08T12:33:32Z) - Augmented Parallel-Pyramid Net for Attention Guided Pose-Estimation [90.28365183660438]
This paper proposes an augmented parallel-pyramid net with attention partial module and differentiable auto-data augmentation.
We define a new pose search space where the sequences of data augmentations are formulated as a trainable and operational CNN component.
Notably, our method achieves top-1 accuracy on the challenging COCO keypoint benchmark and state-of-the-art results on the MPII dataset.
arXiv Detail & Related papers (2020-03-17T03:52:17Z)
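Referenced from the Vision Transformer with Deformable Attention entry above, the following is a minimal PyTorch sketch of selecting key and value positions in a data-dependent way. A single attention head, a small regular grid of reference points, offsets predicted from a mean-pooled query, and bilinear sampling via F.grid_sample are illustrative simplifications; the module name DeformableAttentionSketch and these choices are assumptions, not the DAT/DAT++ implementation.

```python
# Minimal, assumption-laden sketch of data-dependent key/value sampling in the
# spirit of deformable attention; not the papers' actual implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DeformableAttentionSketch(nn.Module):
    def __init__(self, dim: int, grid_size: int = 4):
        super().__init__()
        self.grid_size = grid_size
        self.num_points = grid_size * grid_size
        self.scale = dim ** -0.5
        self.to_q = nn.Linear(dim, dim)
        self.to_kv = nn.Linear(dim, dim * 2)
        # Predicts one 2D offset per reference point from a pooled query.
        self.to_offsets = nn.Linear(dim, self.num_points * 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, dim, h, w) feature map
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                # (b, h*w, c)
        q = self.to_q(tokens)

        # Regular reference points in [-1, 1], shared across the batch.
        lin = torch.linspace(-1, 1, self.grid_size, device=x.device)
        ys, xs = torch.meshgrid(lin, lin, indexing="ij")
        ref = torch.stack([xs, ys], dim=-1).reshape(1, -1, 2)

        # Data-dependent offsets shift where keys/values are sampled from.
        offsets = torch.tanh(self.to_offsets(tokens.mean(dim=1)))
        grid = (ref + offsets.view(b, self.num_points, 2)).clamp(-1, 1)

        # Bilinearly sample features at the shifted locations.
        sampled = F.grid_sample(
            x, grid.view(b, self.num_points, 1, 2), align_corners=False
        ).squeeze(-1).transpose(1, 2)                        # (b, points, c)

        k, v = self.to_kv(sampled).chunk(2, dim=-1)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        out = attn @ v                                       # (b, h*w, c)
        return out.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    fmap = torch.randn(2, 64, 16, 8)
    print(DeformableAttentionSketch(64)(fmap).shape)   # torch.Size([2, 64, 16, 8])
```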