Person Re-Identification with a Locally Aware Transformer
- URL: http://arxiv.org/abs/2106.03720v2
- Date: Tue, 8 Jun 2021 17:59:48 GMT
- Title: Person Re-Identification with a Locally Aware Transformer
- Authors: Charu Sharma, Siddhant R. Kapil, David Chapman
- Abstract summary: We propose a novel Locally Aware Transformer (LA-Transformer) that employs a Parts-based Convolution Baseline (PCB)-inspired strategy for aggregating globally enhanced local classification tokens.
LA-Transformer with blockwise fine-tuning achieves rank-1 accuracy of $98.27\%$ with standard deviation of $0.13$ on the Market-1501 dataset and $98.7\%$ with standard deviation of $0.2$ on the CUHK03 dataset, respectively.
- Score: 9.023847175654602
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Person Re-Identification is an important problem in computer vision-based
surveillance applications, in which the goal is to identify the same person
across surveillance photographs taken in a variety of nearby zones. At
present, the majority of Person re-ID techniques are based on Convolutional
Neural Networks (CNNs), but Vision Transformers are beginning to displace pure
CNNs for a variety of object recognition tasks. The primary output of a vision
transformer is a global classification token, but vision transformers also
yield local tokens which contain additional information about local regions of
the image. Techniques to make use of these local tokens to improve
classification accuracy are an active area of research. We propose a novel
Locally Aware Transformer (LA-Transformer) that employs a Parts-based
Convolution Baseline (PCB)-inspired strategy for aggregating globally enhanced
local classification tokens into an ensemble of $\sqrt{N}$ classifiers, where
$N$ is the number of patches. An additional novelty is that we incorporate
blockwise fine-tuning which further improves re-ID accuracy. LA-Transformer
with blockwise fine-tuning achieves rank-1 accuracy of $98.27\%$ with standard
deviation of $0.13$ on the Market-1501 dataset and $98.7\%$ with standard deviation of
$0.2$ on the CUHK03 dataset, respectively, outperforming all other
state-of-the-art published methods at the time of writing.
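As a concrete reading of the architecture described above, the following PyTorch sketch shows one way to aggregate globally enhanced local tokens into an ensemble of $\sqrt{N}$ classifiers and to schedule blockwise fine-tuning. It is a minimal sketch based only on the abstract: the names (LATransformerHead, blockwise_unfreeze), the fixed blending weight global_weight, the row-wise pooling of the patch grid, and the unfreezing schedule are illustrative assumptions, not the authors' exact formulation.

```python
import math
import torch
import torch.nn as nn

class LATransformerHead(nn.Module):
    """Sketch of a PCB-style ensemble head over ViT tokens (assumed layout:
    a global [CLS] token plus N patch tokens on a sqrt(N) x sqrt(N) grid,
    e.g. N = 196 for a 224x224 input with 16x16 patches)."""

    def __init__(self, embed_dim, num_patches, num_classes, global_weight=0.5):
        super().__init__()
        self.rows = int(math.isqrt(num_patches))          # sqrt(N) row groups
        assert self.rows * self.rows == num_patches, "N must be a perfect square"
        self.global_weight = global_weight                # assumed blending weight
        # One identity classifier per row -> ensemble of sqrt(N) classifiers.
        self.classifiers = nn.ModuleList(
            [nn.Linear(embed_dim, num_classes) for _ in range(self.rows)]
        )

    def forward(self, cls_token, local_tokens):
        # cls_token: (B, D); local_tokens: (B, N, D)
        B, N, D = local_tokens.shape
        # "Globally enhance" every local token by blending in the [CLS] token.
        enhanced = (1.0 - self.global_weight) * local_tokens \
                   + self.global_weight * cls_token.unsqueeze(1)
        # Average-pool each row of the sqrt(N) x sqrt(N) patch grid.
        pooled = enhanced.view(B, self.rows, self.rows, D).mean(dim=2)
        # Independent logits from each row classifier.
        logits = [clf(pooled[:, i]) for i, clf in enumerate(self.classifiers)]
        return torch.stack(logits, dim=1)                 # (B, sqrt(N), num_classes)


def blockwise_unfreeze(vit_blocks, epoch, epochs_per_block=2):
    """Illustrative blockwise fine-tuning schedule: start with the backbone
    frozen and unfreeze transformer blocks one at a time, from the last block
    backwards, as training progresses (the exact schedule is an assumption)."""
    num_unfrozen = min(len(vit_blocks), epoch // epochs_per_block)
    for i, block in enumerate(vit_blocks):
        trainable = i >= len(vit_blocks) - num_unfrozen
        for p in block.parameters():
            p.requires_grad = trainable
```

In this reading, each of the $\sqrt{N}$ heads would be trained with its own identity loss, and the per-head features or logits averaged (or concatenated) at inference, mirroring the PCB recipe of part-wise classifiers.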
Related papers
- Global-Local Similarity for Efficient Fine-Grained Image Recognition with Vision Transformers [5.825612611197359]
Fine-grained recognition involves the classification of images from subordinate macro-categories.
We propose a novel and computationally inexpensive metric to identify discriminative regions in an image.
Our method achieves these results at a much lower computational cost compared to the alternatives.
arXiv Detail & Related papers (2024-07-17T10:04:54Z) - Leveraging Swin Transformer for Local-to-Global Weakly Supervised Semantic Segmentation [12.103012959947055]
This work explores the use of Swin Transformer by proposing "SWTformer" to enhance the accuracy of the initial seed CAMs.
SWTformer-V1 achieves localization accuracy 0.98% mAP higher than state-of-the-art models.
SWTformer-V2 incorporates a multi-scale feature fusion mechanism to extract additional information.
arXiv Detail & Related papers (2024-01-31T13:41:17Z) - Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z) - Part-Aware Transformer for Generalizable Person Re-identification [138.99827526048205]
Domain generalization person re-identification (DG-ReID) aims to train a model on source domains and generalize well on unseen domains.
We propose a pure Transformer model (termed Part-aware Transformer) for DG-ReID by designing a proxy task, named Cross-ID Similarity Learning (CSL)
This proxy task allows the model to learn generic features because it only cares about the visual similarity of the parts regardless of the ID labels.
arXiv Detail & Related papers (2023-08-07T06:15:51Z) - $R^{2}$Former: Unified $R$etrieval and $R$eranking Transformer for Place Recognition [92.56937383283397]
We propose a unified place recognition framework that handles both retrieval and reranking.
The proposed reranking module takes feature correlation, attention value, and xy coordinates into account.
$R^{2}$Former significantly outperforms state-of-the-art methods on major VPR datasets.
arXiv Detail & Related papers (2023-04-06T23:19:32Z) - Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z) - Dynamic Token Normalization Improves Vision Transformer [48.63155906080236]
We propose a new normalizer, termed Dynamic Token Normalization (DTN)
DTN learns to normalize tokens in both intra-token and inter-token manners.
It consistently outperforms baseline model with minimal extra parameters and computational overhead.
arXiv Detail & Related papers (2021-12-05T17:04:59Z) - Global Interaction Modelling in Vision Transformer via Super Tokens [20.700750237972155]
Window-based local attention is one of the major techniques being adopted in recent works.
We present a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention.
In standard image classification on Imagenet-1K, the proposed Super tokens based transformer (STT-S25) achieves 83.5% accuracy.
arXiv Detail & Related papers (2021-11-25T16:22:57Z) - Vision Transformer with Progressive Sampling [73.60630716500154]
We propose an iterative and progressive sampling strategy to locate discriminative regions.
When trained from scratch on ImageNet, PS-ViT performs 3.8% higher than the vanilla ViT in terms of top-1 accuracy.
arXiv Detail & Related papers (2021-08-03T18:04:31Z)
This list is automatically generated from the titles and abstracts of the papers listed on this site.