TransReID: Transformer-based Object Re-Identification
- URL: http://arxiv.org/abs/2102.04378v1
- Date: Mon, 8 Feb 2021 17:33:59 GMT
- Title: TransReID: Transformer-based Object Re-Identification
- Authors: Shuting He, Hao Luo, Pichao Wang, Fan Wang, Hao Li, Wei Jiang
- Abstract summary: The Vision Transformer (ViT), a pure transformer-based model, is explored for the object re-identification (ReID) task.
With several adaptations, a strong baseline ViT-BoT is constructed with ViT as the backbone.
We propose a pure-transformer framework dubbed TransReID, which is the first work to use a pure Transformer for ReID research.
- Score: 20.02035310635418
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this paper, we explore the Vision Transformer (ViT), a pure
transformer-based model, for the object re-identification (ReID) task. With
several adaptations, a strong baseline ViT-BoT is constructed with ViT as the
backbone, which achieves results comparable to convolutional neural network
(CNN) based frameworks on several ReID benchmarks. Furthermore, two modules
are designed in consideration of the specialties of ReID data: (1) It is
natural and simple for the Transformer to encode non-visual information such as
camera or viewpoint into vector embedding representations. By plugging in these
embeddings, ViT gains the ability to eliminate the bias caused by diverse
cameras or viewpoints. (2) We design a Jigsaw branch, parallel with the Global
branch, to facilitate the training of the model in a two-branch learning
framework. In the Jigsaw branch, a jigsaw patch module is designed to learn
robust feature representations and to help the training of the transformer by
shuffling the patches. With these novel modules, we propose a pure-transformer
framework dubbed TransReID, which, to the best of our knowledge, is the first
work to use a pure Transformer for ReID research. Experimental results of
TransReID are promising, achieving state-of-the-art performance on both person
and vehicle ReID benchmarks.
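The abstract describes two ReID-specific ideas: plugging non-visual side information (a camera or viewpoint ID) into the patch embeddings, and a Jigsaw branch that shuffles patch tokens during training. The PyTorch snippet below is a minimal sketch of those two ideas on top of a generic ViT-style token pipeline; all module names, shapes, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of the two ideas named in the abstract:
# (1) a learnable camera/viewpoint embedding added to patch embeddings,
# (2) a jigsaw-style shuffle of patch tokens for a parallel training branch.
import torch
import torch.nn as nn


class SideInfoJigsawSketch(nn.Module):
    """Hypothetical module: camera embedding plus jigsaw patch shuffling."""

    def __init__(self, num_patches=196, embed_dim=768, num_cameras=6):
        super().__init__()
        # (1) One learnable vector per camera ID, added to the patch embeddings
        #     so the backbone can learn to cancel camera-specific bias.
        self.camera_embed = nn.Parameter(torch.zeros(num_cameras, embed_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, patch_tokens, camera_ids, shuffle=True):
        # patch_tokens: (B, N, D) patch embeddings from a ViT patch projection
        # camera_ids:   (B,) integer camera index per image
        x = patch_tokens + self.pos_embed \
            + self.camera_embed[camera_ids].unsqueeze(1)

        if shuffle:
            # (2) Jigsaw branch: randomly permute patch tokens during training
            #     so this branch must learn features robust to a perturbed
            #     spatial layout of the patches.
            perm = torch.randperm(x.size(1), device=x.device)
            x = x[:, perm, :]
        return x


# Usage example (shapes and camera count are assumptions):
tokens = torch.randn(4, 196, 768)      # batch of 4 images, 196 patch tokens
cam_ids = torch.tensor([0, 2, 1, 5])   # camera index for each image
out = SideInfoJigsawSketch()(tokens, cam_ids)
print(out.shape)                       # torch.Size([4, 196, 768])
```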
Related papers
- Self-Supervised Pre-Training for Table Structure Recognition Transformer [25.04573593082671]
We propose a self-supervised pre-training (SSP) method for table structure recognition transformers.
We discover that the performance gap between the linear projection transformer and the hybrid CNN-transformer can be mitigated by SSP of the visual encoder in the TSR model.
arXiv Detail & Related papers (2024-02-23T19:34:06Z)
- Siamese DETR [87.45960774877798]
We present Siamese DETR, a self-supervised pretraining approach for the Transformer architecture in DETR.
We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks.
The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection.
arXiv Detail & Related papers (2023-03-31T15:29:25Z)
- BatchFormerV2: Exploring Sample Relationships for Dense Representation Learning [88.82371069668147]
BatchFormerV2 is a more general batch Transformer module, which enables exploring sample relationships for dense representation learning.
BatchFormerV2 consistently improves current DETR-based detection methods by over 1.3%.
arXiv Detail & Related papers (2022-04-04T05:53:42Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection.
Vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- SOTR: Segmenting Objects with Transformers [0.0]
We present a novel, flexible, and effective transformer-based model for high-quality instance segmentation.
The proposed method, Segmenting Objects with TRansformers (SOTR), simplifies the segmentation pipeline.
Our SOTR performs well on the MS COCO dataset and surpasses state-of-the-art instance segmentation approaches.
arXiv Detail & Related papers (2021-08-15T14:10:11Z)
- ResT: An Efficient Transformer for Visual Recognition [5.807423409327807]
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition.
We show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone.
arXiv Detail & Related papers (2021-05-28T08:53:54Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from "Vision-friendly Transformer".
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information and is not responsible for any consequences of its use.