HAT: Hierarchical Aggregation Transformers for Person Re-identification
- URL: http://arxiv.org/abs/2107.05946v2
- Date: Wed, 14 Jul 2021 01:42:35 GMT
- Title: HAT: Hierarchical Aggregation Transformers for Person Re-identification
- Authors: Guowen Zhang and Pingping Zhang and Jinqing Qi and Huchuan Lu
- Abstract summary: We take advantage of both CNNs and Transformers for image-based person Re-ID with high performance.
This work is the first to take advantage of both CNNs and Transformers for image-based person Re-ID.
- Score: 87.02828084991062
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, with the advance of deep Convolutional Neural Networks (CNNs),
person Re-Identification (Re-ID) has witnessed great success in various
applications. However, with limited receptive fields of CNNs, it is still
challenging to extract discriminative representations in a global view for
persons under non-overlapped cameras. Meanwhile, Transformers demonstrate
strong abilities of modeling long-range dependencies for spatial and sequential
data. In this work, we take advantage of both CNNs and Transformers, and
propose a novel learning framework named Hierarchical Aggregation Transformer
(HAT) for image-based person Re-ID with high performance. To achieve this goal,
we first propose a Deeply Supervised Aggregation (DSA) to recurrently aggregate
hierarchical features from CNN backbones. With multi-granularity supervisions,
the DSA can enhance multi-scale features for person retrieval, which is very
different from previous methods. Then, we introduce a Transformer-based Feature
Calibration (TFC) to integrate low-level detail information as the global prior
for high-level semantic information. The proposed TFC is inserted into each level
of hierarchical features, resulting in significant performance improvements. To the
best of our knowledge, this work is the first to take advantage of both CNNs and
Transformers for image-based person Re-ID. Comprehensive experiments on four
large-scale Re-ID benchmarks demonstrate that our method outperforms several
state-of-the-art methods. The code is released at
https://github.com/AI-Zhpp/HAT.
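The abstract describes the Transformer-based Feature Calibration (TFC) as injecting low-level detail as a global prior for high-level semantics. The snippet below is a toy numpy sketch of that cross-attention idea, not the authors' implementation (which is in the linked repository); all names and shapes here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key_value, d_k):
    # High-level features (query) attend over low-level detail (key/value),
    # so fine-grained information acts as a global prior for the semantics.
    scores = query @ key_value.T / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ key_value

# Toy hierarchical features, as if taken from two CNN backbone stages:
# (num_spatial_tokens, channel_dim) after flattening.
rng = np.random.default_rng(0)
low_level = rng.standard_normal((16, 64))   # fine spatial detail, more tokens
high_level = rng.standard_normal((4, 64))   # coarse semantics, fewer tokens

# Calibrate high-level features with low-level detail, then fuse residually
# (a stand-in for one step of the recurrent hierarchical aggregation).
calibrated = cross_attention(high_level, low_level, d_k=64)
aggregated = high_level + calibrated
print(aggregated.shape)  # (4, 64)
```

A real TFC block would add learned projections, multi-head attention, and normalization; the sketch only shows the direction of information flow between feature levels.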
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z)
- Synthesizing Efficient Data with Diffusion Models for Person Re-Identification Pre-Training [51.87027943520492]
We present a novel paradigm Diffusion-ReID to efficiently augment and generate diverse images based on known identities.
Benefiting from our proposed paradigm, we first create a new large-scale person Re-ID dataset Diff-Person, which consists of over 777K images from 5,183 identities.
arXiv Detail & Related papers (2024-06-10T06:26:03Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- ConvFormer: Combining CNN and Transformer for Medical Image Segmentation [17.88894109620463]
We propose a hierarchical CNN and Transformer hybrid architecture, called ConvFormer, for medical image segmentation.
Our ConvFormer, trained from scratch, outperforms various CNN- or Transformer-based architectures, achieving state-of-the-art performance.
arXiv Detail & Related papers (2022-11-15T23:11:22Z)
- HiFormer: Hierarchical Multi-scale Representations Using Transformers for Medical Image Segmentation [3.478921293603811]
HiFormer is a novel method that efficiently bridges a CNN and a transformer for medical image segmentation.
To secure a fine fusion of global and local features, we propose a Double-Level Fusion (DLF) module in the skip connection of the encoder-decoder structure.
arXiv Detail & Related papers (2022-07-18T11:30:06Z)
- Transformer-Guided Convolutional Neural Network for Cross-View Geolocalization [20.435023745201878]
We propose a novel Transformer-guided convolutional neural network (TransGCNN) architecture.
Our TransGCNN consists of a CNN backbone extracting feature map from an input image and a Transformer head modeling global context.
Experiments on popular benchmark datasets demonstrate that our model achieves top-1 accuracy of 94.12% and 84.92% on CVUSA and CVACT_val, respectively.
arXiv Detail & Related papers (2022-04-21T08:46:41Z)
- Rich CNN-Transformer Feature Aggregation Networks for Super-Resolution [50.10987776141901]
Recent vision transformers along with self-attention have achieved promising results on various computer vision tasks.
We introduce an effective hybrid architecture for super-resolution (SR) tasks, which leverages local features from CNNs and long-range dependencies captured by transformers.
Our proposed method achieves state-of-the-art SR results on numerous benchmark datasets.
arXiv Detail & Related papers (2022-03-15T06:52:25Z)
- CCTrans: Simplifying and Improving Crowd Counting with Transformer [7.597392692171026]
We propose a simple approach called CCTrans to simplify the design pipeline.
Specifically, we utilize a pyramid vision transformer backbone to capture the global crowd information.
Our method achieves new state-of-the-art results on several benchmarks both in weakly and fully-supervised crowd counting.
arXiv Detail & Related papers (2021-09-29T15:13:10Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- Learning Deep Interleaved Networks with Asymmetric Co-Attention for Image Restoration [65.11022516031463]
We present a deep interleaved network (DIN) that learns how information at different states should be combined for high-quality (HQ) image reconstruction.
In this paper, we propose asymmetric co-attention (AsyCA) which is attached at each interleaved node to model the feature dependencies.
Our presented DIN can be trained end-to-end and applied to various image restoration tasks.
arXiv Detail & Related papers (2020-10-29T15:32:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences arising from its use.