Spatiotemporal Transformer for Video-based Person Re-identification
- URL: http://arxiv.org/abs/2103.16469v1
- Date: Tue, 30 Mar 2021 16:19:27 GMT
- Title: Spatiotemporal Transformer for Video-based Person Re-identification
- Authors: Tianyu Zhang, Longhui Wei, Lingxi Xie, Zijie Zhuang, Yongfei Zhang, Bo Li, Qi Tian
- Abstract summary: We show that, despite the strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gain on three popular video-based person re-identification benchmarks.
- Score: 102.58619642363958
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recently, the Transformer module has been transplanted from natural language
processing to computer vision. This paper applies the Transformer to
video-based person re-identification, where the key issue is to extract the
discriminative information from a tracklet. We show that, despite the strong
learning ability, the vanilla Transformer suffers from an increased risk of
over-fitting, arguably due to a large number of attention parameters and
insufficient training data. To solve this problem, we propose a novel pipeline
where the model is pre-trained on a set of synthesized video data and then
transferred to the downstream domains with the perception-constrained
Spatiotemporal Transformer (STT) module and Global Transformer (GT) module. The
derived algorithm achieves significant accuracy gain on three popular
video-based person re-identification benchmarks, MARS, DukeMTMC-VideoReID, and
LS-VID, especially when the training and testing data are from different
domains. More importantly, our research sheds light on the application of the
Transformer on highly-structured visual data.
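For intuition, below is a minimal, hypothetical PyTorch sketch of the factorized spatial-then-temporal attention idea behind spatiotemporal Transformer modules for tracklets. The class name SpatialTemporalBlock, the tensor layout (batch, frames, patches, dim), and all hyperparameters are illustrative assumptions and do not reproduce the paper's actual STT or GT definitions or its perception constraints.
```python
# Hypothetical sketch only; not the paper's STT/GT implementation.
import torch
import torch.nn as nn


class SpatialTemporalBlock(nn.Module):
    """Factorized attention over a tracklet: spatial attention within each
    frame, then temporal attention across frames at each spatial location."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, patches, dim) patch features from a backbone.
        b, t, p, d = x.shape

        # Spatial attention: patches attend to one another within a frame.
        xs = self.norm1(x).reshape(b * t, p, d)
        x = x + self.spatial_attn(xs, xs, xs)[0].reshape(b, t, p, d)

        # Temporal attention: each spatial location attends across frames.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = self.temporal_attn(xt, xt, xt)[0]
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # Position-wise feed-forward network with a residual connection.
        return x + self.mlp(self.norm3(x))


if __name__ == "__main__":
    # A batch of 2 tracklets, 8 frames each, 16 patch features per frame.
    feats = torch.randn(2, 8, 16, 256)
    out = SpatialTemporalBlock(dim=256, heads=4)(feats)
    print(out.shape)  # torch.Size([2, 8, 16, 256])
```
Factorizing attention this way keeps the attention cost at roughly patches-squared plus frames-squared, rather than the full (frames x patches)-squared of a joint spatiotemporal attention, which is one common way to reduce over-fitting risk on small video re-identification datasets.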
Related papers
- iTransformer: Inverted Transformers Are Effective for Time Series Forecasting [62.40166958002558]
We propose iTransformer, which simply applies the attention and feed-forward network on the inverted dimensions.
The iTransformer model achieves state-of-the-art results on challenging real-world datasets.
arXiv Detail & Related papers (2023-10-10T13:44:09Z)
- On the Effect of Pre-training for Transformer in Different Modality on Offline Reinforcement Learning [0.0]
We investigate how pre-training on data of different modalities, such as language and vision, affects the fine-tuning of Transformer-based models on MuJoCo offline reinforcement learning tasks.
arXiv Detail & Related papers (2022-11-17T13:34:08Z)
- Pix4Point: Image Pretrained Standard Transformers for 3D Point Cloud Understanding [62.502694656615496]
We present Progressive Point Patch Embedding and a new point cloud Transformer model named PViT.
PViT shares the same backbone as the standard Transformer but is shown to be less data-hungry, enabling the Transformer to achieve performance comparable to the state of the art.
We formulate a simple yet effective pipeline dubbed "Pix4Point" that allows harnessing Transformers pretrained in the image domain to enhance downstream point cloud understanding.
arXiv Detail & Related papers (2022-08-25T17:59:29Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Visformer: The Vision-friendly Transformer [105.52122194322592]
We propose a new architecture named Visformer, which is abbreviated from 'Vision-friendly Transformer'.
With the same computational complexity, Visformer outperforms both the Transformer-based and convolution-based models in terms of ImageNet classification accuracy.
arXiv Detail & Related papers (2021-04-26T13:13:03Z)
- A Video Is Worth Three Views: Trigeminal Transformers for Video-based Person Re-identification [77.08204941207985]
Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
arXiv Detail & Related papers (2021-04-05T02:50:16Z)
- Toward Transformer-Based Object Detection [12.704056181392415]
Vision Transformers can be used as a backbone by a common detection task head to produce competitive COCO results.
ViT-FRCNN demonstrates several known properties associated with transformers, including large pretraining capacity and fast fine-tuning performance.
We view ViT-FRCNN as an important stepping stone toward a pure-transformer solution of complex vision tasks such as object detection.
arXiv Detail & Related papers (2020-12-17T22:33:14Z)
- Developing Real-time Streaming Transformer Transducer for Speech Recognition on Large-scale Dataset [37.619200507404145]
We explore Transformer Transducer (T-T) models for first-pass decoding with low latency and fast speed on a large-scale dataset.
We combine the idea of Transformer-XL and chunk-wise streaming processing to design a streamable Transformer Transducer model.
We demonstrate that T-T outperforms the hybrid model, RNN Transducer (RNN-T), and streamable Transformer attention-based encoder-decoder model in the streaming scenario.
arXiv Detail & Related papers (2020-10-22T03:01:21Z)