A Video Is Worth Three Views: Trigeminal Transformers for Video-based
Person Re-identification
- URL: http://arxiv.org/abs/2104.01745v1
- Date: Mon, 5 Apr 2021 02:50:16 GMT
- Title: A Video Is Worth Three Views: Trigeminal Transformers for Video-based
Person Re-identification
- Authors: Xuehu Liu and Pingping Zhang and Chenyang Yu and Huchuan Lu and
Xuesheng Qian and Xiaoyun Yang
- Abstract summary: Video-based person re-identification (Re-ID) aims to retrieve video sequences of the same person under non-overlapping cameras.
We propose a novel framework named Trigeminal Transformers (TMT) for video-based person Re-ID.
- Score: 77.08204941207985
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Video-based person re-identification (Re-ID) aims to retrieve video sequences
of the same person under non-overlapping cameras. Previous methods usually
focus on a limited set of views, such as the spatial, temporal, or
spatial-temporal view, and therefore lack observations from different feature
domains. To capture richer perceptions and extract more comprehensive video
representations, in this paper we propose a novel framework named Trigeminal
Transformers (TMT) for video-based person Re-ID. More specifically, we design a
trigeminal feature extractor to jointly transform raw video data into the
spatial, temporal, and spatial-temporal domains. In addition, inspired by the
great success of vision transformers, we introduce the transformer structure to
video-based person Re-ID. In our work, three self-view transformers are
proposed to exploit the relationships between local features for information
enhancement in the spatial, temporal, and spatial-temporal domains. Moreover, a
cross-view transformer is proposed to aggregate the multi-view features into
comprehensive video representations. Experimental results indicate that our
approach outperforms other state-of-the-art approaches on public Re-ID
benchmarks. We will release the code for model reproduction.
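
The abstract describes a three-branch architecture: a trigeminal feature
extractor produces spatial, temporal, and spatial-temporal token sequences,
three self-view transformers refine each view, and a cross-view transformer
aggregates them into a joint video representation. Below is a minimal PyTorch
sketch of that pipeline, assuming standard multi-head self-attention blocks;
the module names (SelfViewTransformer, CrossViewTransformer, TMTSketch) and all
hyperparameters are illustrative assumptions, not taken from the authors'
released code.

```python
# Hypothetical sketch of the TMT pipeline outlined in the abstract.
# Module names and hyperparameters are illustrative, not the official code.
import torch
import torch.nn as nn


class SelfViewTransformer(nn.Module):
    """Models relationships between local features within a single view."""

    def __init__(self, dim: int, heads: int = 8, layers: int = 2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=layers)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim) local features for one view
        return self.encoder(tokens)


class CrossViewTransformer(nn.Module):
    """Aggregates one view with the other views via cross-attention."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_view, other_views):
        # query_view attends to the concatenation of the other two views.
        context = torch.cat(other_views, dim=1)
        fused, _ = self.attn(query_view, context, context)
        return self.norm(query_view + fused)


class TMTSketch(nn.Module):
    """Three self-view branches followed by cross-view aggregation."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.self_view = nn.ModuleList(SelfViewTransformer(dim) for _ in range(3))
        self.cross_view = CrossViewTransformer(dim)

    def forward(self, spatial, temporal, spatiotemporal):
        # Each input: (batch, num_tokens, dim), produced by a trigeminal
        # feature extractor (backbone) that is omitted here for brevity.
        views = [f(x) for f, x in zip(self.self_view, (spatial, temporal, spatiotemporal))]
        fused = [
            self.cross_view(views[i], [views[j] for j in range(3) if j != i])
            for i in range(3)
        ]
        # Average over tokens per view, then concatenate the three views.
        return torch.cat([v.mean(dim=1) for v in fused], dim=-1)
```

With 512-dimensional tokens, feeding three (batch, tokens, 512) view sequences
yields a (batch, 1536) video descriptor; the exact backbone, fusion scheme, and
losses should be taken from the paper and its official code once released.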
Related papers
- SITAR: Semi-supervised Image Transformer for Action Recognition [20.609596080624662]
This paper addresses video action recognition in a semi-supervised setting by leveraging only a handful of labeled videos.
We capitalize on the vast pool of unlabeled samples and employ contrastive learning on the encoded super images.
Our method demonstrates superior performance compared to existing state-of-the-art approaches for semi-supervised action recognition.
arXiv Detail & Related papers (2024-09-04T17:49:54Z)
- Multi-entity Video Transformers for Fine-Grained Video Representation Learning [36.31020249963468]
We re-examine the design of transformer architectures for video representation learning.
A salient aspect of our self-supervised method is the improved integration of spatial information in the temporal pipeline.
Our Multi-entity Video Transformer (MV-Former) architecture achieves state-of-the-art results on multiple fine-grained video benchmarks.
arXiv Detail & Related papers (2023-11-17T21:23:12Z)
- VDT: General-purpose Video Diffusion Transformers via Mask Modeling [62.71878864360634]
Video Diffusion Transformer (VDT) pioneers the use of transformers in diffusion-based video generation.
We propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios.
arXiv Detail & Related papers (2023-05-22T17:59:45Z)
- Deeply-Coupled Convolution-Transformer with Spatial-temporal Complementary Learning for Video-based Person Re-identification [91.56939957189505]
We propose a novel spatial-temporal complementary learning framework named Deeply-Coupled Convolution-Transformer (DCCT) for high-performance video-based person Re-ID.
Our framework attains better performance than most state-of-the-art methods.
arXiv Detail & Related papers (2023-04-27T12:16:44Z)
- Long-Short Temporal Contrastive Learning of Video Transformers [62.71874976426988]
Self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results on par with or better than those obtained with supervised pretraining on large-scale image datasets.
Our approach, named Long-Short Temporal Contrastive Learning, enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent.
arXiv Detail & Related papers (2021-06-17T02:30:26Z)
- Space-time Mixing Attention for Video Transformer [55.50839896863275]
We propose a Video Transformer model whose complexity scales linearly with the number of frames in the video sequence.
We demonstrate that our model produces very high recognition accuracy on the most popular video recognition datasets.
arXiv Detail & Related papers (2021-06-10T17:59:14Z)
- Spatiotemporal Transformer for Video-based Person Re-identification [102.58619642363958]
We show that, despite its strong learning ability, the vanilla Transformer suffers from an increased risk of over-fitting.
We propose a novel pipeline where the model is pre-trained on a set of synthesized video data and then transferred to the downstream domains.
The derived algorithm achieves significant accuracy gains on three popular video-based person re-identification benchmarks.
arXiv Detail & Related papers (2021-03-30T16:19:27Z)