SPARTAN: Self-supervised Spatiotemporal Transformers Approach to Group
Activity Recognition
- URL: http://arxiv.org/abs/2303.12149v4
- Date: Mon, 28 Aug 2023 14:13:16 GMT
- Title: SPARTAN: Self-supervised Spatiotemporal Transformers Approach to Group
Activity Recognition
- Authors: Naga VS Raviteja Chappa, Pha Nguyen, Alexander H Nelson, Han-Seok Seo,
Xin Li, Page Daniel Dobbs, Khoa Luu
- Abstract summary: We propose a new, simple, and effective Self-supervised Spatio-temporal Transformers (SPARTAN) approach to Group Activity Recognition (GAR) using unlabeled video data.
- Score: 47.3759947287782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this paper, we propose a new, simple, and effective Self-supervised
Spatio-temporal Transformers (SPARTAN) approach to Group Activity Recognition
(GAR) using unlabeled video data. Given a video, we create local and global
Spatio-temporal views with varying spatial patch sizes and frame rates. The
proposed self-supervised objective aims to match the features of these
contrasting views of the same video so that they remain consistent under
variations in the spatio-temporal domain. To the best of our knowledge, the
proposed mechanism is one of the first to address the weakly supervised
setting of GAR using video transformer encoders. Furthermore, taking
advantage of transformer models, our proposed approach supports long-term
relationship modeling along spatio-temporal dimensions. The proposed SPARTAN
approach performs well on two group activity recognition benchmarks, including
NBA and Volleyball datasets, by surpassing the state-of-the-art results by a
significant margin in terms of MCA and MPCA metrics.
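As a rough illustration of the view-matching idea, the sketch below builds a
global and a local spatio-temporal view of one clip and pulls their encoded
features together. The view parameters, crop logic, and negative-cosine loss
are assumptions for illustration (patch-size variation inside the transformer
is omitted); this is not the authors' implementation.

```python
# Minimal sketch of a SPARTAN-style objective (hypothetical details):
# build views of one clip at different frame rates and spatial scales,
# then make their encoded features agree.
import torch
import torch.nn.functional as F

def make_views(video, global_stride=2, local_stride=4, local_size=112):
    """video: (C, T, H, W) with H, W >= local_size. Returns global/local views."""
    global_view = video[:, ::global_stride]        # lower frame rate, full frame
    local_clip = video[:, ::local_stride]          # even sparser temporal sampling
    _, _, h, w = local_clip.shape
    y = torch.randint(0, h - local_size + 1, (1,)).item()
    x = torch.randint(0, w - local_size + 1, (1,)).item()
    local_view = local_clip[:, :, y:y + local_size, x:x + local_size]
    return global_view, local_view

def consistency_loss(f_global, f_local):
    """Negative cosine similarity between L2-normalized clip embeddings."""
    f_g = F.normalize(f_global, dim=-1)
    f_l = F.normalize(f_local, dim=-1)
    return -(f_g * f_l).sum(dim=-1).mean()

# usage with any video encoder mapping a clip to an embedding (hypothetical):
# g, l = make_views(clip); loss = consistency_loss(encoder(g), encoder(l))
```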
Related papers
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
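A minimal sketch of how such correspondences can drive segmentation, assuming
per-frame patch features from a DINO-pretrained ViT are already extracted; the
top-k label propagation below is an illustrative stand-in, not the paper's
exact method.

```python
# Hypothetical sketch: patch features define frame-to-frame affinities
# that transfer a soft segmentation mask from one frame to the next.
import torch
import torch.nn.functional as F

def propagate_mask(feat_prev, feat_next, mask_prev, topk=5, temp=0.07):
    """feat_*: (N, D) patch features; mask_prev: (N,) soft labels in [0, 1]."""
    a = F.normalize(feat_next, dim=-1) @ F.normalize(feat_prev, dim=-1).T  # (N, N)
    vals, idx = a.topk(topk, dim=-1)       # keep the strongest correspondences
    w = F.softmax(vals / temp, dim=-1)     # attention-style weights per patch
    return (w * mask_prev[idx]).sum(dim=-1)  # weighted label transfer
```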
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition [47.3759947287782]
This paper introduces a novel approach to Social Group Activity Recognition (SoGAR) using self-supervised transformers.
Our objective ensures that features extracted from contrasting views are consistent across spatio-temporal domains.
Our proposed SoGAR method achieves state-of-the-art results on three group activity recognition benchmarks.
arXiv Detail & Related papers (2023-04-27T03:41:15Z)
- Video Frame Interpolation Transformer [86.20646863821908]
We propose a Transformer-based video interpolation framework that allows content-aware aggregation weights and considers long-range dependencies with the self-attention operations.
To avoid the high computational cost of global self-attention, we introduce the concept of local attention into video interpolation.
In addition, we develop a multi-scale frame synthesis scheme to fully realize the potential of Transformers.
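The sketch below shows one common way to realize local attention, restricting
self-attention to small spatial windows; the window layout and sizes are
assumptions, not the paper's exact design.

```python
# Illustrative windowed self-attention: tokens only attend within their
# own spatial window, avoiding the cost of global attention.
import torch
import torch.nn as nn

class LocalAttention(nn.Module):
    def __init__(self, dim, window=8, heads=4):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):              # x: (B, H, W, C), H and W divisible by window
        b, h, w, c = x.shape
        s = self.window
        x = x.view(b, h // s, s, w // s, s, c).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, s * s, c)    # one token sequence per window
        x, _ = self.attn(x, x, x)      # attention stays inside each window
        x = x.view(b, h // s, w // s, s, s, c).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(b, h, w, c)
```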
arXiv Detail & Related papers (2021-11-27T05:35:10Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception: local temporal dependency from adjacent frames and global semantic correlation over long-range durations.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method largely outperforms state-of-the-art works in segmentation accuracy while maintaining real-time speed.
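A hypothetical sketch of a dual-memory read: the query frame attends
separately to a local memory of adjacent frames and a global memory of
long-range frames, and the two contexts are fused; names and the fusion
scheme are assumptions, not DMNet's code.

```python
# Sketch of a global-local memory readout via soft attention.
import torch
import torch.nn.functional as F

def memory_read(query, memory, temp=0.1):
    """query: (Nq, D); memory: (Nm, D). Soft attention readout over memory."""
    attn = F.softmax(query @ memory.T / temp, dim=-1)
    return attn @ memory

def dual_memory_fuse(query, local_mem, global_mem):
    local_ctx = memory_read(query, local_mem)    # fine-grained short-term cues
    global_ctx = memory_read(query, global_mem)  # long-range semantic context
    return torch.cat([query, local_ctx, global_ctx], dim=-1)
```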
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- Temporal Action Proposal Generation with Transformers [25.66256889923748]
This paper presents a unified temporal action proposal generation framework built on Transformers.
The Boundary Transformer captures long-term temporal dependencies to predict precise boundary information.
The Proposal Transformer learns the rich inter-proposal relationships for reliable confidence evaluation.
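A minimal sketch of the two branches, assuming precomputed snippet and
proposal features; layer counts, widths, and heads are illustrative
assumptions.

```python
# Sketch: one encoder scores per-snippet boundary probabilities, another
# scores candidate proposals for confidence evaluation.
import torch
import torch.nn as nn

class BoundaryTransformer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.head = nn.Linear(dim, 2)          # start / end probabilities

    def forward(self, snippets):               # (B, T, dim) snippet features
        return self.head(self.encoder(snippets)).sigmoid()

class ProposalTransformer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=3):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.score = nn.Linear(dim, 1)

    def forward(self, proposals):              # (B, N, dim) proposal features
        return self.score(self.encoder(proposals)).sigmoid()  # confidence per proposal
```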
arXiv Detail & Related papers (2021-05-25T16:22:12Z)
- Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation [79.98992138865042]
We present an augmented transformer with adaptive graph network (ATAG) to exploit both long-range and local temporal contexts for temporal action proposal generation (TAPG).
Specifically, we enhance the vanilla transformer by equipping it with a snippet actionness loss and a front block, dubbed the augmented transformer.
An adaptive graph convolutional network (GCN) is proposed to build local temporal context by mining the position information and difference between adjacent features.
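As a rough sketch, an adaptive graph layer can derive its adjacency from
pairwise feature differences rather than a fixed graph; the all-pairs
construction below is a simplification of what the paper describes, which
focuses on adjacent snippets.

```python
# Sketch of an adaptive graph convolution with data-dependent adjacency.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveGCN(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.edge = nn.Linear(dim, 1)

    def forward(self, x):                       # x: (B, T, D) snippet features
        diff = x.unsqueeze(2) - x.unsqueeze(1)  # (B, T, T, D) pairwise differences
        adj = F.softmax(self.edge(diff).squeeze(-1), dim=-1)  # learned adjacency
        return F.relu(self.proj(adj @ x))       # one graph-convolution step
```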
arXiv Detail & Related papers (2021-03-30T02:01:03Z)
- Actor-Transformers for Group Activity Recognition [43.60866347282833]
This paper strives to recognize individual actions and group activities from videos.
We propose an actor-transformer model able to learn and selectively extract information relevant for group activity recognition.
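A hypothetical sketch of that design: per-actor features pass through a
transformer encoder so actors can exchange information, then are pooled into
a single group-level prediction; dimensions and pooling are assumptions.

```python
# Sketch of an actor-transformer head for group activity classification.
import torch
import torch.nn as nn

class ActorTransformer(nn.Module):
    def __init__(self, dim=256, heads=4, layers=2, num_activities=8):
        super().__init__()
        enc = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, layers)
        self.group_head = nn.Linear(dim, num_activities)

    def forward(self, actors):                 # (B, N_actors, dim) actor features
        x = self.encoder(actors)               # actors attend to each other
        return self.group_head(x.mean(dim=1))  # pooled group-activity logits
```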
arXiv Detail & Related papers (2020-03-28T07:21:58Z)