SoGAR: Self-supervised Spatiotemporal Attention-based Social Group
Activity Recognition
- URL: http://arxiv.org/abs/2305.06310v3
- Date: Mon, 28 Aug 2023 14:18:25 GMT
- Title: SoGAR: Self-supervised Spatiotemporal Attention-based Social Group
Activity Recognition
- Authors: Naga VS Raviteja Chappa, Pha Nguyen, Alexander H Nelson, Han-Seok Seo,
Xin Li, Page Daniel Dobbs, Khoa Luu
- Abstract summary: This paper introduces SoGAR, a novel approach to Social Group Activity Recognition using self-supervised Transformers.
Our self-supervised objective ensures that features extracted from contrasting views of the same video are consistent across spatio-temporal domains.
Our proposed SoGAR method achieves state-of-the-art results on three group activity recognition benchmarks.
- Score: 47.3759947287782
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper introduces a novel approach to Social Group Activity
Recognition (SoGAR) using a self-supervised Transformer network that can
effectively utilize unlabeled video data. To extract spatio-temporal
information, we create local and global views with varying frame rates. Our
self-supervised objective ensures that features extracted from contrasting
views of the same video are consistent across spatio-temporal domains. Our
approach efficiently uses transformer-based encoders to alleviate the weakly
supervised setting of group activity recognition. By leveraging the benefits
of transformer models, our approach can model long-term relationships along
spatio-temporal dimensions. Our proposed SoGAR method achieves
state-of-the-art results on three group activity recognition benchmarks,
namely the JRDB-PAR, NBA, and Volleyball datasets, surpassing prior results
on the F1-score, MCA, and MPCA metrics.
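
As a rough illustration of the cross-view objective described above, the sketch below samples a sparse global clip and a dense local clip from the same video and penalizes disagreement between their encoded features. Everything here (function names, clip lengths, strides, the toy flatten-and-project encoder, and the negative-cosine loss) is an assumed stand-in for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sample_clip(video, num_frames, stride, start=0):
    """Sample `num_frames` frames at a fixed temporal stride.
    video: tensor of shape (T, C, H, W)."""
    idx = torch.arange(start, start + num_frames * stride, stride) % video.shape[0]
    return video[idx]

def cross_view_loss(student, teacher, video):
    """Pull together features of a sparse global view and a dense
    local view of the same video (negative cosine similarity)."""
    # Global view: low frame rate, long temporal span.
    global_view = sample_clip(video, num_frames=8, stride=8)
    # Local view: high frame rate, short temporal span.
    local_view = sample_clip(video, num_frames=8, stride=1)

    with torch.no_grad():  # teacher acts as a fixed target network
        target = F.normalize(teacher(global_view.unsqueeze(0)), dim=-1)
    pred = F.normalize(student(local_view.unsqueeze(0)), dim=-1)
    return -(pred * target).sum(dim=-1).mean()

# Usage with a stand-in encoder mapping (B, T, C, H, W) -> (B, D):
encoder = torch.nn.Sequential(torch.nn.Flatten(start_dim=1), torch.nn.LazyLinear(128))
video = torch.randn(64, 3, 32, 32)  # 64 frames
loss = cross_view_loss(encoder, encoder, video)
loss.backward()
```

In the paper's setting the encoders would be video transformers, and the teacher would typically be a momentum (EMA) copy of the student rather than, as in this toy sketch, the same network.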
Related papers
- Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z)
- DECOMPL: Decompositional Learning with Attention Pooling for Group Activity Recognition from a Single Volleyball Image [3.6144103736375857]
Group Activity Recognition (GAR) aims to detect the activity performed by multiple actors in a scene.
We propose a novel GAR technique for volleyball videos, DECOMPL, which consists of two complementary branches.
In the visual branch, it selectively extracts actor features using attention pooling (a generic sketch appears after this list).
In the coordinate branch, it considers the current configuration of the actors and extracts spatial information from the box coordinates.
arXiv Detail & Related papers (2023-03-11T16:30:51Z)
- SPARTAN: Self-supervised Spatiotemporal Transformers Approach to Group Activity Recognition [47.3759947287782]
We propose a new, simple, and effective Self-supervised Spatio-temporal Transformers Approach (SPARTAN) to Group Activity Recognition (GAR) using unlabeled video data.
arXiv Detail & Related papers (2023-03-06T16:58:27Z)
- Efficient Global-Local Memory for Real-time Instrument Segmentation of Robotic Surgical Video [53.14186293442669]
We identify two important clues for surgical instrument perception, including local temporal dependency from adjacent frames and global semantic correlation in long-range duration.
We propose a novel dual-memory network (DMNet) to relate both global and local temporal knowledge.
Our method significantly outperforms state-of-the-art works in segmentation accuracy while maintaining real-time speed.
arXiv Detail & Related papers (2021-09-28T10:10:14Z)
- GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer [16.988878921451484]
GroupFormer captures spatial-temporal contextual information jointly to augment the individual and group representations.
The proposed framework outperforms state-of-the-art methods on the Volleyball dataset and Collective Activity dataset.
arXiv Detail & Related papers (2021-08-28T11:24:36Z)
- Modeling long-term interactions to enhance action recognition [81.09859029964323]
We propose a new approach to understand actions in egocentric videos that exploits the semantics of object interactions at both frame and temporal levels.
We use a region-based approach that takes as input a primary region roughly corresponding to the user hands and a set of secondary regions potentially corresponding to the interacting objects.
The proposed approach outperforms the state-of-the-art in terms of action recognition on standard benchmarks.
arXiv Detail & Related papers (2021-04-23T10:08:15Z)
- Domain Adaptive Robotic Gesture Recognition with Unsupervised Kinematic-Visual Data Alignment [60.31418655784291]
We propose a novel unsupervised domain adaptation framework which can simultaneously transfer multi-modality knowledge, i.e., both kinematic and visual data, from simulator to real robot.
It remedies the domain gap with enhanced transferable features by using temporal cues in videos and inherent correlations across the modalities for gesture recognition.
Results show that our approach recovers performance with large gains, up to 12.91% in ACC and 20.16% in F1-score, without using any annotations on the real robot.
arXiv Detail & Related papers (2021-03-06T09:10:03Z)
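
The DECOMPL entry above mentions attention pooling over per-actor features; as a point of reference, a minimal generic sketch of that operation follows. The class name, dimensions, and single-linear scoring head are assumptions for illustration, not the DECOMPL implementation.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Pool a variable number of per-actor features into one group
    feature via learned attention weights (generic illustration)."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # one scalar score per actor

    def forward(self, actor_feats):
        # actor_feats: (B, N, dim), N actors detected in the scene
        weights = torch.softmax(self.score(actor_feats), dim=1)  # (B, N, 1)
        return (weights * actor_feats).sum(dim=1)                # (B, dim)

# Usage: pool 12 players' 256-d features into a single group vector.
pool = AttentionPooling(dim=256)
group_feat = pool(torch.randn(2, 12, 256))  # shape (2, 256)
```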