DECOMPL: Decompositional Learning with Attention Pooling for Group
Activity Recognition from a Single Volleyball Image
- URL: http://arxiv.org/abs/2303.06439v1
- Date: Sat, 11 Mar 2023 16:30:51 GMT
- Authors: Berker Demirel, Huseyin Ozkan
- Abstract summary: Group Activity Recognition (GAR) aims to detect the activity performed by multiple actors in a scene.
We propose a novel GAR technique for volleyball videos, DECOMPL, which consists of two complementary branches.
The visual branch extracts features selectively via attention pooling.
The coordinate branch considers the current configuration of the actors and extracts spatial information from their bounding-box coordinates.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group Activity Recognition (GAR) aims to detect the activity performed by
multiple actors in a scene. Prior works model the spatio-temporal features
based on the RGB, optical flow or keypoint data types. However, using
temporality together with these data types significantly increases the
computational complexity. Our hypothesis is that by only using the RGB data
without temporality, the performance can be maintained with a negligible loss
in accuracy. To that end, we propose a novel GAR technique for volleyball
videos, DECOMPL, which consists of two complementary branches. In the visual
branch, it extracts the features using attention pooling in a selective way. In
the coordinate branch, it considers the current configuration of the actors and
extracts the spatial information from the box coordinates. Moreover, we
analyzed the Volleyball dataset that the recent literature is mostly based on,
and realized that its labeling scheme degrades the group concept in the
activities to the level of individual actors. We manually reannotated the
dataset in a systematic manner to emphasize the group concept. Experimental
results on the Volleyball as well as Collective Activity (from another domain,
i.e., not volleyball) datasets demonstrated the effectiveness of the proposed
model DECOMPL, which delivered the best/second best GAR performance with the
reannotations/original annotations among the comparable state-of-the-art
techniques. Our code, results and new annotations will be made available
through GitHub after the revision process.
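The attention-pooling step of the visual branch can be illustrated with a minimal, framework-free sketch. The function, scoring vector, and per-actor feature vectors below are hypothetical illustrations of the general technique, not the authors' implementation (which is learned end-to-end):

```python
import math

def attention_pool(features, scoring_w):
    """Pool per-actor feature vectors into a single group-level descriptor.

    features  -- list of equal-length feature vectors, one per actor
    scoring_w -- scoring vector (hypothetical; in practice learned jointly
                 with the rest of the network)
    """
    # Score each actor: dot product of its features with the scoring vector.
    scores = [sum(w * f for w, f in zip(scoring_w, feat)) for feat in features]

    # Softmax over actors (numerically stabilized) gives attention weights.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Weighted sum of actor features yields the group-level descriptor.
    dim = len(features[0])
    return [sum(weights[i] * features[i][d] for i in range(len(features)))
            for d in range(dim)]
```

Because the weights come from a softmax, actors deemed more relevant by the scoring vector dominate the pooled descriptor, which is what makes the pooling "selective" compared to plain averaging.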
Related papers
- Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph [4.075741925017479]
Group Activity Recognition aims to understand collective activities from videos.
Existing solutions rely on the RGB modality, which encounters challenges such as background variations.
We design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity.
arXiv Detail & Related papers (2024-07-28T13:57:03Z) - SoGAR: Self-supervised Spatiotemporal Attention-based Social Group Activity Recognition [45.419756454791674]
This paper introduces a novel approach to Social Group Activity (SoGAR) using Self-supervised Transformers.
Our objective ensures that features extracted from contrasting views are consistent across spatio-temporal domains.
Our proposed SoGAR method achieved state-of-the-art results on three group activity recognition benchmarks.
arXiv Detail & Related papers (2023-04-27T03:41:15Z) - Knowledge Combination to Learn Rotated Detection Without Rotated
Annotation [53.439096583978504]
Rotated bounding boxes drastically reduce output ambiguity of elongated objects.
Despite the effectiveness, rotated detectors are not widely employed.
We propose a framework that allows the model to predict precise rotated boxes.
arXiv Detail & Related papers (2023-04-05T03:07:36Z) - Simplifying Open-Set Video Domain Adaptation with Contrastive Learning [16.72734794723157]
Unsupervised video domain adaptation methods have been proposed to adapt a predictive model from a labelled dataset to an unlabelled one.
We address a more realistic scenario, called open-set video domain adaptation (OUVDA), where the target dataset contains "unknown" semantic categories that are not shared with the source.
We propose a video-oriented temporal contrastive loss that enables our method to better cluster the feature space by exploiting the freely available temporal information in video data.
arXiv Detail & Related papers (2023-01-09T13:16:50Z) - Self-Supervised Place Recognition by Refining Temporal and Featural Pseudo Labels from Panoramic Data [16.540900776820084]
We propose a novel framework named TF-VPR that uses temporal neighborhoods and learnable feature neighborhoods to discover unknown spatial neighborhoods.
Our method outperforms self-supervised baselines in recall rate, robustness, and heading diversity.
arXiv Detail & Related papers (2022-08-19T12:59:46Z) - Learning from Temporal Spatial Cubism for Cross-Dataset Skeleton-based
Action Recognition [88.34182299496074]
Action labels are only available on a source dataset, but unavailable on a target dataset in the training stage.
We utilize a self-supervision scheme to reduce the domain shift between two skeleton-based action datasets.
By segmenting and permuting temporal segments or human body parts, we design two self-supervised learning classification tasks.
arXiv Detail & Related papers (2022-07-17T07:05:39Z) - Representing Videos as Discriminative Sub-graphs for Action Recognition [165.54738402505194]
We introduce a new design of sub-graphs to represent and encode the discriminative patterns of each action in the videos.
We present the MUlti-scale Sub-graph LEarning (MUSLE) framework that novelly builds space-time graphs and clusters them into compact sub-graphs on each scale.
arXiv Detail & Related papers (2022-01-11T16:15:25Z) - COMPOSER: Compositional Learning of Group Activity in Videos [33.526331969279106]
Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip.
We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale.
COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.
arXiv Detail & Related papers (2021-12-11T01:25:46Z) - Social Adaptive Module for Weakly-supervised Group Activity Recognition [143.68241396839062]
This paper presents a new task named weakly-supervised group activity recognition (GAR).
It differs from conventional GAR tasks in that only video-level labels are available, yet the important persons within each frame are not provided even in the training data.
This makes it easier to collect and annotate a large-scale NBA dataset, which in turn raises new challenges for GAR.
arXiv Detail & Related papers (2020-07-18T16:40:55Z) - FineGym: A Hierarchical Video Dataset for Fine-grained Action
Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z) - Pairwise Similarity Knowledge Transfer for Weakly Supervised Object
Localization [53.99850033746663]
We study the problem of learning a localization model on target classes with weakly supervised image labels.
In this work, we argue that learning only an objectness function is a weak form of knowledge transfer.
Experiments on the COCO and ILSVRC 2013 detection datasets show that the performance of the localization model improves significantly with the inclusion of pairwise similarity function.
arXiv Detail & Related papers (2020-03-18T17:53:33Z)
This list is automatically generated from the titles and abstracts of the papers in this site.