Group Activity Recognition using Unreliable Tracked Pose
- URL: http://arxiv.org/abs/2401.03262v1
- Date: Sat, 6 Jan 2024 17:36:13 GMT
- Title: Group Activity Recognition using Unreliable Tracked Pose
- Authors: Haritha Thilakarathne, Aiden Nibali, Zhen He, Stuart Morgan
- Abstract summary: Group activity recognition in video is a complex task due to the need for a model to recognise the actions of all individuals in the video.
We introduce an innovative deep learning-based group activity recognition approach called Rendered Pose based Group Activity Recognition System (RePGARS)
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group activity recognition in video is a complex task due to the need for a
model to recognise the actions of all individuals in the video and their
complex interactions. Recent studies propose that optimal performance is
achieved by individually tracking each person and subsequently inputting the
sequence of poses or cropped images/optical flow into a model. This helps the
model to recognise what actions each person is performing before they are
merged to arrive at the group action class. However, all previous models are
highly reliant on high quality tracking and have only been evaluated using
ground truth tracking information. In practice it is almost impossible to
achieve highly reliable tracking information for all individuals in a group
activity video. We introduce an innovative deep learning-based group activity
recognition approach called Rendered Pose based Group Activity Recognition
System (RePGARS) which is designed to be tolerant of unreliable tracking and
pose information. Experimental results confirm that RePGARS outperforms all
existing group activity recognition algorithms we tested that do not use
ground-truth detection and tracking information.
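The central idea of tolerating unreliable tracking by rendering pose keypoints into an image-like tensor can be illustrated with a minimal sketch. The function name, Gaussian rendering, and NaN convention below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def render_pose_heatmaps(keypoints, height, width, sigma=2.0):
    """Render K (x, y) keypoints as K Gaussian heatmaps.

    keypoints: array of shape (K, 2) in pixel coordinates. NaN marks a
    missing or unreliable detection, which simply leaves its channel
    empty, so the representation degrades gracefully under tracking
    errors instead of feeding the model a wrong location.
    """
    K = keypoints.shape[0]
    maps = np.zeros((K, height, width), dtype=np.float32)
    ys, xs = np.mgrid[0:height, 0:width]
    for k, (x, y) in enumerate(keypoints):
        if np.isnan(x) or np.isnan(y):
            continue  # unreliable keypoint: leave the channel blank
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return maps
```

A downstream convolutional network consuming such heatmaps never sees hard per-person identity assignments, which is one plausible way a rendered-pose representation can absorb tracking noise.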
Related papers
- VicKAM: Visual Conceptual Knowledge Guided Action Map for Weakly Supervised Group Activity Recognition [14.701516591822358]
Existing weakly supervised group activity recognition methods rely on object detectors or attention mechanisms to capture key areas automatically.
We propose a novel framework named Visual Conceptual Knowledge Guided Action Map (VicKAM)
VicKAM effectively captures the locations of individual actions and integrates them with action semantics for weakly supervised group activity recognition.
arXiv Detail & Related papers (2025-02-14T07:49:06Z)
- Towards More Practical Group Activity Detection: A New Benchmark and Model [61.39427407758131]
Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video.
We present a new dataset, dubbed Café, which offers more practical scenarios and metrics.
We also propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively.
arXiv Detail & Related papers (2023-12-05T16:48:17Z)
- Single-Shot and Multi-Shot Feature Learning for Multi-Object Tracking [55.13878429987136]
We propose a simple yet effective two-stage feature learning paradigm to jointly learn single-shot and multi-shot features for different targets.
Our method has achieved significant improvements on MOT17 and MOT20 datasets while reaching state-of-the-art performance on DanceTrack dataset.
arXiv Detail & Related papers (2023-11-17T08:17:49Z)
- ALP: Action-Aware Embodied Learning for Perception [60.64801970249279]
We introduce Action-Aware Embodied Learning for Perception (ALP)
ALP incorporates action information into representation learning through a combination of optimizing a reinforcement learning policy and an inverse dynamics prediction objective.
We show that ALP outperforms existing baselines in several downstream perception tasks.
arXiv Detail & Related papers (2023-06-16T21:51:04Z)
- Multi-dataset Training of Transformers for Robust Action Recognition [75.5695991766902]
We study the task of robust feature representations, aiming to generalize well on multiple datasets for action recognition.
Here, we propose a novel multi-dataset training paradigm, MultiTrain, with the design of two new loss terms, namely informative loss and projection loss.
We verify the effectiveness of our method on five challenging datasets: Kinetics-400, Kinetics-700, Moments-in-Time, ActivityNet, and Something-Something-v2.
arXiv Detail & Related papers (2022-09-26T01:30:43Z)
- Detector-Free Weakly Supervised Group Activity Recognition [41.344689949264335]
Group activity recognition is the task of understanding the activity conducted by a group of people as a whole in a video.
We propose a novel model for group activity recognition that depends neither on bounding box labels nor on object detectors.
Our Transformer-based model localizes and encodes partial contexts of a group activity by leveraging the attention mechanism.
arXiv Detail & Related papers (2022-04-05T12:05:04Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- Pose is all you need: The pose only group activity recognition system (POGARS) [7.876115370275732]
We introduce a novel deep learning based group activity recognition approach called Pose Only Group Activity Recognition System (POGARS)
POGARS uses 1D CNNs to learn the dynamics of individuals involved in a group activity and forgoes learning from pixel data.
Experimental results confirm that POGARS achieves highly competitive results compared to state-of-the-art methods on a widely used public volleyball dataset.
arXiv Detail & Related papers (2021-08-09T17:16:04Z)
- Learning Group Activities from Skeletons without Individual Action Labels [32.60526967706986]
We show that, using only skeletal data, we can train a state-of-the-art end-to-end system with only group activity labels at the sequence level.
Our experiments show that models trained without individual action supervision perform poorly.
Our carefully designed, lean, pose-only architecture shows highly competitive results versus more complex multimodal approaches, even in the self-supervised variant.
arXiv Detail & Related papers (2021-05-14T10:31:32Z)
- Learning View-Disentangled Human Pose Representation by Contrastive Cross-View Mutual Information Maximization [33.36330493757669]
We introduce a novel representation learning method to disentangle pose-dependent as well as view-dependent factors from 2D human poses.
The method trains a network using cross-view mutual information (CV-MIM) which maximizes mutual information of the same pose performed from different viewpoints.
CV-MIM outperforms other competing methods by a large margin in the single-shot cross-view setting.
arXiv Detail & Related papers (2020-12-02T18:55:35Z)
- ZSTAD: Zero-Shot Temporal Activity Detection [107.63759089583382]
We propose a novel task setting called zero-shot temporal activity detection (ZSTAD), where activities that have never been seen in training can still be detected.
We design an end-to-end deep network based on R-C3D as the architecture for this solution.
Experiments on both the THUMOS14 and the Charades datasets show promising performance in terms of detecting unseen activities.
arXiv Detail & Related papers (2020-03-12T02:40:36Z)
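Several of the pose-only systems above (POGARS, and RePGARS by extension) model each person's keypoint sequence with 1D convolutions over time. A minimal sketch of such a temporal convolution follows; the shapes, names, and ReLU choice are illustrative assumptions rather than the papers' actual layers:

```python
import numpy as np

def temporal_conv1d(seq, kernels, bias):
    """Valid 1D convolution over the time axis of a pose sequence.

    seq:     (T, F)   pose features per frame (e.g. flattened keypoints)
    kernels: (C, W, F) C output channels, each a window of W frames
    bias:    (C,)
    returns: (T - W + 1, C) activations after ReLU
    """
    T, F = seq.shape
    C, W, _ = kernels.shape
    out = np.empty((T - W + 1, C), dtype=np.float32)
    for t in range(T - W + 1):
        window = seq[t:t + W]  # (W, F) slice of consecutive frames
        # Contract each kernel against the window over both time and feature axes
        out[t] = np.tensordot(kernels, window, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)  # ReLU
```

Stacking a few such layers per person and pooling across individuals would yield a group-level representation, which is broadly the pipeline these pose-only summaries describe.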
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.