Detector-Free Weakly Supervised Group Activity Recognition
- URL: http://arxiv.org/abs/2204.02139v1
- Date: Tue, 5 Apr 2022 12:05:04 GMT
- Title: Detector-Free Weakly Supervised Group Activity Recognition
- Authors: Dongkeun Kim, Jinsung Lee, Minsu Cho, Suha Kwak
- Abstract summary: Group activity recognition is the task of understanding the activity conducted by a group of people as a whole in a video.
We propose a novel model for group activity recognition that depends neither on bounding box labels nor on object detectors.
Our Transformer-based model localizes and encodes partial contexts of a group activity by leveraging the attention mechanism.
- Score: 41.344689949264335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group activity recognition is the task of understanding the activity
conducted by a group of people as a whole in a multi-person video. Existing
models for this task are often impractical in that they demand ground-truth
bounding box labels of actors even in testing or rely on off-the-shelf object
detectors. Motivated by this, we propose a novel model for group activity
recognition that depends neither on bounding box labels nor on object detectors.
Our Transformer-based model localizes and encodes partial contexts of a
group activity by leveraging the attention mechanism, and represents a video
clip as a set of partial context embeddings. The embedding vectors are then
aggregated to form a single group representation that reflects the entire
context of an activity while capturing temporal evolution of each partial
context. Our method achieves outstanding performance on two benchmarks,
Volleyball and NBA datasets, surpassing not only the state of the art trained
with the same level of supervision, but also some existing models relying on
stronger supervision.
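A minimal sketch of the pooling idea the abstract describes: query vectors attend over per-frame features to extract partial-context embeddings, which are then averaged into a single group representation. All names, the single-head dot-product form, and the plain averaging step are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    # Single-head scaled dot-product attention: one query vector pools a
    # weighted combination of the frame features (one "partial context").
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[i] for w, feat in zip(weights, features))
            for i in range(dim)]


def group_representation(queries, features):
    # Each query yields one partial-context embedding; averaging the set
    # gives a single group-level representation of the clip.
    contexts = [attend(q, features) for q in queries]
    dim = len(contexts[0])
    return [sum(c[i] for c in contexts) / len(contexts) for i in range(dim)]
```

In the actual model the queries would be learned tokens and the attention would run inside Transformer layers over spatio-temporal features; this sketch only shows the pooling arithmetic.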
Related papers
- Group Activity Recognition using Unreliable Tracked Pose [8.592249538742527]
Group activity recognition in video is a complex task because a model must recognise the actions of all individuals in the video.
We introduce an innovative deep learning-based group activity recognition approach called Rendered Pose based Group Activity Recognition System (RePGARS).
arXiv Detail & Related papers (2024-01-06T17:36:13Z)
- Query by Activity Video in the Wild [52.42177539947216]
In current query-by-activity-video literature, a common assumption is that all activities have sufficient labelled examples when learning an embedding.
We propose a visual-semantic embedding network that explicitly deals with the imbalanced scenario for activity retrieval.
arXiv Detail & Related papers (2023-11-23T10:26:36Z)
- Actor-agnostic Multi-label Action Recognition with Multi-modal Query [42.38571663534819]
Existing action recognition methods are typically actor-specific.
This requires actor-specific pose estimation (e.g., humans vs. animals).
We propose a new approach called 'actor-agnostic multi-modal multi-label action recognition'.
arXiv Detail & Related papers (2023-07-20T10:53:12Z)
- Self-supervised Pretraining with Classification Labels for Temporal Activity Detection [54.366236719520565]
Temporal Activity Detection aims to predict activity classes per frame.
Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited.
This work proposes a novel self-supervised pretraining method for detection leveraging classification labels.
arXiv Detail & Related papers (2021-11-26T18:59:28Z)
- Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn the action segments taking only the high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in the video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Learning Group Activities from Skeletons without Individual Action Labels [32.60526967706986]
We show that, using only skeletal data, we can train a state-of-the-art end-to-end system with just group activity labels at the sequence level.
Our experiments show that models trained without individual action supervision perform poorly.
Our carefully designed, lean pose-only architecture shows highly competitive results versus more complex multimodal approaches, even in the self-supervised variant.
arXiv Detail & Related papers (2021-05-14T10:31:32Z)
- MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.