Detector-Free Weakly Supervised Group Activity Recognition
- URL: http://arxiv.org/abs/2204.02139v1
- Date: Tue, 5 Apr 2022 12:05:04 GMT
- Title: Detector-Free Weakly Supervised Group Activity Recognition
- Authors: Dongkeun Kim, Jinsung Lee, Minsu Cho, Suha Kwak
- Abstract summary: Group activity recognition is the task of understanding the activity conducted by a group of people as a whole in a video.
We propose a novel model for group activity recognition that depends neither on bounding box labels nor on object detectors.
Our Transformer-based model localizes and encodes partial contexts of a group activity by leveraging the attention mechanism.
- Score: 41.344689949264335
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Group activity recognition is the task of understanding the activity
conducted by a group of people as a whole in a multi-person video. Existing
models for this task are often impractical in that they demand ground-truth
bounding box labels of actors even in testing or rely on off-the-shelf object
detectors. Motivated by this, we propose a novel model for group activity
recognition that depends neither on bounding box labels nor on object detectors.
Our Transformer-based model localizes and encodes partial contexts of a
group activity by leveraging the attention mechanism, and represents a video
clip as a set of partial context embeddings. The embedding vectors are then
aggregated to form a single group representation that reflects the entire
context of an activity while capturing temporal evolution of each partial
context. Our method achieves outstanding performance on two benchmarks,
Volleyball and NBA datasets, surpassing not only the state of the art trained
with the same level of supervision, but also some existing models relying on
stronger supervision.
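A minimal sketch of the pooling idea the abstract describes: query vectors attend over per-frame features to extract partial-context embeddings, which are then averaged into a single group representation. All names, the single-head dot-product form, and the plain averaging step are illustrative assumptions, not the authors' implementation.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of attention scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    # Single-head scaled dot-product attention: one query vector pools a
    # weighted combination of the frame features (one "partial context").
    d = len(query)
    scores = [sum(q * f for q, f in zip(query, feat)) / math.sqrt(d)
              for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * feat[i] for w, feat in zip(weights, features))
            for i in range(dim)]


def group_representation(queries, features):
    # Each query yields one partial-context embedding; averaging the set
    # gives a single group-level representation of the clip.
    contexts = [attend(q, features) for q in queries]
    dim = len(contexts[0])
    return [sum(c[i] for c in contexts) / len(contexts) for i in range(dim)]
```

In the actual model the queries would be learned tokens and the attention would run inside Transformer layers over spatio-temporal features; this sketch only shows the pooling arithmetic.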
Related papers
- Group Activity Recognition using Unreliable Tracked Pose [8.592249538742527]
Group activity recognition in video is a complex task because a model must recognise the actions of all individuals in the video.
We introduce an innovative deep learning-based group activity recognition approach called Rendered Pose based Group Activity Recognition System (RePGARS).
arXiv Detail & Related papers (2024-01-06T17:36:13Z)
- Query by Activity Video in the Wild [52.42177539947216]
In current query-by-activity-video literature, a common assumption is that all activities have sufficient labelled examples when learning an embedding.
We propose a visual-semantic embedding network that explicitly deals with the imbalanced scenario for activity retrieval.
arXiv Detail & Related papers (2023-11-23T10:26:36Z)
- Actor-agnostic Multi-label Action Recognition with Multi-modal Query [42.38571663534819]
Existing action recognition methods are typically actor-specific.
This requires actor-specific pose estimation (e.g., humans vs. animals).
We propose a new approach called 'actor-agnostic multi-modal multi-label action recognition'.
arXiv Detail & Related papers (2023-07-20T10:53:12Z)
- Self-supervised Pretraining with Classification Labels for Temporal Activity Detection [54.366236719520565]
Temporal Activity Detection aims to predict activity classes per frame.
Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited.
This work proposes a novel self-supervised pretraining method for detection leveraging classification labels.
arXiv Detail & Related papers (2021-11-26T18:59:28Z)
- Temporal Action Segmentation with High-level Complex Activity Labels [29.17792724210746]
We learn the action segments taking only the high-level activity labels as input.
We propose a novel action discovery framework that automatically discovers constituent actions in videos.
arXiv Detail & Related papers (2021-08-15T09:50:42Z)
- Unsupervised Action Segmentation with Self-supervised Feature Learning and Co-occurrence Parsing [32.66011849112014]
Temporal action segmentation is the task of classifying each frame in the video with an action label.
In this work we explore a self-supervised method that operates on a corpus of unlabeled videos and predicts a likely set of temporal segments across the videos.
We develop CAP, a novel co-occurrence action parsing algorithm that can not only capture the correlation among sub-actions underlying the structure of activities, but also estimate the temporal trajectory of the sub-actions in an accurate and general way.
arXiv Detail & Related papers (2021-05-29T00:29:40Z)
- Learning Group Activities from Skeletons without Individual Action Labels [32.60526967706986]
We show that, using only skeletal data, we can train a state-of-the-art end-to-end system with just group activity labels at the sequence level.
Our experiments show that models trained without individual action supervision perform poorly.
Our carefully designed, lean pose-only architecture shows highly competitive results versus more complex multimodal approaches, even in the self-supervised variant.
arXiv Detail & Related papers (2021-05-14T10:31:32Z)
- MIST: Multiple Instance Self-Training Framework for Video Anomaly Detection [76.80153360498797]
We develop a multiple instance self-training framework (MIST) to efficiently refine task-specific discriminative representations.
MIST is composed of 1) a multiple instance pseudo label generator, which adapts a sparse continuous sampling strategy to produce more reliable clip-level pseudo labels, and 2) a self-guided attention boosted feature encoder.
Our method performs comparably to or even better than existing supervised and weakly supervised methods, specifically obtaining a frame-level AUC of 94.83% on ShanghaiTech.
arXiv Detail & Related papers (2021-04-04T15:47:14Z)
- DyStaB: Unsupervised Object Segmentation via Dynamic-Static Bootstrapping [72.84991726271024]
We describe an unsupervised method to detect and segment portions of images of live scenes that are seen moving as a coherent whole.
Our method first partitions the motion field by minimizing the mutual information between segments.
It uses the segments to learn object models that can be used for detection in a static image.
arXiv Detail & Related papers (2020-08-16T22:05:13Z)
- FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding [118.32912239230272]
FineGym is a new action recognition dataset built on top of gymnastic videos.
It provides temporal annotations at both action and sub-action levels with a three-level semantic hierarchy.
This new level of granularity presents significant challenges for action recognition.
arXiv Detail & Related papers (2020-04-14T17:55:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences arising from its use.