Related papers: Computer Vision based group activity detection and action spotting

Computer Vision based group activity detection and action spotting

URL: http://arxiv.org/abs/2511.13315v1
Date: Mon, 17 Nov 2025 12:52:22 GMT
Title: Computer Vision based group activity detection and action spotting
Authors: Narthana Sivalingam, Santhirarajah Sivasthigan, Thamayanthi Mahendranathan, G. M. R. I. Godaliyadda, M. P. B. Ekanayake, H. M. V. R. Herath,
Abstract summary: Group activity detection in multi-person scenes is challenging due to complex human interactions and variations in appearance over time.<n>This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning.
Score: 0.0
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: Group activity detection in multi-person scenes is challenging due to complex human interactions, occlusions, and variations in appearance over time. This work presents a computer vision based framework for group activity recognition and action spotting using a combination of deep learning models and graph based relational reasoning. The system first applies Mask R-CNN to obtain accurate actor localization through bounding boxes and instance masks. Multiple backbone networks, including Inception V3, MobileNet, and VGG16, are used to extract feature maps, and RoIAlign is applied to preserve spatial alignment when generating actor specific features. The mask information is then fused with the feature maps to obtain refined masked feature representations for each actor. To model interactions between individuals, we construct Actor Relation Graphs that encode appearance similarity and positional relations using methods such as normalized cross correlation, sum of absolute differences, and dot product. Graph Convolutional Networks operate on these graphs to reason about relationships and predict both individual actions and group level activities. Experiments on the Collective Activity dataset demonstrate that the combination of mask based feature refinement, robust similarity search, and graph neural network reasoning leads to improved recognition performance across both crowded and non crowded scenarios. This approach highlights the potential of integrating segmentation, feature extraction, and relational graph reasoning for complex video understanding tasks.

Related papers

Learning Human-Object Interaction as Groups [52.28258599873394]
GroupHOI is a framework that propagates contextual information in terms of geometric proximity and semantic similarity.<n>It exhibits leading performance on the more challenging Nonverbal Interaction Detection task.
arXiv Detail & Related papers (2025-10-21T07:25:10Z)
Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network [2.223052975765005]
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction. The system represents the 2D or 3D spatial relation of human and objects from the detection results in video data as a graph. We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
arXiv Detail & Related papers (2024-10-10T13:39:17Z)
Skeleton-based Group Activity Recognition via Spatial-Temporal Panoramic Graph [4.075741925017479]
Group Activity Recognition aims to understand collective activities from videos.<n>Existing solutions rely on the RGB modality, which encounters challenges such as background variations.<n>We design a panoramic graph that incorporates multi-person skeletons and objects to encapsulate group activity.
arXiv Detail & Related papers (2024-07-28T13:57:03Z)
Skeletal Human Action Recognition using Hybrid Attention based Graph Convolutional Network [3.261599248682793]
We propose a new adaptive spatial attention layer that extends local attention map to global based on relative distance and relative angle information. We design a new initial graph adjacency matrix that connects head, hands and feet, which shows visible improvement in terms of action recognition accuracy. The proposed model is evaluated on two large-scale and challenging datasets in the field of human activities in daily life.
arXiv Detail & Related papers (2022-07-12T12:22:21Z)
Correlation-Aware Deep Tracking [83.51092789908677]
We propose a novel target-dependent feature network inspired by the self-/cross-attention scheme. Our network deeply embeds cross-image feature correlation in multiple layers of the feature network. Our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods.
arXiv Detail & Related papers (2022-03-03T11:53:54Z)
Spot What Matters: Learning Context Using Graph Convolutional Networks for Weakly-Supervised Action Detection [0.0]
We introduce an architecture based on self-attention and Convolutional Networks to improve human action detection in video. Our model aids explainability by visualizing the learned context as an attention map, even for actions and objects unseen during training. Experimental results show that our contextualized approach outperforms a baseline action detection approach by more than 2 points in Video-mAP.
arXiv Detail & Related papers (2021-07-28T21:37:18Z)
Spatial-Temporal Correlation and Topology Learning for Person Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation. CTL utilizes a CNN backbone and a key-points estimator to extract semantic local features from human body. It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and physical connections of human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
Learning Spatial Context with Graph Neural Network for Multi-Person Pose Grouping [71.59494156155309]
Bottom-up approaches for image-based multi-person pose estimation consist of two stages: keypoint detection and grouping. In this work, we formulate the grouping task as a graph partitioning problem, where we learn the affinity matrix with a Graph Neural Network (GNN) The learned geometry-based affinity is further fused with appearance-based affinity to achieve robust keypoint association.
arXiv Detail & Related papers (2021-04-06T09:21:14Z)
CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images. One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships. We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
Improved Actor Relation Graph based Group Activity Recognition [0.0]
The detailed description of human actions and group activities is essential information, which can be used in real-time CCTV video surveillance, health care, sports video analysis, etc. This study proposes a video understanding method that mainly focused on group activity recognition by learning the pair-wise actor appearance similarity and actor positions.
arXiv Detail & Related papers (2020-10-24T19:46:49Z)
ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of human, action, object> in images. We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.