Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
- URL: http://arxiv.org/abs/2511.12606v1
- Date: Sun, 16 Nov 2025 14:04:12 GMT
- Title: Pixels or Positions? Benchmarking Modalities in Group Activity Recognition
- Authors: Drishya Karki, Merey Ramazanova, Anthony Cioppa, Silvio Giancola, Bernard Ghanem
- Abstract summary: Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports. Other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored. No standardized benchmark exists that aligns broadcast video and tracking data for the same group activities.
- Score: 62.18221200080195
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Group Activity Recognition (GAR) is well studied on the video modality for surveillance and indoor team sports (e.g., volleyball, basketball). Yet, other modalities such as agent positions and trajectories over time, i.e. tracking, remain comparatively under-explored despite being compact, agent-centric signals that explicitly encode spatial interactions. Understanding whether pixel (video) or position (tracking) modalities lead to better group activity recognition is therefore important to drive further research on the topic. However, no standardized benchmark currently exists that aligns broadcast video and tracking data for the same group activities, leading to a lack of apples-to-apples comparison between these modalities for GAR. In this work, we introduce SoccerNet-GAR, a multimodal dataset built from the $64$ matches of the football World Cup 2022. Specifically, the broadcast videos and player tracking modalities for $94{,}285$ group activities are synchronized and annotated with $10$ categories. Furthermore, we define a unified evaluation protocol to benchmark two strong unimodal approaches: (i) a competitive video-based classifier and (ii) a tracking-based classifier leveraging graph neural networks. In particular, our novel role-aware graph architecture for tracking-based GAR directly encodes tactical structure through positional edges and temporal attention. Our tracking model achieves $67.2\%$ balanced accuracy compared to $58.1\%$ for the best video baseline, while training $4.25 \times$ faster with $438 \times$ fewer parameters ($197K$ vs. $86.3M$). This study provides new insights into the relative strengths of pixels and positions for group activity recognition. Overall, it highlights the importance of modality choice and role-aware modeling for GAR.
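The abstract reports balanced accuracy rather than plain accuracy. A minimal sketch of that metric follows, assuming the common definition (the mean of per-class recalls); the paper's own evaluation code is not shown here and may differ in detail. The class names and labels below are hypothetical toy data, not drawn from SoccerNet-GAR. The last lines check the reported parameter ratio using only the two figures given in the abstract.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls, over the classes that appear in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy labels (not from the dataset).
y_true = ["class_a"] * 8 + ["class_b"] * 2
y_pred = ["class_a"] * 8 + ["class_a", "class_b"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_accuracy = balanced_accuracy(y_true, y_pred)

print(f"plain accuracy:    {plain_accuracy:.2f}")
print(f"balanced accuracy: {bal_accuracy:.2f}")

# Parameter counts as stated in the abstract: 86.3M (video) vs. 197K (tracking).
param_ratio = 86.3e6 / 197e3
print(f"parameter ratio:   {param_ratio:.1f}x")
```

On imbalanced data the two metrics diverge: a classifier that favors the majority class can score well on plain accuracy while its balanced accuracy stays low, which is presumably why a per-class average is used for a dataset with $10$ activity categories.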
Related papers
- Towards Universal Modal Tracking with Online Dense Temporal Token Learning [66.83607018706519]
We propose a universal video-level modality-awareness tracking model with online dense temporal token learning. We expand the model's inputs to a video sequence level, aiming to see a richer video context from a near-global perspective.
arXiv Detail & Related papers (2025-07-27T08:47:42Z)
- Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking [52.04679257903805]
Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks.
Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks.
arXiv Detail & Related papers (2024-07-19T07:48:45Z)
- Robot Instance Segmentation with Few Annotations for Grasping [10.005879464111915]
We propose a novel framework that combines Semi-Supervised Learning (SSL) with Learning Through Interaction (LTI). Our approach exploits partially annotated data through self-supervision and incorporates temporal context using pseudo-sequences generated from unlabeled still images. We validate our method on two common benchmarks, ARMBench mix-object-tote and OCID, where it achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-07-01T13:58:32Z)
- Segment Anything Meets Point Tracking [116.44931239508578]
This paper presents a novel method for point-centric interactive video segmentation, empowered by SAM and long-term point tracking.
We highlight the merits of point-based tracking through direct evaluation on the zero-shot open-world Unidentified Video Objects (UVO) benchmark.
Our experiments on popular video object segmentation and multi-object segmentation tracking benchmarks, including DAVIS, YouTube-VOS, and BDD100K, suggest that a point-based segmentation tracker yields better zero-shot performance and efficient interactions.
arXiv Detail & Related papers (2023-07-03T17:58:01Z)
- Dissecting Multimodality in VideoQA Transformer Models by Impairing Modality Fusion [54.33764537135906]
VideoQA Transformer models demonstrate competitive performance on standard benchmarks.
Do these models capture the rich multimodal structures and dynamics from video and text jointly?
Are they achieving high scores by exploiting biases and spurious features?
arXiv Detail & Related papers (2023-06-15T06:45:46Z)
- DECOMPL: Decompositional Learning with Attention Pooling for Group Activity Recognition from a Single Volleyball Image [3.6144103736375857]
Group Activity Recognition (GAR) aims to detect the activity performed by multiple actors in a scene.
We propose a novel GAR technique for volleyball videos, DECOMPL, which consists of two complementary branches.
In the visual branch, it extracts the features using attention pooling in a selective way.
In the coordinate branch, it considers the current configuration of the actors and extracts spatial information from the box coordinates.
arXiv Detail & Related papers (2023-03-11T16:30:51Z)
- A Graph-Based Method for Soccer Action Spotting Using Unsupervised Player Classification [75.93186954061943]
Action spotting involves understanding the dynamics of the game, the complexity of events, and the variation of video sequences.
In this work, we focus on the former by (a) identifying and representing the players, referees, and goalkeepers as nodes in a graph, and by (b) modeling their temporal interactions as sequences of graphs.
For the player identification task, our method obtains an overall performance of 57.83% average-mAP by combining it with other modalities.
arXiv Detail & Related papers (2022-11-22T15:23:53Z)
- COMPOSER: Compositional Learning of Group Activity in Videos [33.526331969279106]
Group Activity Recognition (GAR) detects the activity performed by a group of actors in a short video clip.
We propose COMPOSER, a Multiscale Transformer based architecture that performs attention-based reasoning over tokens at each scale.
COMPOSER achieves a new state-of-the-art 94.5% accuracy with the keypoint-only modality.
arXiv Detail & Related papers (2021-12-11T01:25:46Z)
- Inconsistent Few-Shot Relation Classification via Cross-Attentional Prototype Networks with Contrastive Learning [16.128652726698522]
We propose Prototype Network-based cross-attention contrastive learning (ProtoCACL) to capture the rich mutual interactions between the support set and query set.
Experimental results demonstrate that our ProtoCACL can outperform the state-of-the-art baseline model under both inconsistent $K$ and inconsistent $N$ settings.
arXiv Detail & Related papers (2021-10-13T07:47:13Z)
- Temporally-Weighted Hierarchical Clustering for Unsupervised Action Segmentation [96.67525775629444]
Action segmentation refers to inferring boundaries of semantically consistent visual concepts in videos.
We present a fully automatic and unsupervised approach for segmenting actions in a video that does not require any training.
Our proposal is an effective temporally-weighted hierarchical clustering algorithm that can group semantically consistent frames of the video.
arXiv Detail & Related papers (2021-03-20T23:30:01Z)
- Improved Soccer Action Spotting using both Audio and Video Streams [3.4376560669160394]
We propose a study on combining audio and video information at different stages of deep neural network architectures.
We used the SoccerNet benchmark dataset, which contains annotated events for 500 soccer game videos from the Big Five European leagues.
We observed an average absolute improvement of the mean Average Precision (mAP) metric of $7.43\%$ for the action classification task and of $4.19\%$ for the action spotting task.
arXiv Detail & Related papers (2020-11-09T09:12:44Z)