Related papers: Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models

URL: http://arxiv.org/abs/2508.07996v1
Date: Mon, 11 Aug 2025 13:59:22 GMT
Title: Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models
Authors: Thinesh Thiyakesan Ponbagavathi, Chengzheng Yang, Alina Roitberg,
Abstract summary: Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data. We introduce Prompt-driven Group Activity Detection (ProGraD) -- a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations.
Score: 8.36651942320007
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features, but are pretrained primarily on object-centric data and remain underexplored for modeling group dynamics. While they are a promising alternative to highly task-specific GAD architectures that require full fine-tuning, our initial investigation reveals that simply swapping CNN backbones used in these methods with VFMs brings little gain, underscoring the need for structured, group-aware reasoning on top. We introduce Prompt-driven Group Activity Detection (ProGraD) -- a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations, and 2) a lightweight two-layer GroupContext Transformer that infers actor-group associations and collective behavior. We evaluate our approach on two recent GAD benchmarks: Cafe, which features multiple concurrent social groups, and Social-CAD, which focuses on single-group interactions. While we surpass state-of-the-art in both settings, our method is especially effective in complex multi-group scenarios, where we yield a gain of 6.5% (Group mAP@1.0) and 8.2% (Group mAP@0.5) using only 10M trainable parameters. Furthermore, our experiments reveal that ProGraD produces interpretable attention maps, offering insights into actor-group reasoning. Code and models will be released.
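
The abstract reports Group mAP at thresholds 1.0 and 0.5 but does not define the metric. The sketch below assumes the threshold applies to a set-level IoU between predicted and ground-truth group members, so 1.0 would demand an exact membership match and 0.5 would tolerate partial overlap. The function names and actor IDs are hypothetical, and activity-class matching and the average-precision computation are left out.

```python
def group_iou(pred_members, gt_members):
    """Intersection-over-union between two collections of actor IDs."""
    pred, gt = set(pred_members), set(gt_members)
    union = pred | gt
    if not union:
        # Two empty groups: treat as no overlap (an assumption of this sketch).
        return 0.0
    return len(pred & gt) / len(union)


def is_match(pred_members, gt_members, threshold):
    """True if the predicted group overlaps the ground truth at or above threshold."""
    return group_iou(pred_members, gt_members) >= threshold


# Hypothetical example: the prediction misses one of four group members.
pred = [1, 2, 3]
gt = [1, 2, 3, 4]
print(group_iou(pred, gt))      # 0.75
print(is_match(pred, gt, 0.5))  # True
print(is_match(pred, gt, 1.0))  # False
```

Under this reading, a single missed member still counts at the 0.5 threshold but fails at 1.0, which would make the stricter threshold the harder one in scenes with several concurrent groups.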

Related papers

Beyond the Individual: Introducing Group Intention Forecasting with SHOT Dataset [32.9983492637077]
Group intention represents shared goals emerging through the actions of multiple individuals. Group Intention Forecasting (GIF) is a novel task that forecasts when group intentions will occur by analyzing individual actions and interactions. SHOT is the first large-scale dataset for GIF, consisting of 1,979 basketball video clips captured from 5 camera views. GIFT is a framework that extracts fine-grained individual features and models evolving group dynamics to forecast intention emergence.
arXiv Detail & Related papers (2025-09-25T03:28:01Z)
Hierarchical Multi-Graphs Learning for Robust Group Re-Identification [28.79580663619657]
Group Re-identification (G-ReID) faces greater complexity than individual Re-identification (ReID). Prior graph-based approaches have aimed to capture these dynamics by modeling the group as a single topological structure. We introduce a Hierarchical Multi-Graphs Learning framework to address these challenges.
arXiv Detail & Related papers (2024-12-25T03:33:43Z)
Towards More Practical Group Activity Detection: A New Benchmark and Model [61.39427407758131]
Group activity detection (GAD) is the task of identifying members of each group and classifying the activity of the group at the same time in a video. We present a new dataset, dubbed Café, which presents more practical scenarios and metrics. We also propose a new GAD model that deals with an unknown number of groups and latent group members efficiently and effectively.
arXiv Detail & Related papers (2023-12-05T16:48:17Z)
USER: Unified Semantic Enhancement with Momentum Contrast for Image-Text Retrieval [115.28586222748478]
Image-Text Retrieval (ITR) aims at searching for the target instances that are semantically relevant to the given query from the other modality. Existing approaches typically suffer from two major limitations.
arXiv Detail & Related papers (2023-01-17T12:42:58Z)
Ranking-based Group Identification via Factorized Attention on Social Tripartite Graph [68.08590487960475]
We propose a novel GNN-based framework named Contextualized Factorized Attention for Group identification (CFAG). We devise tripartite graph convolution layers to aggregate information from different types of neighborhoods among users, groups, and items. To cope with the data sparsity issue, we devise a novel propagation augmentation layer, which is based on our proposed factorized attention mechanism.
arXiv Detail & Related papers (2022-11-02T01:42:20Z)
Graph Neural Netwrok with Interaction Pattern for Group Recommendation [1.066048003460524]
We propose the model GIP4GR (Graph Neural Network with Interaction Pattern For Group Recommendation). Specifically, our model uses the graph neural network framework, with its powerful representation capabilities, to represent the interactions among groups, users, and items in the topological structure of the graph. We conducted extensive experiments on two real-world datasets to illustrate the superior performance of our model.
arXiv Detail & Related papers (2021-09-21T13:42:46Z)
Double-Scale Self-Supervised Hypergraph Learning for Group Recommendation [35.841350982832545]
Group recommendation suffers seriously from the problem of data sparsity. We propose a self-supervised hypergraph learning framework for group recommendation to achieve two goals.
arXiv Detail & Related papers (2021-09-09T12:19:49Z)
Learning Multi-Attention Context Graph for Group-Based Re-Identification [214.84551361855443]
Learning to re-identify or retrieve a group of people across non-overlapped camera systems has important applications in video surveillance. In this work, we consider employing context information for identifying groups of people, i.e., group re-id. We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks.
arXiv Detail & Related papers (2021-04-29T09:57:47Z)
CoADNet: Collaborative Aggregation-and-Distribution Networks for Co-Salient Object Detection [91.91911418421086]
Co-Salient Object Detection (CoSOD) aims at discovering salient objects that repeatedly appear in a given query group containing two or more relevant images. One challenging issue is how to effectively capture co-saliency cues by modeling and exploiting inter-image relationships. We present an end-to-end collaborative aggregation-and-distribution network (CoADNet) to capture both salient and repetitive visual patterns from multiple images.
arXiv Detail & Related papers (2020-11-10T04:28:11Z)
Overcoming Data Sparsity in Group Recommendation [52.00998276970403]
Group recommender systems should be able to accurately learn not only users' personal preferences but also the preference aggregation strategy. In this paper, we take the Bipartite Graph Embedding Model (BGEM), the self-attention mechanism and Graph Convolutional Networks (GCNs) as basic building blocks to learn group and user representations in a unified way.
arXiv Detail & Related papers (2020-10-02T07:11:19Z)

This list is automatically generated from the titles and abstracts of the papers on this site.