Dynamic Group Detection using VLM-augmented Temporal Groupness Graph
- URL: http://arxiv.org/abs/2509.04758v1
- Date: Fri, 05 Sep 2025 02:37:01 GMT
- Authors: Kaname Yokoyama, Chihiro Nakatani, Norimichi Ukita
- Abstract summary: This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Our method outperforms state-of-the-art group detection methods on public datasets.
- Score: 15.43013474885794
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper proposes dynamic human group detection in videos. For detecting complex groups, not only the local appearance features of in-group members but also the global context of the scene are important. Such local and global appearance features in each frame are extracted using a Vision-Language Model (VLM) augmented for group detection in our method. For further improvement, the group structure should be consistent over time. While previous methods achieve stability by assuming that groups do not change within a video, our method detects dynamically changing groups through global optimization over a graph built from the groupness probabilities of all frames, which are estimated from our groupness-augmented CLIP features. Our experimental results demonstrate that our method outperforms state-of-the-art group detection methods on public datasets. Code: https://github.com/irajisamurai/VLM-GroupDetection.git
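The abstract does not spell out how groups are read off the per-frame groupness probabilities, so the following is only a toy Python sketch of the general idea, not the authors' method (their code is at the repository linked above). It assumes the same N people are visible in every frame and that an N x N matrix of pairwise groupness probabilities is already available per frame (in the paper these come from the groupness-augmented CLIP features). In place of the paper's graph-based global optimization, it plainly swaps in sliding-window temporal averaging followed by per-frame thresholding and union-find. The names `smooth_groupness`, `groups_in_frame`, `detect_dynamic_groups`, `window` and `threshold` are hypothetical.

```python
from typing import Dict, List

# Hypothetical input format: one N x N matrix of pairwise groupness
# probabilities in [0, 1] per frame, with the same N people in every frame.
Frame = List[List[float]]


def smooth_groupness(probs: List[Frame], window: int = 1) -> List[Frame]:
    """Average each pairwise probability over frames t-window .. t+window.

    A crude stand-in for temporal consistency, NOT the paper's
    graph-based global optimization.
    """
    num_frames = len(probs)
    smoothed: List[Frame] = []
    for t in range(num_frames):
        lo, hi = max(0, t - window), min(num_frames, t + window + 1)
        n = len(probs[t])
        smoothed.append([
            [sum(probs[s][i][j] for s in range(lo, hi)) / (hi - lo) for j in range(n)]
            for i in range(n)
        ])
    return smoothed


def groups_in_frame(p: Frame, threshold: float = 0.5) -> List[List[int]]:
    """Link every pair above `threshold`; groups are the connected components."""
    n = len(p)
    parent = list(range(n))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only; p is assumed symmetric
            if p[i][j] > threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups: Dict[int, List[int]] = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def detect_dynamic_groups(probs: List[Frame], window: int = 1,
                          threshold: float = 0.5) -> List[List[List[int]]]:
    """Return one partition of the people into groups for every frame."""
    return [groups_in_frame(p, threshold) for p in smooth_groupness(probs, window)]


# Hypothetical toy input: 3 people, 5 frames; frame 2 is a one-frame flicker.
A = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.1], [0.1, 0.1, 1.0]]  # persons 0 and 1 together
B = [[1.0, 0.1, 0.1], [0.1, 1.0, 0.9], [0.1, 0.9, 1.0]]  # persons 1 and 2 together
frames = [A, A, B, A, A]
print(detect_dynamic_groups(frames, window=0)[2])  # [[0], [1, 2]]
print(detect_dynamic_groups(frames, window=1)[2])  # [[0, 1], [2]]
```

With `window=0` each frame is decided independently, so the single noisy frame flips the grouping. With `window=1` the flicker is smoothed away, while a change that persists over several frames (for example `[A, A, B, B]`) still shows up as a dynamic regrouping. This is the trade-off that a global optimization over all frames, as described in the abstract, is meant to handle more carefully than a fixed window.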
Related papers
- Unsupervised Feature Selection Through Group Discovery [25.774724891374774]
GroupFS is an end-to-end framework that jointly discovers latent feature groups and selects the most informative groups among them. GroupFS consistently outperforms state-of-the-art unsupervised feature selection (FS) methods in clustering and selects groups of features that align with meaningful patterns.
Date: 2025-11-12T10:05:03Z
- GroupCoOp: Group-robust Fine-tuning via Group Prompt Learning [57.888537648437115]
Group Context Optimization (GroupCoOp) is a simple and effective debiased fine-tuning algorithm. It enhances the group robustness of fine-tuned vision-language models (VLMs). GroupCoOp achieved the best results on five benchmarks across five CLIP architectures.
Date: 2025-09-28T09:54:30Z
- Prompt-Guided Relational Reasoning for Social Behavior Understanding with Vision Foundation Models [8.36651942320007]
Group Activity Detection (GAD) involves recognizing social groups and their collective behaviors in videos. Vision Foundation Models (VFMs), like DinoV2, offer excellent features but are pretrained primarily on object-centric data. We introduce Prompt-driven Group Activity Detection (ProGraD), a method that bridges this gap through 1) learnable group prompts to guide the VFM attention toward social configurations.
Date: 2025-08-11T13:59:22Z
- Group-CLIP Uncertainty Modeling for Group Re-Identification [0.0]
Group ReID aims to match groups of pedestrians across non-overlapping cameras. Most methods rely on certainty-based models, which consider only the specific group structures in the group images. We propose a novel Group-CLIP Uncertainty Modeling (GCUM) approach that adapts group text descriptions to accommodate member and layout variations.
Date: 2025-02-10T13:41:35Z
- Vision Transformer based Random Walk for Group Re-Identification [15.63292108454152]
Group re-identification (re-ID) aims to match groups with the same people under different cameras.
We propose a novel vision transformer based random walk framework for group re-ID.
Date: 2024-10-08T08:41:14Z
- Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition [57.97930719585095]
We introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales.
Our approach is evaluated on various skeleton/language backbones and three large-scale datasets.
The results showcase the universality and superior performance of PURLS, surpassing prior skeleton-based solutions and standard baselines from other domains.
Date: 2024-06-19T08:22:32Z
- Towards Group Robustness in the presence of Partial Group Labels [61.33713547766866]
Spurious correlations between input samples and the target labels wrongly direct the neural network predictions.
We propose an algorithm that optimizes for the worst-off group assignments from a constraint set.
We show improvements in the minority group's performance while preserving overall aggregate accuracy across groups.
Date: 2022-01-10T22:04:48Z
- Focus on the Common Good: Group Distributional Robustness Follows [47.62596240492509]
This paper proposes a new and simple algorithm that explicitly encourages learning of features that are shared across various groups.
While Group-DRO focuses on the groups with the worst regularized loss, focusing instead on groups that enable better performance even on other groups could lead to learning shared/common features.
Date: 2021-10-06T09:47:41Z
- Learning Multi-Attention Context Graph for Group-Based Re-Identification [214.84551361855443]
Learning to re-identify or retrieve a group of people across non-overlapped camera systems has important applications in video surveillance.
In this work, we consider employing context information for identifying groups of people, i.e., group re-id.
We propose a novel unified framework based on graph neural networks to simultaneously address the group-based re-id tasks.
Date: 2021-04-29T09:57:47Z
- Learning Spatial Context with Graph Neural Network for Multi-Person Pose Grouping [71.59494156155309]
Bottom-up approaches for image-based multi-person pose estimation consist of two stages: keypoint detection and grouping.
In this work, we formulate the grouping task as a graph partitioning problem, where we learn the affinity matrix with a Graph Neural Network (GNN).
The learned geometry-based affinity is further fused with appearance-based affinity to achieve robust keypoint association.
Date: 2021-04-06T09:21:14Z
- Overcoming Data Sparsity in Group Recommendation [52.00998276970403]
Group recommender systems should be able to accurately learn not only users' personal preferences but also preference aggregation strategy.
In this paper, we take the Bipartite Graph Embedding Model (BGEM), the self-attention mechanism and Graph Convolutional Networks (GCNs) as basic building blocks to learn group and user representations in a unified way.
Date: 2020-10-02T07:11:19Z