MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes
- URL: http://arxiv.org/abs/2509.13484v2
- Date: Thu, 18 Sep 2025 14:03:41 GMT
- Title: MINGLE: VLMs for Semantically Complex Region Detection in Urban Scenes
- Authors: Liu Liu, Alexandra Kudaeva, Marco Cipriano, Fatimeh Al Ghannam, Freya Tan, Gerard de Melo, Andres Sevtsuk
- Abstract summary: Group-level social interactions in public spaces are crucial for urban planning. We introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by interpersonal relations. We propose MINGLE, a modular three-stage pipeline that integrates human detection and depth estimation, VLM-based reasoning to classify pairwise social affiliation, and a lightweight spatial aggregation algorithm to localize socially connected groups. We present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups.
- Score: 49.89767522399176
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Understanding group-level social interactions in public spaces is crucial for urban planning, informing the design of socially vibrant and inclusive environments. Detecting such interactions from images involves interpreting subtle visual cues such as relations, proximity, and co-movement - semantically complex signals that go beyond traditional object detection. To address this challenge, we introduce a social group region detection task, which requires inferring and spatially grounding visual regions defined by abstract interpersonal relations. We propose MINGLE (Modeling INterpersonal Group-Level Engagement), a modular three-stage pipeline that integrates: (1) off-the-shelf human detection and depth estimation, (2) VLM-based reasoning to classify pairwise social affiliation, and (3) a lightweight spatial aggregation algorithm to localize socially connected groups. To support this task and encourage future research, we present a new dataset of 100K urban street-view images annotated with bounding boxes and labels for both individuals and socially interacting groups. The annotations combine human-created labels and outputs from the MINGLE pipeline, ensuring semantic richness and broad coverage of real-world scenarios.
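The abstract does not specify how the stage-3 spatial aggregation works; a minimal sketch, assuming it amounts to taking connected components over the VLM's pairwise affiliation decisions (a hypothetical reading, not the authors' actual algorithm), could look like this: affiliated pairs are merged with union-find, and each resulting component of two or more people yields one enclosing group box.

```python
# Hypothetical sketch of a stage-3 spatial aggregation step:
# given per-person bounding boxes and pairwise affiliations
# predicted by a VLM, merge affiliated pairs into connected
# components and return one enclosing box per social group.

def find(parent, i):
    # Path-compressing find for union-find.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i

def aggregate_groups(boxes, affiliated_pairs):
    """boxes: list of (x1, y1, x2, y2) tuples, one per detected person.
    affiliated_pairs: iterable of (i, j) index pairs classified
    as socially affiliated."""
    parent = list(range(len(boxes)))
    for i, j in affiliated_pairs:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[rj] = ri  # union the two components
    # Collect members of each connected component.
    groups = {}
    for idx in range(len(boxes)):
        groups.setdefault(find(parent, idx), []).append(idx)
    # A social group needs at least two affiliated people;
    # its region is the box enclosing all member boxes.
    result = []
    for members in groups.values():
        if len(members) < 2:
            continue
        xs1, ys1, xs2, ys2 = zip(*(boxes[m] for m in members))
        result.append((min(xs1), min(ys1), max(xs2), max(ys2)))
    return result
```

For example, two adjacent pedestrians linked by one affiliated pair collapse into a single group box, while an unaffiliated bystander produces no group.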
Related papers
- Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection [82.70752567211251]
We propose a part-aware bottom-up group reasoning framework for fine-grained social interaction detection. The proposed method infers social groups and their interactions using body part features and their interpersonal relations. Our model first detects individuals and enhances their features using part-aware cues, and then infers group configuration by associating individuals via similarity-based reasoning.
arXiv Detail & Related papers (2025-11-05T17:33:03Z) - Learning Human-Object Interaction as Groups [52.28258599873394]
GroupHOI is a framework that propagates contextual information in terms of geometric proximity and semantic similarity. It exhibits leading performance on the more challenging Nonverbal Interaction Detection task.
arXiv Detail & Related papers (2025-10-21T07:25:10Z) - What-Meets-Where: Unified Learning of Action and Contact Localization in a New Dataset [6.6946566008924036]
We introduce a novel vision task that simultaneously predicts high-level action semantics and fine-grained body-part contact regions. We present PaIR (Part-aware Interaction Representation), a comprehensive dataset containing 13,979 images that encompass 654 actions, 80 object categories, and 17 body parts.
arXiv Detail & Related papers (2025-08-13T02:06:33Z) - Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - Grounding 3D Scene Affordance From Egocentric Interactions [52.5827242925951]
Grounding 3D scene affordance aims to locate interactive regions in 3D environments.
We introduce a novel task: grounding 3D scene affordance from egocentric interactions.
arXiv Detail & Related papers (2024-09-29T10:46:19Z) - Multi-Temporal Relationship Inference in Urban Areas [75.86026742632528]
Finding temporal relationships among locations can benefit many urban applications, such as dynamic offline advertising and smart public transport planning.
We propose a solution to Trial with a graph learning scheme, which includes a spatially evolving graph neural network (SEENet).
SEConv performs the intra-time aggregation and inter-time propagation to capture the multifaceted spatially evolving contexts from the view of location message passing.
SE-SSL designs time-aware self-supervised learning tasks in a global-local manner with an additional evolving constraint to enhance location representation learning and further handle relationship sparsity.
arXiv Detail & Related papers (2023-06-15T07:48:32Z) - Monitoring Social-distance in Wide Areas during Pandemics: a Density Map and Segmentation Approach [0.0]
We propose a new framework for monitoring social distance using end-to-end deep learning.
Our framework constructs a new ground truth based on ground-truth density maps.
We show that our framework performs well at identifying zones where people are not maintaining social distance, even when they are heavily occluded or far from the camera.
arXiv Detail & Related papers (2021-04-07T19:26:26Z) - DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z) - Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.