Related papers: Contextualized Representation Learning for Effective Human-Object Interaction Detection

Contextualized Representation Learning for Effective Human-Object Interaction Detection

URL: http://arxiv.org/abs/2509.12784v2
Date: Fri, 03 Oct 2025 06:33:53 GMT
Title: Contextualized Representation Learning for Effective Human-Object Interaction Detection
Authors: Zhehao Li, Yucheng Qian, Chong Wang, Yinghao Lu, Zhihao Yang, Jiafei Wu,
Abstract summary: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions.<n>We introduce a Contextualized Representation Learning that integrates both affordance-guided reasoning and contextual prompts.<n>Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios.
Score: 17.242400169885453
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Human-Object Interaction (HOI) detection aims to simultaneously localize human-object pairs and recognize their interactions. While recent two-stage approaches have made significant progress, they still face challenges due to incomplete context modeling. In this work, we introduce a Contextualized Representation Learning that integrates both affordance-guided reasoning and contextual prompts with visual cues to better capture complex interactions. We enhance the conventional HOI detection framework by expanding it beyond simple human-object pairs to include multivariate relationships involving auxiliary entities like tools. Specifically, we explicitly model the functional role (affordance) of these auxiliary objects through triplet structures <human, tool, object>. This enables our model to identify tool-dependent interactions such as 'filling'. Furthermore, the learnable prompt is enriched with instance categories and subsequently integrated with contextual visual features using an attention mechanism. This process aligns language with image content at both global and regional levels. These contextualized representations equip the model with enriched relational cues for more reliable reasoning over complex, context-dependent interactions. Our proposed method demonstrates superior performance on both the HICO-Det and V-COCO datasets in most scenarios. The source code is available at https://github.com/lzzhhh1019/CRL.

Related papers

Learning Human-Object Interaction as Groups [52.28258599873394]
GroupHOI is a framework that propagates contextual information in terms of geometric proximity and semantic similarity.<n>It exhibits leading performance on the more challenging Nonverbal Interaction Detection task.
arXiv Detail & Related papers (2025-10-21T07:25:10Z)
Dynamic Scoring with Enhanced Semantics for Training-Free Human-Object Interaction Detection [51.52749744031413]
Human-Object Interaction (HOI) detection aims to identify humans and objects within images and interpret their interactions.<n>Existing HOI methods rely heavily on large datasets with manual annotations to learn interactions from visual cues.<n>We propose a novel training-free HOI detection framework for Dynamic Scoring with enhanced semantics.
arXiv Detail & Related papers (2025-07-23T12:30:19Z)
Geometric Visual Fusion Graph Neural Networks for Multi-Person Human-Object Interaction Recognition in Videos [14.198003271084799]
Human-Object Interaction (HOI) recognition in videos requires understanding both visual patterns and geometric relationships as they evolve over time.<n>We propose the Geometric Visual Fusion Graph Neural Network (GeoVis-GNN), which uses dual-attention feature fusion combined with interdependent entity graph learning.<n>To advance HOI recognition to real-world scenarios, we introduce the Concurrent Partial Interaction dataset.
arXiv Detail & Related papers (2025-06-03T22:51:44Z)
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning [27.511627003202538]
Traditional scene graphs primarily focus on spatial relationships, limiting vision-language models' (VLMs) ability to reason about complex interactions in visual scenes.<n>This paper addresses two key challenges: (1) conventional detection-to-construction methods produce unfocused, contextually irrelevant relationship sets, and (2) existing approaches fail to form persistent memories for generalizing interaction reasoning to new scenes.<n>We propose Interaction-augmented Scene Graph Reasoning (ISGR), a framework that enhances VLMs' interactional reasoning through three complementary components.
arXiv Detail & Related papers (2025-05-14T04:04:23Z)
Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues. Our method outperforms the representative models regarding objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
Towards Zero-shot Human-Object Interaction Detection via Vision-Language Integration [14.678931157058363]
We propose a novel framework, termed Knowledge Integration to HOI (KI2HOI), that effectively integrates the knowledge of visual-language model to improve zero-shot HOI detection. We develop an effective additive self-attention mechanism to generate more comprehensive visual representations. Our model outperforms the previous methods in various zero-shot and full-supervised settings.
arXiv Detail & Related papers (2024-03-12T02:07:23Z)
Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding. Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction. Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection. We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
Detecting Any Human-Object Interaction Relationship: Universal HOI Detector with Spatial Prompt Learning on Foundation Models [55.20626448358655]
This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs) Our design includes an HO Prompt-guided Decoder (HOPD), facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image. For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)

This list is automatically generated from the titles and abstracts of the papers in this site.