Related papers: Explicit Multimodal Graph Modeling for Human-Object Interaction Detection

URL: http://arxiv.org/abs/2509.12554v1
Date: Tue, 16 Sep 2025 01:17:49 GMT
Title: Explicit Multimodal Graph Modeling for Human-Object Interaction Detection
Authors: Wenxuan Ji, Haichao Shi, Xiao-Yu Zhang
Abstract summary: Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. We propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection.
Score: 11.15526365654911
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based methods have recently become the prevailing approach for Human-Object Interaction (HOI) detection. However, the Transformer architecture does not explicitly model the relational structures inherent in HOI detection, which impedes the recognition of interactions. In contrast, Graph Neural Networks (GNNs) are inherently better suited for this task, as they explicitly model the relationships between human-object pairs. Therefore, in this paper, we propose Multimodal Graph Network Modeling (MGNM) that leverages GNN-based relational structures to enhance HOI detection. Specifically, we design a multimodal graph network framework that explicitly models the HOI task in a four-stage graph structure. Furthermore, we introduce a multi-level feature interaction mechanism within our graph network. This mechanism leverages multi-level vision and language features to enhance information propagation across human-object pairs. Consequently, our proposed MGNM achieves state-of-the-art performance on two widely used benchmarks: HICO-DET and V-COCO. Moreover, when integrated with a more advanced object detector, our method demonstrates a significant performance gain and maintains an effective balance between rare and non-rare classes.

Related papers

THeGAU: Type-Aware Heterogeneous Graph Autoencoder and Augmentation [16.50144638827504]
Heterogeneous Graph Neural Networks (HGNNs) are effective for modeling Heterogeneous Information Networks (HINs). HGNNs often suffer from type information loss and structural noise, limiting their representational fidelity and generalization. We propose THeGAU, a model-agnostic framework that combines a type-aware graph autoencoder with guided graph augmentation to improve node classification.
arXiv Detail & Related papers (2025-12-11T12:30:42Z)
Hypergraph Neural Network with State Space Models for Node Classification [0.0]
We propose a novel hypergraph neural network with state space model (HGMN). HGMN effectively integrates role-aware representations into GNNs and the state space model. The model achieves significant performance improvements on node classification tasks compared to state-of-the-art GNN methods.
arXiv Detail & Related papers (2025-08-08T04:54:12Z)
Multi-Granular Attention based Heterogeneous Hypergraph Neural Network [5.580244361093485]
Heterogeneous graph neural networks (HeteGNNs) have demonstrated strong abilities to learn node representations. This paper proposes MGA-HHN, a Multi-Granular Attention based Heterogeneous Hypergraph Neural Network for representation learning.
arXiv Detail & Related papers (2025-05-07T11:42:00Z)
Overlap-aware meta-learning attention to enhance hypergraph neural networks for node classification [7.822666400307049]
We propose a novel framework, overlap-aware meta-learning attention for hypergraph neural networks (OMA-HGNN). First, we introduce a hypergraph attention mechanism that integrates both structural and feature similarities. Specifically, we linearly combine their respective losses with weighted factors for the HGNN model. Second, we partition nodes into different tasks based on their diverse overlap levels and develop a multi-task Meta-Weight-Net (MWN) to determine the corresponding weighted factors. Third, we jointly train the internal MWN model with the losses from the external HGNN model and train the external model with the weighted factors.
arXiv Detail & Related papers (2025-03-11T01:38:39Z)
Scalable Weibull Graph Attention Autoencoder for Modeling Document Networks [50.42343781348247]
We develop a graph Poisson factor analysis (GPFA) which provides analytic conditional posteriors to improve the inference accuracy. We also extend GPFA to a multi-stochastic-layer version named graph Poisson gamma belief network (GPGBN) to capture the hierarchical document relationships at multiple semantic levels. Our models can extract high-quality hierarchical latent document representations and achieve promising performance on various graph analytic tasks.
arXiv Detail & Related papers (2024-10-13T02:22:14Z)
Towards a Unified Transformer-based Framework for Scene Graph Generation and Human-object Interaction Detection [116.21529970404653]
We introduce SG2HOI+, a unified one-step model based on the Transformer architecture. Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection. Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
arXiv Detail & Related papers (2023-11-03T07:25:57Z)
Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
Soft Hierarchical Graph Recurrent Networks for Many-Agent Partially Observable Environments [9.067091068256747]
We propose a novel network structure called hierarchical graph recurrent network (HGRN) for multi-agent cooperation under partial observability. Based on the above technologies, we proposed a value-based MADRL algorithm called Soft-HGRN and its actor-critic variant named SAC-HGRN.
arXiv Detail & Related papers (2021-09-05T09:51:25Z)
A Graph-based Interactive Reasoning for Human-Object Interaction Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs. We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet. Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
Multi-View Graph Neural Networks for Molecular Property Prediction [67.54644592806876]
We present Multi-View Graph Neural Network (MV-GNN), a multi-view message passing architecture. In MV-GNN, we introduce a shared self-attentive readout component and disagreement loss to stabilize the training process. We further boost the expressive power of MV-GNN by proposing a cross-dependent message passing scheme.
arXiv Detail & Related papers (2020-05-17T04:46:07Z)
Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding. At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network. With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
Graph Representation Learning via Graphical Mutual Information Maximization [86.32278001019854]
We propose a novel concept, Graphical Mutual Information (GMI), to measure the correlation between input graphs and high-level hidden representations. We develop an unsupervised learning model trained by maximizing GMI between the input and output of a graph neural encoder.
arXiv Detail & Related papers (2020-02-04T08:33:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.