Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection
- URL: http://arxiv.org/abs/2311.01755v1
- Date: Fri, 3 Nov 2023 07:25:57 GMT
- Title: Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection
- Authors: Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li
- Abstract summary: We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene graph generation (SGG) and human-object interaction (HOI) detection are
two important visual tasks aiming at localising and recognising relationships
between objects, and interactions between humans and objects, respectively.
Prevailing works treat these as distinct tasks, leading to the development of
task-specific models tailored to individual datasets. However,
we posit that the presence of visual relationships can furnish crucial
contextual and intricate relational cues that significantly augment the
inference of human-object interactions. This motivates us to ask whether there
is a natural intrinsic relationship between the two tasks, such that scene
graphs can serve as a source for inferring human-object interactions. In light
of this, we
introduce SG2HOI+, a unified one-step model based on the Transformer
architecture. Our approach employs two interactive hierarchical Transformers to
seamlessly unify the tasks of SGG and HOI detection. Concretely, we initiate a
relation Transformer tasked with generating relation triples from a suite of
visual features. Subsequently, we employ another Transformer-based decoder to
predict human-object interactions based on the generated relation triples. A
comprehensive series of experiments conducted across established benchmark
datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the
compelling performance of our SG2HOI+ model in comparison to prevalent
one-stage SGG models. Remarkably, our approach achieves competitive performance
when compared to state-of-the-art HOI methods. Additionally, we observe that
our SG2HOI+ jointly trained on both SGG and HOI tasks in an end-to-end manner
yields substantial improvements for both tasks compared to individualized
training paradigms.
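The two-decoder hand-off described in the abstract can be pictured with the
minimal sketch below. It is illustrative only: the module sizes, query counts,
and the use of relation embeddings as the HOI decoder's memory are assumptions
made for exposition, not the authors' released implementation.
```python
import torch
import torch.nn as nn

class SG2HOIPlusSketch(nn.Module):
    """Relation decoder first; the HOI decoder then attends to its output."""
    def __init__(self, d_model=256, n_rel_q=100, n_hoi_q=64,
                 n_obj=150, n_pred=50, n_verb=117):
        super().__init__()
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        # Stage 1: relation Transformer decodes triple embeddings from visual tokens.
        self.rel_dec = nn.TransformerDecoder(layer, num_layers=3)
        self.rel_q = nn.Parameter(torch.randn(n_rel_q, d_model))
        self.subj_head = nn.Linear(d_model, n_obj)   # subject class logits
        self.pred_head = nn.Linear(d_model, n_pred)  # predicate class logits
        self.obj_head = nn.Linear(d_model, n_obj)    # object class logits
        # Stage 2: HOI decoder uses the relation embeddings as its memory.
        self.hoi_dec = nn.TransformerDecoder(layer, num_layers=3)
        self.hoi_q = nn.Parameter(torch.randn(n_hoi_q, d_model))
        self.verb_head = nn.Linear(d_model, n_verb)  # interaction logits

    def forward(self, visual_tokens):
        # visual_tokens: (B, N, d_model) features from a backbone + encoder.
        B = visual_tokens.size(0)
        rel_emb = self.rel_dec(self.rel_q.expand(B, -1, -1), visual_tokens)
        triples = (self.subj_head(rel_emb), self.pred_head(rel_emb),
                   self.obj_head(rel_emb))
        hoi_emb = self.hoi_dec(self.hoi_q.expand(B, -1, -1), rel_emb)
        return triples, self.verb_head(hoi_emb)

tokens = torch.randn(2, 196, 256)  # e.g., a 14x14 feature map flattened
sg_logits, hoi_logits = SG2HOIPlusSketch()(tokens)
```
The key design point mirrored here is that the HOI queries cross-attend to the
relation embeddings rather than to raw visual features, which is how
scene-graph context can inform interaction prediction.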
Related papers
- Understanding Spatio-Temporal Relations in Human-Object Interaction using Pyramid Graph Convolutional Network
We propose a novel Pyramid Graph Convolutional Network (PGCN) to automatically recognize human-object interaction.
The system represents the 2D or 3D spatial relations between humans and objects, taken from detection results in video data, as a graph.
We evaluate our model on two challenging datasets in the field of human-object interaction recognition.
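A hypothetical sketch of the graph-construction step this summary describes:
detections in a frame become nodes, and edge weights encode a spatial relation
(here, a simple distance-based coupling; the actual PGCN features differ).
```python
import math

def build_interaction_graph(detections, sigma=100.0):
    """detections: list of (label, x_center, y_center) from a detector."""
    n = len(detections)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            _, xi, yi = detections[i]
            _, xj, yj = detections[j]
            d = math.hypot(xi - xj, yi - yj)
            adj[i][j] = math.exp(-d / sigma)  # closer pairs couple strongly
    return adj

frame = [("person", 120, 200), ("cup", 140, 180), ("table", 300, 350)]
print(build_interaction_graph(frame))
```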
arXiv Detail & Related papers (2024-10-10T13:39:17Z)
- S^2Former-OR: Single-Stage Bi-Modal Transformer for Scene Graph Generation in OR
Scene graph generation (SGG) of surgical procedures is crucial in enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, where the generated semantic scene graphs depend on intermediate processes with pose estimation and object detection.
In this study, we introduce a novel single-stage bi-modal transformer framework for SGG in the OR, termed S^2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Learning Mutual Excitation for Hand-to-Hand and Human-to-Human Interaction Recognition
We propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution layers.
me-GC learns mutual information in each layer and at each stage of the graph convolution operations.
Our proposed me-GC outperforms state-of-the-art GCN-based and Transformer-based methods.
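One plausible reading of "mutual excitation" in code form, with shapes and the
gating function chosen for illustration rather than taken from the paper:
```python
import torch
import torch.nn as nn

class MutualExcitation(nn.Module):
    """Two entity streams exchange channel-wise gates from each other."""
    def __init__(self, channels):
        super().__init__()
        self.gate_a = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())
        self.gate_b = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, feat_a, feat_b):
        # feat_a, feat_b: (B, C) pooled features of two interacting entities.
        excited_a = feat_a * self.gate_a(feat_b)  # b excites a
        excited_b = feat_b * self.gate_b(feat_a)  # a excites b
        return excited_a, excited_b

a, b = torch.randn(4, 64), torch.randn(4, 64)
out_a, out_b = MutualExcitation(64)(a, b)
```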
arXiv Detail & Related papers (2024-02-04T10:00:00Z)
- Unified Visual Relationship Detection with Vision and Language Models
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
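The union-of-label-spaces idea can be made concrete with a small sketch; the
mapping below is a simple string-level merge for illustration, not UniVRD's
vision-language alignment:
```python
def build_unified_label_space(dataset_labels):
    """dataset_labels: dict mapping dataset name -> list of relation labels."""
    unified, index = [], {}
    for labels in dataset_labels.values():
        for label in labels:
            if label not in index:  # deduplicate labels shared across datasets
                index[label] = len(unified)
                unified.append(label)
    return unified, index

labels = {"hico-det": ["ride", "hold"], "visual-genome": ["on", "hold"]}
unified, index = build_unified_label_space(labels)
print(unified)  # ['ride', 'hold', 'on'] -- 'hold' is merged, not duplicated
```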
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition
We propose a novel Interaction Graph Transformer (IGFormer) network for skeleton-based interaction recognition.
IGFormer constructs interaction graphs according to the semantic and distance correlations between the interactive body parts.
We also propose a Semantic Partition Module to transform each human skeleton sequence into a Body-Part-Time sequence.
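A simplified illustration of mapping a skeleton sequence to a Body-Part-Time
sequence; the part grouping below is hypothetical, not IGFormer's actual
partition:
```python
import torch

PARTS = {"head": [0, 1], "torso": [2, 3], "l_arm": [4, 5],
         "r_arm": [6, 7], "l_leg": [8, 9], "r_leg": [10, 11]}

def to_body_part_time(skeleton):
    # skeleton: (T, J, C) joint coordinates over T frames.
    tokens = [skeleton[:, idx].mean(dim=1) for idx in PARTS.values()]
    return torch.stack(tokens, dim=1)  # one token per (frame, body part)

seq = torch.randn(32, 12, 3)         # 32 frames, 12 joints, xyz
print(to_body_part_time(seq).shape)  # torch.Size([32, 6, 3])
```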
arXiv Detail & Related papers (2022-07-25T12:11:15Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
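A rough sketch of an appearance-free spatial-semantic pair code in the spirit
of this summary; the layout features and embedding choice are placeholders,
not DRG's exact design:
```python
import torch

def spatial_semantic_code(h_box, o_box, obj_embedding):
    # Boxes as (x1, y1, x2, y2) floats; layout is relative to the human box.
    hx, hy = (h_box[0] + h_box[2]) / 2, (h_box[1] + h_box[3]) / 2
    ox, oy = (o_box[0] + o_box[2]) / 2, (o_box[1] + o_box[3]) / 2
    hw, hh = h_box[2] - h_box[0], h_box[3] - h_box[1]
    layout = torch.tensor([(ox - hx) / hw, (oy - hy) / hh])
    return torch.cat([layout, obj_embedding])  # no appearance features used

code = spatial_semantic_code((0., 0., 100., 200.), (80., 50., 140., 110.),
                             torch.randn(300))  # e.g., an object word vector
print(code.shape)  # torch.Size([302])
```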
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> triplets in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
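A minimal sketch of scoring HOIs by similarity in a joint visual-semantic
embedding space, as the summary describes; the projection sizes and cosine
scoring are illustrative assumptions, not ConsNet's exact heads:
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbeddingScorer(nn.Module):
    def __init__(self, visual_dim=1024, text_dim=300, joint_dim=256):
        super().__init__()
        self.vis_proj = nn.Linear(visual_dim, joint_dim)  # pair features
        self.txt_proj = nn.Linear(text_dim, joint_dim)    # HOI label embeddings

    def forward(self, pair_feats, label_embs):
        v = F.normalize(self.vis_proj(pair_feats), dim=-1)  # (P, D)
        t = F.normalize(self.txt_proj(label_embs), dim=-1)  # (L, D)
        return v @ t.T  # cosine similarity: one score per (pair, HOI label)

scores = JointEmbeddingScorer()(torch.randn(5, 1024), torch.randn(600, 300))
print(scores.shape)  # torch.Size([5, 600]) -- 600 HOI categories in HICO-DET
```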
arXiv Detail & Related papers (2020-08-14T09:11:18Z)
- Cascaded Human-Object Interaction Recognition
We introduce a cascade architecture for a multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully-designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
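A schematic reading of the cascade described in the entry above, with stand-in
linear modules in place of the paper's localization and recognition networks:
```python
import torch
import torch.nn as nn

class CascadeHOISketch(nn.Module):
    def __init__(self, dim=256, n_stages=3, n_verbs=117):
        super().__init__()
        self.refiners = nn.ModuleList(
            nn.Linear(dim, dim) for _ in range(n_stages))
        self.recognizers = nn.ModuleList(
            nn.Linear(dim, n_verbs) for _ in range(n_stages))

    def forward(self, pair_feats):
        # pair_feats: (N, dim) features of candidate human-object pairs.
        logits = None
        for refine, recognize in zip(self.refiners, self.recognizers):
            pair_feats = torch.relu(refine(pair_feats)) + pair_feats  # refine
            logits = recognize(pair_feats)  # recognize at each stage
        return logits  # predictions from the final (finest) stage

logits = CascadeHOISketch()(torch.randn(10, 256))
```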