Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection
- URL: http://arxiv.org/abs/2311.01755v1
- Date: Fri, 3 Nov 2023 07:25:57 GMT
- Title: Towards a Unified Transformer-based Framework for Scene Graph Generation
and Human-object Interaction Detection
- Authors: Tao He, Lianli Gao, Jingkuan Song, Yuan-Fang Li
- Abstract summary: We introduce SG2HOI+, a unified one-step model based on the Transformer architecture.
Our approach employs two interactive hierarchical Transformers to seamlessly unify the tasks of SGG and HOI detection.
Our approach achieves competitive performance when compared to state-of-the-art HOI methods.
- Score: 116.21529970404653
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Scene graph generation (SGG) and human-object interaction (HOI) detection are
two important visual tasks aiming at localising and recognising relationships
between objects, and interactions between humans and objects, respectively.
Prevailing works treat them as distinct tasks, leading to the
development of task-specific models tailored to individual datasets. However,
we posit that the presence of visual relationships can furnish crucial
contextual and intricate relational cues that significantly augment the
inference of human-object interactions. This motivates us to ask whether there is
an intrinsic relationship between the two tasks, where scene graphs can
serve as a source for inferring human-object interactions. In light of this, we
introduce SG2HOI+, a unified one-step model based on the Transformer
architecture. Our approach employs two interactive hierarchical Transformers to
seamlessly unify the tasks of SGG and HOI detection. Concretely, we first use a
relation Transformer to generate relation triples from a set of
visual features. Subsequently, we employ a second Transformer-based decoder to
predict human-object interactions based on the generated relation triples. A
comprehensive series of experiments conducted across established benchmark
datasets including Visual Genome, V-COCO, and HICO-DET demonstrates the
compelling performance of our SG2HOI+ model in comparison to prevalent
one-stage SGG models. Remarkably, our approach achieves competitive performance
when compared to state-of-the-art HOI methods. Additionally, we observe that
our SG2HOI+ model, jointly trained on both SGG and HOI tasks in an end-to-end
manner, yields substantial improvements on both tasks compared to training on
each task separately.
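The idea that scene-graph triples can serve as a source for HOI predictions can be illustrated at the data level. The sketch below is not the SG2HOI+ model, which learns this mapping with Transformers. It is a hand-written, rule-based stand-in, and the names `Triple`, `PREDICATE_TO_VERB`, and `hoi_candidates` are hypothetical, as are the example labels.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A <subject, predicate, object> relation, as found in a scene graph."""
    subject: str
    predicate: str
    obj: str


# Hypothetical lookup from scene-graph predicates to HOI verbs.
# A learned model would infer this mapping rather than hard-code it.
PREDICATE_TO_VERB = {
    "riding": "ride",
    "holding": "hold",
    "sitting on": "sit_on",
}


def hoi_candidates(scene_graph: List[Triple], human_label: str = "person") -> List[Triple]:
    """Keep triples with a human subject whose predicate maps to an HOI verb."""
    candidates = []
    for triple in scene_graph:
        if triple.subject == human_label and triple.predicate in PREDICATE_TO_VERB:
            verb = PREDICATE_TO_VERB[triple.predicate]
            candidates.append(Triple(triple.subject, verb, triple.obj))
    return candidates


# Toy scene graph with made-up labels.
scene_graph = [
    Triple("person", "riding", "horse"),
    Triple("horse", "on", "grass"),
    Triple("person", "wearing", "hat"),
    Triple("person", "holding", "rope"),
]

for candidate in hoi_candidates(scene_graph):
    print(candidate)
```

Only the triples with a human subject and an action-like predicate survive the filter. The non-human relation (`horse on grass`) and the predicate with no verb mapping (`wearing`) are dropped here, although a learned model could still use them as context.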
Related papers
- S^2Former-OR: Single-Stage Bimodal Transformer for Scene Graph
Generation in OR [52.964721233679406]
Scene graph generation (SGG) of surgical procedures is crucial for enhancing holistic cognitive intelligence in the operating room (OR).
Previous works have primarily relied on multi-stage learning, in which semantic scene graphs depend on intermediate processes such as pose estimation and object detection.
In this study, we introduce a novel single-stage bimodal transformer framework for SGG in the OR, termed S2Former-OR.
arXiv Detail & Related papers (2024-02-22T11:40:49Z)
- Learning Mutual Excitation for Hand-to-Hand and Human-to-Human
Interaction Recognition [22.538114033191313]
We propose a mutual excitation graph convolutional network (me-GCN) by stacking mutual excitation graph convolution layers.
Me-GC learns mutual information in each layer and each stage of graph convolution operations.
Our proposed me-GC outperforms state-of-the-art GCN-based and Transformer-based methods.
arXiv Detail & Related papers (2024-02-04T10:00:00Z)
- Generating Human-Centric Visual Cues for Human-Object Interaction
Detection via Large Vision-Language Models [59.611697856666304]
Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions.
We propose three prompts for a vision-language model (VLM) to generate human-centric visual cues within an image from multiple perspectives of humans.
We develop a transformer-based multimodal fusion module with a multi-tower architecture to integrate visual cue features into the instance and interaction decoders.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- IGFormer: Interaction Graph Transformer for Skeleton-based Human
Interaction Recognition [26.05948629634753]
We propose a novel Interaction Graph Transformer (IGFormer) network for skeleton-based interaction recognition.
IGFormer constructs interaction graphs according to the semantic and distance correlations between the interactive body parts.
We also propose a Semantic Partition Module to transform each human skeleton sequence into a Body-Part-Time sequence.
arXiv Detail & Related papers (2022-07-25T12:11:15Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Cascaded Human-Object Interaction Recognition [175.60439054047043]
We introduce a cascade architecture for multi-stage, coarse-to-fine HOI understanding.
At each stage, an instance localization network progressively refines HOI proposals and feeds them into an interaction recognition network.
With our carefully designed human-centric relation features, these two modules work collaboratively towards effective interaction understanding.
arXiv Detail & Related papers (2020-03-09T17:05:04Z)
This list is automatically generated from the titles and abstracts of the papers in this site.