Generating Human-Centric Visual Cues for Human-Object Interaction
Detection via Large Vision-Language Models
- URL: http://arxiv.org/abs/2311.16475v1
- Date: Sun, 26 Nov 2023 09:11:32 GMT
- Title: Generating Human-Centric Visual Cues for Human-Object Interaction
Detection via Large Vision-Language Models
- Authors: Yu-Wei Zhan, Fan Liu, Xin Luo, Liqiang Nie, Xin-Shun Xu, Mohan
Kankanhalli
- Abstract summary: Human-object interaction (HOI) detection aims at detecting human-object pairs and predicting their interactions.
We propose three prompts with VLM to generate human-centric visual cues within an image from multiple perspectives of humans.
We develop a transformer-based multimodal fusion module with multitower architecture to integrate visual cue features into the instance and interaction decoders.
- Score: 59.611697856666304
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-object interaction (HOI) detection aims at detecting human-object pairs
and predicting their interactions. However, the complexity of human behavior
and the diverse contexts in which these interactions occur make it challenging.
Intuitively, human-centric visual cues, such as the involved participants, the
body language, and the surrounding environment, play crucial roles in shaping
these interactions. These cues are particularly vital in interpreting unseen
interactions. In this paper, we propose three prompts with VLM to generate
human-centric visual cues within an image from multiple perspectives of humans.
To capitalize on these rich Human-Centric Visual Cues, we propose a novel
approach named HCVC for HOI detection. Particularly, we develop a
transformer-based multimodal fusion module with multitower architecture to
integrate visual cue features into the instance and interaction decoders. Our
extensive experiments and analysis validate the efficacy of leveraging the
generated human-centric visual cues for HOI detection. Notably, the
experimental results indicate the superiority of the proposed model over the
existing state-of-the-art methods on two widely used datasets.
Related papers
- Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption [64.07607726562841]
Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration.
In this work, we tackle the task of reconstructing closely interactive humans from a monocular video.
We propose to leverage knowledge from proxemic behavior and physics to compensate for the lack of visual information.
arXiv Detail & Related papers (2024-04-17T11:55:45Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- HODN: Disentangling Human-Object Feature for HOI Detection [51.48164941412871]
We propose a Human and Object Disentangling Network (HODN) to model the Human-Object Interaction (HOI) relationships explicitly.
Considering that human features are more contributive to interaction, we propose a Human-Guide Linking method to make sure the interaction decoder focuses on the human-centric regions.
Our proposed method achieves competitive performance on both the V-COCO and the HICO-Det datasets.
arXiv Detail & Related papers (2023-08-20T04:12:50Z)
- Weakly-Supervised HOI Detection from Interaction Labels Only and Language/Vision-Language Priors [36.75629570208193]
Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image.
In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels.
We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model.
Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis
arXiv Detail & Related papers (2023-03-09T19:08:02Z)
- Human-Object Interaction Detection: A Quick Survey and Examination of Methods [17.8805983491991]
This is the first general survey of the state-of-the-art and milestone works in this field.
We provide a basic survey of the developments in the field of human-object interaction detection.
We examine the HORCNN architecture as it is a foundational work in the field.
arXiv Detail & Related papers (2020-09-27T20:58:39Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.