Weakly-Supervised HOI Detection from Interaction Labels Only and
Language/Vision-Language Priors
- URL: http://arxiv.org/abs/2303.05546v1
- Date: Thu, 9 Mar 2023 19:08:02 GMT
- Title: Weakly-Supervised HOI Detection from Interaction Labels Only and
Language/Vision-Language Priors
- Authors: Mesut Erhan Unal and Adriana Kovashka
- Abstract summary: Human-object interaction (HOI) detection aims to extract interacting human-object pairs and their interaction categories from a given natural image.
In this paper, we tackle HOI detection with the weakest supervision setting in the literature, using only image-level interaction labels.
We first propose an approach to prune non-interacting human and object proposals to increase the quality of positive pairs within the bag, exploiting the grounding capability of the vision-language model.
Second, we use a large language model to query which interactions are possible between a human and a given object category, in order to force the model not to put emphasis on unlikely interactions.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-object interaction (HOI) detection aims to extract interacting
human-object pairs and their interaction categories from a given natural image.
Even though the labeling effort required for building HOI detection datasets is
inherently more extensive than for many other computer vision tasks,
weakly-supervised directions in this area have not been sufficiently explored
due to the difficulty of learning human-object interactions with weak
supervision, rooted in the combinatorial nature of interactions over the object
and predicate space. In this paper, we tackle HOI detection with the weakest
supervision setting in the literature, using only image-level interaction
labels, with the help of a pretrained vision-language model (VLM) and a large
language model (LLM). We first propose an approach to prune non-interacting
human and object proposals to increase the quality of positive pairs within the
bag, exploiting the grounding capability of the vision-language model. Second,
we use a large language model to query which interactions are possible between
a human and a given object category, in order to force the model not to put
emphasis on unlikely interactions. Lastly, we use an auxiliary
weakly-supervised preposition prediction task to make our model explicitly
reason about space. Extensive experiments and ablations show that all of our
contributions increase HOI detection performance.
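The second contribution above can be illustrated with a minimal sketch: a language-derived prior over which verbs are plausible for a given object category is turned into a binary mask that zeroes out the scores of unlikely interactions. Everything below is hypothetical for illustration — the verb vocabulary, the `PLAUSIBLE_VERBS` table (which in the paper would come from querying an LLM), and the function names are not from the paper.

```python
import numpy as np

# Hypothetical LLM-derived prior: which verbs are plausible per object.
# In the paper this knowledge is obtained by querying a large language
# model; here it is hard-coded purely for illustration.
PLAUSIBLE_VERBS = {
    "bicycle": {"ride", "hold", "repair"},
    "pizza":   {"eat", "hold", "cut"},
}
VERBS = ["ride", "hold", "repair", "eat", "cut"]

def plausibility_mask(object_category):
    """Binary mask over the verb vocabulary: 1 if the prior deems the
    verb plausible for this object category, else 0."""
    allowed = PLAUSIBLE_VERBS.get(object_category, set(VERBS))
    return np.array([1.0 if v in allowed else 0.0 for v in VERBS])

def masked_pair_scores(pair_logits, object_category):
    """Suppress unlikely interactions by zeroing the per-verb scores
    that the language prior rules out for this object."""
    probs = 1.0 / (1.0 + np.exp(-pair_logits))  # sigmoid over verbs
    return probs * plausibility_mask(object_category)

# A human-"pizza" pair: "ride" and "repair" are implausible for pizza,
# so their scores are zeroed regardless of the detector's logits.
scores = masked_pair_scores(np.array([2.0, 0.5, -1.0, 3.0, 0.0]), "pizza")
```

In training, such a mask would keep the weakly-supervised model from assigning probability mass to object-verb combinations that never co-occur, which is the effect the abstract describes.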
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning
We propose a visual-geometric collaborative guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models on both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z) - A Review of Human-Object Interaction Detection
Human-object interaction (HOI) detection plays a key role in high-level visual understanding.
This paper systematically summarizes and discusses the recent work in image-based HOI detection.
arXiv Detail & Related papers (2024-08-20T08:32:39Z) - Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection
We introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement.
Our proposed method achieves state-of-the-art results in open vocabulary HOI detection.
arXiv Detail & Related papers (2024-04-09T10:27:22Z) - Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z) - Detecting Any Human-Object Interaction Relationship: Universal HOI
Detector with Spatial Prompt Learning on Foundation Models
This study explores the universal interaction recognition in an open-world setting through the use of Vision-Language (VL) foundation models and large language models (LLMs)
Our design includes an HO Prompt-guided Decoder (HOPD), which facilitates the association of high-level relation representations in the foundation model with various HO pairs within the image.
For open-category interaction recognition, our method supports either of two input types: interaction phrase or interpretive sentence.
arXiv Detail & Related papers (2023-11-07T08:27:32Z) - Knowledge Guided Bidirectional Attention Network for Human-Object
Interaction Detection
We argue that the independent use of the bottom-up parsing strategy in HOI is counter-intuitive and could lead to the diffusion of attention.
We introduce a novel knowledge-guided top-down attention into HOI, and propose to model the relation parsing as a "look and search" process.
We implement the process via unifying the bottom-up and top-down attention in a single encoder-decoder based model.
arXiv Detail & Related papers (2022-07-16T16:42:49Z) - INVIGORATE: Interactive Visual Grounding and Grasping in Clutter
INVIGORATE is a robot system that interacts with humans through natural language and grasps a specified object in clutter.
We train separate neural networks for object detection, for visual grounding, for question generation, and for OBR detection and grasping.
We build a partially observable Markov decision process (POMDP) that integrates the learned neural network modules.
arXiv Detail & Related papers (2021-08-25T07:35:21Z) - Human Object Interaction Detection using Two-Direction Spatial
Enhancement and Exclusive Object Prior
Human-Object Interaction (HOI) detection aims to detect visual relations between human and objects in images.
Non-interactive human-object pairs can easily be mis-grouped and misclassified as an action.
We propose a spatial enhancement approach to enforce fine-level spatial constraints in two directions.
arXiv Detail & Related papers (2021-05-07T07:18:27Z) - Detecting Human-Object Interaction via Fabricated Compositional Learning
Human-Object Interaction (HOI) detection is a fundamental task for high-level scene understanding.
Humans have an extremely powerful compositional perception ability that lets them recognize rare or unseen HOI samples.
We propose Fabricated Compositional Learning (FCL) to address the problem of open long-tailed HOI detection.
arXiv Detail & Related papers (2021-03-15T08:52:56Z) - Learning Human-Object Interaction Detection using Interaction Points
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences of its use.