Mining Conditional Part Semantics with Occluded Extrapolation for
Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2307.10499v2
- Date: Mon, 13 Nov 2023 09:23:53 GMT
- Title: Mining Conditional Part Semantics with Occluded Extrapolation for
Human-Object Interaction Detection
- Authors: Guangzhi Wang, Yangyang Guo, Mohan Kankanhalli
- Abstract summary: Human-Object Interaction Detection is a crucial aspect of human-centric scene understanding.
Existing methods try to use human-related clues to alleviate the difficulty, but rely heavily on external annotations or knowledge.
We propose a novel Part Semantic Network (PSN) to solve this problem.
- Score: 16.9278983497498
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human-Object Interaction Detection is a crucial aspect of human-centric scene
understanding, with important applications in various domains. Despite recent
progress in this field, recognizing subtle and detailed interactions remains
challenging. Existing methods try to use human-related clues to alleviate the
difficulty, but rely heavily on external annotations or knowledge, limiting
their practical applicability in real-world scenarios. In this work, we propose
a novel Part Semantic Network (PSN) to solve this problem. The core of PSN is a
Conditional Part Attention (CPA) mechanism, where human features serve as the
keys and values and the object feature serves as the query in a cross-attention
computation. In this way, our model learns to automatically
focus on the most informative human parts conditioned on the involved object,
generating more semantically meaningful features for interaction recognition.
Additionally, we propose an Occluded Part Extrapolation (OPE) strategy to
facilitate interaction recognition under occlusion, which teaches the
model to extrapolate detailed features from partially occluded ones. Our method
consistently outperforms prior approaches on the V-COCO and HICO-DET datasets,
without external data or extra annotations. Additional ablation studies
validate the effectiveness of each component of our proposed method.
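The CPA mechanism described in the abstract is standard cross-attention with fixed roles: the object feature forms the query, and the human part features form the keys and values, so the attention weights pick out object-relevant parts. Below is a minimal single-head sketch in PyTorch; the module name, tensor shapes, and single-head formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ConditionalPartAttention(nn.Module):
    """Sketch of CPA: object feature as query, human part features as keys/values."""

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, object_feat: torch.Tensor, human_feats: torch.Tensor) -> torch.Tensor:
        # object_feat: (B, D) -- one feature per human-object pair
        # human_feats: (B, N, D) -- N human part-level tokens
        q = self.q_proj(object_feat).unsqueeze(1)      # (B, 1, D)
        k = self.k_proj(human_feats)                   # (B, N, D)
        v = self.v_proj(human_feats)                   # (B, N, D)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, 1, N): object scores each part
        attn = attn.softmax(dim=-1)
        return (attn @ v).squeeze(1)                   # (B, D): object-conditioned part feature
```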
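OPE is described only as teaching the model to extrapolate detailed features from partially occluded ones. One plausible training-time reading is a masked-consistency objective; in the sketch below, random token masking as synthetic occlusion and the MSE loss are assumptions of mine, not details from the paper.

```python
import torch
import torch.nn.functional as F

def ope_consistency_loss(cpa, object_feat, human_feats, mask_ratio=0.3):
    # Hypothetical OPE step: drop a random subset of human part tokens to mimic
    # occlusion, then pull the occluded CPA output toward the fully visible one,
    # so the model learns to extrapolate features for missing parts.
    B, N, _ = human_feats.shape
    with torch.no_grad():  # fully visible feature acts as the (fixed) target
        target = cpa(object_feat, human_feats)
    keep = (torch.rand(B, N, 1, device=human_feats.device) > mask_ratio).float()
    occluded = cpa(object_feat, human_feats * keep)
    return F.mse_loss(occluded, target)
```

Under this reading, the auxiliary loss is simply added to the interaction-classification objective during training and dropped at inference, so occlusion robustness costs nothing at test time.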
Related papers
- Visual-Geometric Collaborative Guidance for Affordance Learning [63.038406948791454]
We propose a visual-geometric collaboratively guided affordance learning network that incorporates visual and geometric cues.
Our method outperforms representative models in both objective metrics and visual quality.
arXiv Detail & Related papers (2024-10-15T07:35:51Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- HODN: Disentangling Human-Object Feature for HOI Detection [51.48164941412871]
We propose a Human and Object Disentangling Network (HODN) to model the Human-Object Interaction (HOI) relationships explicitly.
Considering that human features contribute more to the interaction, we propose a Human-Guide Linking method to ensure the interaction decoder focuses on human-centric regions.
Our proposed method achieves competitive performance on both the V-COCO and HICO-DET datasets.
arXiv Detail & Related papers (2023-08-20T04:12:50Z)
- Knowledge Guided Bidirectional Attention Network for Human-Object Interaction Detection [3.0915392100355192]
We argue that the independent use of the bottom-up parsing strategy in HOI is counter-intuitive and could lead to the diffusion of attention.
We introduce a novel knowledge-guided top-down attention into HOI, and propose to model the relation parsing as a "look and search" process.
We implement the process via unifying the bottom-up and top-down attention in a single encoder-decoder based model.
arXiv Detail & Related papers (2022-07-16T16:42:49Z)
- Effective Actor-centric Human-object Interaction Detection [20.564689533862524]
We propose a novel actor-centric framework to detect Human-Object Interactions in images.
Our method achieves the state-of-the-art on the challenging V-COCO and HICO-DET benchmarks.
arXiv Detail & Related papers (2022-02-24T10:24:44Z)
- Learning Intuitive Policies Using Action Features [7.260481131198059]
We investigate the effect of network architecture on the propensity of learning algorithms to exploit semantic relationships.
We find that attention-based architectures that jointly process a featurized representation of observations and actions have a better inductive bias for learning intuitive policies.
arXiv Detail & Related papers (2022-01-29T20:54:52Z)
- Skeleton-Based Mutually Assisted Interacted Object Localization and Human Action Recognition [111.87412719773889]
We propose a joint learning framework for "interacted object localization" and "human action recognition" based on skeleton data.
Our method achieves the best or competitive performance with the state-of-the-art methods for human action recognition.
arXiv Detail & Related papers (2021-10-28T10:09:34Z)
- DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2020-03-31T08:42:06Z)