Detecting Human-Object Interactions with Object-Guided Cross-Modal
Calibrated Semantics
- URL: http://arxiv.org/abs/2202.00259v1
- Date: Tue, 1 Feb 2022 07:39:04 GMT
- Title: Detecting Human-Object Interactions with Object-Guided Cross-Modal
Calibrated Semantics
- Authors: Hangjie Yuan, Mang Wang, Dong Ni and Liangpeng Xu
- Abstract summary: We aim to boost end-to-end models with object-guided statistical priors.
We propose to utilize a Verb Semantic Model (VSM) and use semantic aggregation to profit from this object-guided hierarchy.
Combined, these modules compose the Object-guided Cross-modal Calibration Network (OCN).
- Score: 6.678312249123534
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection is an essential task to understand
human-centric images from a fine-grained perspective. Although end-to-end HOI
detection models thrive, their paradigm of parallel human/object detection and
verb class prediction loses two-stage methods' merit: object-guided hierarchy.
The object in one HOI triplet gives direct clues to the verb to be predicted.
In this paper, we aim to boost end-to-end models with object-guided statistical
priors. Specifically, we propose to utilize a Verb Semantic Model (VSM) and use
semantic aggregation to profit from this object-guided hierarchy. A Similarity
KL (SKL) loss is proposed to optimize the VSM to align with the HOI dataset's priors.
To overcome the static semantic embedding problem, we propose to generate
cross-modality-aware visual and semantic features by Cross-Modal Calibration
(CMC). Combined, the above modules compose the Object-guided Cross-modal
Calibration Network (OCN). Experiments conducted on two popular HOI detection
benchmarks demonstrate the significance of incorporating this statistical prior
knowledge and yield state-of-the-art performance. Further analysis indicates
that the proposed modules serve as a stronger verb predictor and a superior
means of utilizing prior knowledge. The code is available at
\url{https://github.com/JacobYuan7/OCN-HOI-Benchmark}.
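
To make the object-guided prior concrete, below is a minimal PyTorch sketch of how a verb semantic model and a similarity-KL-style loss could align object-conditioned verb predictions with verb co-occurrence statistics from the training set. The names (VerbSemanticModel, skl_loss, co_occurrence) and the exact formulation are illustrative assumptions, not the authors' implementation; consult the linked repository for the actual OCN code.

# Hedged sketch: an object-guided verb prior with a similarity-KL-style loss.
# All names and the formulation are illustrative, not the paper's actual code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VerbSemanticModel(nn.Module):
    """Maps an object class to a score over verb classes via embedding similarity."""
    def __init__(self, num_objects: int, num_verbs: int, dim: int = 128):
        super().__init__()
        self.obj_embed = nn.Embedding(num_objects, dim)   # object semantics
        self.verb_embed = nn.Embedding(num_verbs, dim)    # verb semantics

    def forward(self, obj_ids: torch.Tensor) -> torch.Tensor:
        # Similarity between each object and every verb -> verb logits.
        obj = self.obj_embed(obj_ids)                     # (B, dim)
        verbs = self.verb_embed.weight                    # (V, dim)
        return obj @ verbs.t()                            # (B, V)

def skl_loss(verb_logits: torch.Tensor, prior: torch.Tensor) -> torch.Tensor:
    """KL(prior || predicted verb distribution), averaged over the batch.

    `prior` is the per-object verb co-occurrence distribution estimated
    from the training annotations (each row sums to 1)."""
    log_pred = F.log_softmax(verb_logits, dim=-1)
    return F.kl_div(log_pred, prior, reduction="batchmean")

# Usage: align the VSM with dataset statistics for a batch of object ids.
num_objects, num_verbs = 80, 117                          # e.g. HICO-DET sizes
vsm = VerbSemanticModel(num_objects, num_verbs)
co_occurrence = torch.rand(num_objects, num_verbs)        # stand-in for counts
co_occurrence = co_occurrence / co_occurrence.sum(-1, keepdim=True)
obj_ids = torch.randint(0, num_objects, (16,))
loss = skl_loss(vsm(obj_ids), co_occurrence[obj_ids])
loss.backward()

In practice the co-occurrence prior would be counted from training triplets rather than sampled at random, and the verb logits from the VSM would be fused with the visual branch rather than trained in isolation.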
Related papers
- A Modern Take on Visual Relationship Reasoning for Grasp Planning [10.543168383800532]
We present a modern take on visual relational reasoning for grasp planning.
We introduce D3GD, a novel testbed that includes bin picking scenes with up to 35 objects from 97 distinct categories.
We also propose D3G, a new end-to-end transformer-based dependency graph generation model.
arXiv Detail & Related papers (2024-09-03T16:30:48Z) - A Plug-and-Play Method for Rare Human-Object Interactions Detection by Bridging Domain Gap [50.079224604394]
We present a novel model-agnostic framework called Context-Enhanced Feature Alignment (CEFA).
CEFA consists of a feature alignment module and a context enhancement module.
Our method can serve as a plug-and-play module to improve the detection performance of HOI models on rare categories.
arXiv Detail & Related papers (2024-07-31T08:42:48Z) - Exploring Self- and Cross-Triplet Correlations for Human-Object
Interaction Detection [38.86053346974547]
We propose to explore Self- and Cross-Triplet Correlations for HOI detection.
Specifically, we regard each triplet proposal as a graph in which Human and Object are the nodes and Action is the edge.
Also, we try to explore cross-triplet dependencies by jointly considering instance-level, semantic-level, and layout-level relations.
arXiv Detail & Related papers (2024-01-11T05:38:24Z) - Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos [63.94040814459116]
Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence.
We propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps.
We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations.
arXiv Detail & Related papers (2023-08-19T09:12:13Z) - Weakly-supervised Contrastive Learning for Unsupervised Object Discovery [52.696041556640516]
Unsupervised object discovery is promising due to its ability to locate objects in a generic manner.
We design a semantic-guided self-supervised learning model to extract high-level semantic features from images.
We introduce Principal Component Analysis (PCA) to localize object regions.
arXiv Detail & Related papers (2023-07-07T04:03:48Z) - Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector that predicts over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z) - 3DMODT: Attention-Guided Affinities for Joint Detection & Tracking in 3D
Point Clouds [95.54285993019843]
We propose a method for joint detection and tracking of multiple objects in 3D point clouds.
Our model exploits temporal information, employing multiple frames to detect objects and track them within a single network.
arXiv Detail & Related papers (2022-11-01T20:59:38Z) - Consistency Learning via Decoding Path Augmentation for Transformers in
Human Object Interaction Detection [11.928724924319138]
We propose cross-path consistency learning (CPC) to improve HOI detection for transformers.
Our experiments demonstrate the effectiveness of our method, achieving significant improvements on V-COCO and HICO-DET.
arXiv Detail & Related papers (2022-04-11T02:45:00Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.