Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
- URL: http://arxiv.org/abs/2507.01630v2
- Date: Wed, 23 Jul 2025 09:22:32 GMT
- Title: Prompt Guidance and Human Proximal Perception for HOT Prediction with Regional Joint Loss
- Authors: Yuxiao Wang, Yu Lei, Zhenao Wei, Weiying Xue, Xinyu Jiang, Nan Zhuang, Qi Liu
- Abstract summary: Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. We propose the P3HOT framework, which blends Prompt guidance and human Proximal Perception. Our approach achieves improvements of 0.7, 2.0, 1.6, and 11.0 points in SC-Acc., mIoU, wIoU, and AD-Acc., respectively, on the HOT-Annotated dataset.
- Score: 9.87816757989266
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The task of Human-Object conTact (HOT) detection involves identifying the specific areas of the human body that are touching objects. Nevertheless, current models are restricted to just one type of image, often leading to over-segmentation in areas with little interaction, and struggle to maintain category consistency within specific regions. To tackle this issue, a HOT framework, termed \textbf{P3HOT}, is proposed, which blends \textbf{P}rompt guidance and human \textbf{P}roximal \textbf{P}erception. To begin with, we utilize a semantic-driven prompt mechanism to direct the network's attention towards the relevant regions based on the correlation between image and text. Then a human proximal perception mechanism is employed to dynamically perceive a key depth range around the human, using learnable parameters to effectively eliminate regions where interactions are not expected. Calculating depth resolves the uncertainty of the overlap between humans and objects in a 2D perspective, providing a quasi-3D viewpoint. Moreover, a Regional Joint Loss (RJLoss) has been created as a new loss to inhibit abnormal categories in the same area. A new evaluation metric called ``AD-Acc.'' is introduced to address the shortcomings of existing methods in handling negative samples. Comprehensive experimental results demonstrate that our approach achieves state-of-the-art performance in four metrics across two benchmark datasets. Specifically, our model achieves an improvement of \textbf{0.7}$\uparrow$, \textbf{2.0}$\uparrow$, \textbf{1.6}$\uparrow$, and \textbf{11.0}$\uparrow$ in the SC-Acc., mIoU, wIoU, and AD-Acc. metrics, respectively, on the HOT-Annotated dataset. The source code is available at https://github.com/YuxiaoWang-AI/P3HOT.
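The proximal perception idea described in the abstract — keeping only pixels whose depth lies in a band around the human, with the band limits learnable — can be illustrated with a minimal sketch. The function name, the fixed sigmoid sharpness, and the use of plain keyword arguments in place of learnable parameters are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def proximal_depth_mask(depth_map, human_depth, near=0.5, far=0.5, sharpness=10.0):
    """Soft mask over a depth map: close to 1 inside the band
    [human_depth - near, human_depth + far] and decaying toward 0 outside it.
    P3HOT describes the band as governed by learnable parameters; here `near`
    and `far` are ordinary arguments for illustration only."""
    lo = human_depth - near
    hi = human_depth + far
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    # The product of two opposing sigmoids forms a smooth band-pass gate in depth,
    # suppressing regions too far in front of or behind the person.
    return sigmoid(sharpness * (depth_map - lo)) * sigmoid(sharpness * (hi - depth_map))

# Pixels near the human's depth are kept; distant background pixels are suppressed.
mask = proximal_depth_mask(np.array([2.0, 2.4, 5.0]), human_depth=2.0)
```

A soft (sigmoid) gate rather than a hard threshold keeps the mask differentiable, which is what would allow band parameters like these to be learned end-to-end.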
Related papers
- Exploring Mutual Cross-Modal Attention for Context-Aware Human Affordance Generation [18.73832646369506]
We propose a novel cross-attention mechanism to encode the scene context for affordance prediction in 2D scenes. First, we sample a probable location for a person within the scene using a variational autoencoder conditioned on the global scene context encoding. Next, we predict a potential pose template from a set of existing human pose candidates using a classifier on the local context encoding.
arXiv Detail & Related papers (2025-02-19T11:24:45Z) - Precision-Enhanced Human-Object Contact Detection via Depth-Aware Perspective Interaction and Object Texture Restoration [10.840465766762902]
Human-object contact (HOT) detection is designed to accurately identify the areas where humans and objects come into contact. Current methods fail to account for scenarios where objects frequently block the view, resulting in inaccurate identification of contact areas. We propose a perspective-interaction HOT detector called PIHOT, which utilizes a depth map generation model to provide depth information of humans and objects relative to the camera.
arXiv Detail & Related papers (2024-12-13T07:15:52Z) - AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation [55.179287851188036]
We introduce a novel all-in-one-stage framework, AiOS, for expressive human pose and shape recovery without an additional human detection step.
We first employ a human token to probe a human location in the image and encode global features for each instance.
Then, we introduce a joint-related token to probe the human joints in the image and encode a fine-grained local feature.
arXiv Detail & Related papers (2024-03-26T17:59:23Z) - Non-Local Latent Relation Distillation for Self-Adaptive 3D Human Pose
Estimation [63.199549837604444]
3D human pose estimation approaches leverage different forms of strong (2D/3D pose) or weak (multi-view or depth) paired supervision.
We cast 3D pose learning as a self-supervised adaptation problem that aims to transfer the task knowledge from a labeled source domain to a completely unpaired target.
We evaluate different self-adaptation settings and demonstrate state-of-the-art 3D human pose estimation performance on standard benchmarks.
arXiv Detail & Related papers (2022-04-05T03:52:57Z) - Human Object Interaction Detection using Two-Direction Spatial
Enhancement and Exclusive Object Prior [28.99655101929647]
Human-Object Interaction (HOI) detection aims to detect visual relations between human and objects in images.
Non-interactive human-object pairs can easily be mis-grouped and misclassified as an action.
We propose a spatial enhancement approach to enforce fine-level spatial constraints in two directions.
arXiv Detail & Related papers (2021-05-07T07:18:27Z) - DIRV: Dense Interaction Region Voting for End-to-End Human-Object
Interaction Detection [53.40028068801092]
We propose a novel one-stage HOI detection approach based on a new concept called interaction region for the HOI problem.
Unlike previous methods, our approach concentrates on the densely sampled interaction regions across different scales for each human-object pair.
In order to compensate for the detection flaws of a single interaction region, we introduce a novel voting strategy.
arXiv Detail & Related papers (2020-10-02T13:57:58Z) - DRG: Dual Relation Graph for Human-Object Interaction Detection [65.50707710054141]
We tackle the challenging problem of human-object interaction (HOI) detection.
Existing methods either recognize the interaction of each human-object pair in isolation or perform joint inference based on complex appearance-based features.
In this paper, we leverage an abstract spatial-semantic representation to describe each human-object pair and aggregate the contextual information of the scene via a dual relation graph.
arXiv Detail & Related papers (2020-08-26T17:59:40Z) - HDNet: Human Depth Estimation for Multi-Person Camera-Space Localization [83.57863764231655]
We propose the Human Depth Estimation Network (HDNet), an end-to-end framework for absolute root joint localization.
A skeleton-based Graph Neural Network (GNN) is utilized to propagate features among joints.
We evaluate our HDNet on the root joint localization and root-relative 3D pose estimation tasks with two benchmark datasets.
arXiv Detail & Related papers (2020-07-17T12:44:23Z) - A Graph-based Interactive Reasoning for Human-Object Interaction
Detection [71.50535113279551]
We present a novel graph-based interactive reasoning model called Interactive Graph (abbr. in-Graph) to infer HOIs.
We construct a new framework to assemble in-Graph models for detecting HOIs, namely in-GraphNet.
Our framework is end-to-end trainable and free from costly annotations like human pose.
arXiv Detail & Related papers (2020-07-14T09:29:03Z) - PPDM: Parallel Point Detection and Matching for Real-time Human-Object
Interaction Detection [85.75935399090379]
We propose a single-stage Human-Object Interaction (HOI) detection method that outperforms all existing methods on the HICO-DET dataset while running at 37 fps.
It is the first real-time HOI detection method.
arXiv Detail & Related papers (2019-12-30T12:00:55Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information provided and is not responsible for any consequences of its use.