Decoupling Object Detection from Human-Object Interaction Recognition
- URL: http://arxiv.org/abs/2112.06392v1
- Date: Mon, 13 Dec 2021 03:01:49 GMT
- Title: Decoupling Object Detection from Human-Object Interaction Recognition
- Authors: Ying Jin, Yinpeng Chen, Lijuan Wang, Jianfeng Wang, Pei Yu, Lin Liang,
Jenq-Neng Hwang, Zicheng Liu
- Abstract summary: DEFR is a DEtection-FRee method to recognize Human-Object Interactions (HOI) at image level without using object location or human pose.
We present two findings that boost the performance of the detection-free approach, which significantly outperforms the detection-assisted state of the art.
- Score: 37.133695677465376
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose DEFR, a DEtection-FRee method to recognize Human-Object
Interactions (HOI) at image level without using object location or human pose.
This is challenging as the detector is an integral part of existing methods. In
this paper, we present two findings that boost the performance of the
detection-free approach, which significantly outperforms the detection-assisted
state of the art. First, we find it crucial to effectively leverage the
semantic correlations among HOI classes. A remarkable gain can be achieved by
using language embeddings of HOI labels to initialize the linear classifier,
which encodes the structure of HOIs to guide training. Further, we propose
Log-Sum-Exp Sign (LSE-Sign) loss to facilitate multi-label learning on a
long-tailed dataset by balancing gradients over all classes in a softmax
format. Our detection-free approach achieves 65.6 mAP in HOI classification on
HICO, outperforming the detection-assisted state of the art (SOTA) by 18.5 mAP,
and 52.7 mAP in one-shot classes, surpassing the SOTA by 27.3 mAP. Unlike
previous work, our classification model (DEFR) can be directly used in HOI
detection without any additional training, by connecting to an off-the-shelf
object detector whose bounding box output is converted to binary masks for
DEFR. Surprisingly, such a simple connection of two decoupled models achieves
SOTA performance (32.35 mAP).
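The two mechanical pieces described in the abstract, the LSE-Sign loss and the conversion of detector boxes to binary masks, can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the loss form log(1 + sum_i exp(-gamma * y_i * x_i)) with y_i in {+1, -1} is an assumption chosen to match the abstract's description of a softmax-format gradient, and the function names, the gamma parameter, and the exclusive-corner box convention are all hypothetical.

```python
import math


def _signed_terms(logits, targets, gamma):
    # Map 0/1 multi-label targets to +1/-1 signs, then form z_i = -gamma * y_i * x_i.
    signs = [1.0 if t else -1.0 for t in targets]
    terms = [-gamma * s * x for s, x in zip(signs, logits)]
    return signs, terms


def lse_sign_loss(logits, targets, gamma=1.0):
    """Assumed LSE-Sign form: log(1 + sum_i exp(-gamma * y_i * x_i))."""
    _, terms = _signed_terms(logits, targets, gamma)
    # Stable log-sum-exp: shift by the largest exponent (0 stands for the "1 +").
    m = max(0.0, max(terms)) if terms else 0.0
    return m + math.log(math.exp(-m) + sum(math.exp(z - m) for z in terms))


def lse_sign_grad(logits, targets, gamma=1.0):
    """Gradient of the assumed loss with respect to each logit."""
    signs, terms = _signed_terms(logits, targets, gamma)
    m = max(0.0, max(terms)) if terms else 0.0
    denom = math.exp(-m) + sum(math.exp(z - m) for z in terms)
    # Each class gets a softmax-like weight exp(z_i) / (1 + sum_j exp(z_j)).
    return [-gamma * s * math.exp(z - m) / denom for s, z in zip(signs, terms)]


def box_to_mask(box, height, width):
    """Turn one box (x1, y1, x2, y2) into a height x width 0/1 mask.

    Assumed convention: pixel coordinates, x2 and y2 exclusive, clipped to the image.
    """
    x1, y1, x2, y2 = box
    x1, x2 = max(0, int(x1)), min(width, int(x2))
    y1, y2 = max(0, int(y1)), min(height, int(y2))
    return [
        [1 if (y1 <= r < y2 and x1 <= c < x2) else 0 for c in range(width)]
        for r in range(height)
    ]


if __name__ == "__main__":
    print(lse_sign_loss([2.0, -1.0, 0.5], [1, 0, 0]))
    print(lse_sign_grad([2.0, -1.0, 0.5], [1, 0, 0]))
    for row in box_to_mask((1, 0, 3, 2), height=3, width=4):
        print(row)
```

Under this assumed form, the per-class gradient weights share one normalizer, so their magnitudes sum to less than one across all classes. That is one way to read "balancing gradients over all classes in a softmax format".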
Related papers
- Disentangled Pre-training for Human-Object Interaction Detection [22.653500926559833]
We propose an efficient disentangled pre-training method for HOI detection (DP-HOI).
DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers.
It significantly enhances the performance of existing HOI detection models on a broad range of rare categories.
arXiv Detail & Related papers (2024-04-02T08:21:16Z)
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve the efficiency of spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation [86.41437210485932]
We aim at advancing zero-shot HOI detection to detect both seen and unseen HOIs simultaneously.
We propose a novel end-to-end zero-shot HOI Detection framework via vision-language knowledge distillation.
Our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP.
arXiv Detail & Related papers (2022-04-01T07:27:19Z)
- The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z)
- G-DetKD: Towards General Distillation Framework for Object Detectors via Contrastive and Semantic-guided Feature Imitation [49.421099172544196]
We propose a novel semantic-guided feature imitation technique, which automatically performs soft matching between feature pairs across all pyramid levels.
We also introduce contrastive distillation to effectively capture the information encoded in the relationship between different feature regions.
Our method consistently outperforms existing detection KD techniques, and works when components in the framework are used separately and in conjunction.
arXiv Detail & Related papers (2021-08-17T07:44:27Z)
- Mining the Benefits of Two-stage and One-stage HOI Detection [26.919979955155664]
Two-stage methods have dominated Human-Object Interaction (HOI) detection for several years.
One-stage methods struggle to strike an appropriate trade-off in multi-task learning, i.e., object detection and interaction classification.
We propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner.
arXiv Detail & Related papers (2021-08-11T07:38:09Z)
- Is Object Detection Necessary for Human-Object Interaction Recognition? [37.61038047282247]
This paper revisits human-object interaction (HOI) recognition at image level without using supervision of object location and human pose.
We name it detection-free HOI recognition, in contrast to the existing detection-supervised approaches.
arXiv Detail & Related papers (2021-07-27T21:15:00Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is compromised by many conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
- EHSOD: CAM-Guided End-to-end Hybrid-Supervised Object Detection with Cascade Refinement [53.69674636044927]
We present EHSOD, an end-to-end hybrid-supervised object detection system.
It can be trained in one shot on both fully and weakly-annotated data.
It achieves comparable results on multiple object detection benchmarks with only 30% fully-annotated data.
arXiv Detail & Related papers (2020-02-18T08:04:58Z)