Is Object Detection Necessary for Human-Object Interaction Recognition?
- URL: http://arxiv.org/abs/2107.13083v1
- Date: Tue, 27 Jul 2021 21:15:00 GMT
- Title: Is Object Detection Necessary for Human-Object Interaction Recognition?
- Authors: Ying Jin, Yinpeng Chen, Lijuan Wang, Jianfeng Wang, Pei Yu, Zicheng Liu, Jenq-Neng Hwang
- Abstract summary: This paper revisits human-object interaction (HOI) recognition at image level without using supervision of object location or human pose.
We name it detection-free HOI recognition, in contrast to the existing detection-supervised approaches.
- Score: 37.61038047282247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper revisits human-object interaction (HOI) recognition at image level
without using supervision of object location or human pose. We name it
detection-free HOI recognition, in contrast to the existing
detection-supervised approaches which rely on object and keypoint detections to
achieve state of the art. With our method, not only is detection supervision
avoidable, but superior performance can be achieved by properly using
image-text pre-training (such as CLIP) and the proposed Log-Sum-Exp Sign
(LSE-Sign) loss function. Specifically, using text embeddings of class labels
to initialize the linear classifier is essential for leveraging the CLIP
pre-trained image encoder. In addition, LSE-Sign loss facilitates learning from
multiple labels on an imbalanced dataset by normalizing gradients over all
classes in a softmax format. Surprisingly, our detection-free solution achieves
60.5 mAP on the HICO dataset, outperforming the detection-supervised state of
the art by 13.4 mAP.
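
The two ingredients named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. The abstract gives no formula for the LSE-Sign loss, so the form used here, log(1 + sum_i exp(-gamma * y_i * x_i)) with y_i in {+1, -1}, is an assumption chosen to match the description "normalizing gradients over all classes in a softmax format". The function names, the `gamma` scale, and the toy vectors are hypothetical, and the toy vectors stand in for real CLIP text embeddings.

```python
import math


def init_classifier_from_text(text_embeddings):
    """Return L2-normalized copies of per-class text embeddings.

    Each row serves as the initial weight vector of one class in a
    bias-free linear classifier (an assumed, simplified setup).
    Embeddings are assumed to be non-zero vectors.
    """
    weights = []
    for vec in text_embeddings:
        norm = math.sqrt(sum(v * v for v in vec))
        weights.append([v / norm for v in vec])
    return weights


def class_logits(image_feature, weights):
    """One logit per class: dot product of the feature with each weight row."""
    return [sum(w * f for w, f in zip(row, image_feature)) for row in weights]


def _shifted_terms(logits, labels, gamma):
    """Compute z_i = -gamma * y_i * x_i and a stable softmax-style denominator."""
    signs = [1.0 if y else -1.0 for y in labels]
    z = [-gamma * s * x for s, x in zip(signs, logits)]
    # The extra 0.0 accounts for the "1 +" term inside the logarithm.
    m = max([0.0] + z)
    denom = math.exp(-m) + sum(math.exp(v - m) for v in z)
    return signs, z, m, denom


def lse_sign_loss(logits, labels, gamma=1.0):
    """Assumed form: log(1 + sum_i exp(-gamma * y_i * x_i))."""
    _, _, m, denom = _shifted_terms(logits, labels, gamma)
    return m + math.log(denom)


def lse_sign_grad(logits, labels, gamma=1.0):
    """Gradient of the assumed loss with respect to each logit.

    The magnitudes share one denominator, so (divided by gamma) they
    sum to less than 1, much like softmax probabilities.
    """
    signs, z, m, denom = _shifted_terms(logits, labels, gamma)
    return [-gamma * s * math.exp(v - m) / denom for s, v in zip(signs, z)]


# Toy usage: two classes, 2-dimensional placeholder embeddings.
weights = init_classifier_from_text([[3.0, 4.0], [0.0, 2.0]])
logits = class_logits([1.0, 0.0], weights)
loss = lse_sign_loss(logits, labels=[1, 0])
```

Under this assumed form, the gradient for a positive label is negative (pushing the logit up) and the gradient for a negative label is positive, while the shared denominator keeps any single class from dominating the update.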
Related papers
- Learning Camouflaged Object Detection from Noisy Pseudo Label [60.9005578956798]
This paper introduces the first weakly semi-supervised Camouflaged Object Detection (COD) method.
It aims for budget-efficient and high-precision camouflaged object segmentation with an extremely limited number of fully labeled images.
We propose a noise correction loss that facilitates the model's learning of correct pixels in the early learning stage.
When using only 20% of fully labeled data, our method shows superior performance over the state-of-the-art methods.
arXiv Detail & Related papers (2024-07-18T04:53:51Z)
- Disentangled Pre-training for Human-Object Interaction Detection [22.653500926559833]
We propose an efficient disentangled pre-training method for HOI detection (DP-HOI).
DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers.
It significantly enhances the performance of existing HOI detection models on a broad range of rare categories.
arXiv Detail & Related papers (2024-04-02T08:21:16Z)
- Learning Remote Sensing Object Detection with Single Point Supervision [17.12725535531483]
Pointly Supervised Object Detection (PSOD) has attracted considerable interest due to its lower labeling cost compared to box-level supervised object detection.
We make the first attempt to achieve RS object detection with single point supervision, and propose a PSOD method tailored for RS images.
Our method can achieve significantly better performance as compared to state-of-the-art image-level and point-level supervised detection methods, and reduce the performance gap between PSOD and box-level supervised object detection.
arXiv Detail & Related papers (2023-05-23T15:06:04Z)
- The Overlooked Classifier in Human-Object Interaction Recognition [82.20671129356037]
We encode the semantic correlation among classes into the classification head by initializing the weights with language embeddings of HOIs.
We propose a new loss named LSE-Sign to enhance multi-label learning on a long-tailed dataset.
Our simple yet effective method enables detection-free HOI classification, outperforming state-of-the-art methods that require object detection and human pose by a clear margin.
arXiv Detail & Related papers (2022-03-10T23:35:00Z)
- Decoupling Object Detection from Human-Object Interaction Recognition [37.133695677465376]
DEFR is a DEtection-FRee method to recognize Human-Object Interactions (HOI) at image level without using object location or human pose.
We present two findings to boost the performance of the detection-free approach, which significantly outperforms the detection-assisted state of the art.
arXiv Detail & Related papers (2021-12-13T03:01:49Z)
- Region-level Active Learning for Cluttered Scenes [60.93811392293329]
We introduce a new strategy that subsumes previous Image-level and Object-level approaches into a generalized, Region-level approach.
We show that this approach significantly decreases labeling effort and improves rare object search on realistic data with inherent class-imbalance and cluttered scenes.
arXiv Detail & Related papers (2021-08-20T14:02:38Z)
- DAP: Detection-Aware Pre-training with Weak Supervision [37.336674323981285]
This paper presents a detection-aware pre-training (DAP) approach for object detection tasks.
We transform a classification dataset into a detection dataset through a weakly supervised object localization method based on Class Activation Maps.
We show that DAP can outperform the traditional classification pre-training in terms of both sample efficiency and convergence speed in downstream detection tasks including VOC and COCO.
arXiv Detail & Related papers (2021-03-30T19:48:30Z)
- Instance Localization for Self-supervised Detection Pretraining [68.24102560821623]
We propose a new self-supervised pretext task, called instance localization.
We show that integration of bounding boxes into pretraining promotes better task alignment and architecture alignment for transfer learning.
Experimental results demonstrate that our approach yields state-of-the-art transfer learning results for object detection.
arXiv Detail & Related papers (2021-02-16T17:58:57Z)
- Dense Label Encoding for Boundary Discontinuity Free Rotation Detection [69.75559390700887]
This paper explores a relatively less-studied methodology based on classification.
We propose new techniques to push its frontier in two aspects.
Experiments and visual analysis on large-scale public datasets for aerial images show the effectiveness of our approach.
arXiv Detail & Related papers (2020-11-19T05:42:02Z)