Towards Hard-Positive Query Mining for DETR-based Human-Object
Interaction Detection
- URL: http://arxiv.org/abs/2207.05293v1
- Date: Tue, 12 Jul 2022 04:03:12 GMT
- Title: Towards Hard-Positive Query Mining for DETR-based Human-Object
Interaction Detection
- Authors: Xubin Zhong, Changxing Ding, Zijian Li, and Shaoli Huang
- Abstract summary: Human-Object Interaction (HOI) detection is a core task for high-level image understanding.
In this paper, we propose to enhance Detection Transformer (DETR)-based HOI detectors by mining hard-positive queries.
Experimental results show that our proposed approach can be widely applied to existing DETR-based HOI detectors.
- Score: 20.809479387186506
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Human-Object Interaction (HOI) detection is a core task for high-level image
understanding. Recently, Detection Transformer (DETR)-based HOI detectors have
become popular due to their superior performance and efficient structure.
However, these approaches typically adopt fixed HOI queries for all testing
images, which makes them vulnerable to changes in object locations within a
specific image. Accordingly, in this paper, we propose to enhance DETR's robustness by
mining hard-positive queries, which are forced to make correct predictions
using partial visual cues. First, we explicitly compose hard-positive queries
according to the ground-truth (GT) position of labeled human-object pairs for
each training image. Specifically, we shift the GT bounding boxes of each
labeled human-object pair so that the shifted boxes cover only a certain
portion of the GT ones. We encode the coordinates of the shifted boxes for each
labeled human-object pair into an HOI query. Second, we implicitly construct
another set of hard-positive queries by masking the top scores in
cross-attention maps of the decoder layers. The masked attention maps then only
cover partial important cues for HOI predictions. Finally, an alternate
strategy is proposed that efficiently combines both types of hard queries. In
each iteration, both DETR's learnable queries and one selected type of
hard-positive queries are adopted for loss computation. Experimental results
show that our proposed approach can be widely applied to existing DETR-based
HOI detectors. Moreover, we consistently achieve state-of-the-art performance
on three benchmarks: HICO-DET, V-COCO, and HOI-A. Code is available at
https://github.com/MuchHair/HQM.
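A minimal PyTorch-style sketch (not the authors' released code) of the explicit hard-positive query construction described above: the GT human and object boxes of a labeled pair are shifted so that the shifted boxes cover only a portion of the originals, and the shifted coordinates are encoded into an HOI query. The box format (normalized cx, cy, w, h), the jitter ranges, and the MLP encoder are assumptions for illustration.

```python
# Sketch of explicit hard-positive queries, assuming normalized (cx, cy, w, h) boxes.
import torch
import torch.nn as nn


def shift_box(box, scale_range=(0.7, 0.9), offset_range=(0.1, 0.3)):
    """Jitter a (..., 4) box so the shifted box covers only part of the GT box."""
    cx, cy, w, h = box.unbind(-1)
    # shrink the box and push its center away from the GT center (assumed ranges)
    s = torch.empty_like(w).uniform_(*scale_range)
    dx = torch.empty_like(cx).uniform_(*offset_range) * w * torch.sign(torch.randn_like(cx))
    dy = torch.empty_like(cy).uniform_(*offset_range) * h * torch.sign(torch.randn_like(cy))
    return torch.stack([(cx + dx).clamp(0, 1), (cy + dy).clamp(0, 1), w * s, h * s], dim=-1)


class HardPositiveQueryEncoder(nn.Module):
    """Encode a shifted human/object box pair into a single HOI query vector."""

    def __init__(self, d_model=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(8, d_model), nn.ReLU(), nn.Linear(d_model, d_model))

    def forward(self, gt_human_boxes, gt_object_boxes):
        # gt_*_boxes: (num_pairs, 4) ground-truth boxes of labeled human-object pairs
        shifted_h = shift_box(gt_human_boxes)
        shifted_o = shift_box(gt_object_boxes)
        return self.mlp(torch.cat([shifted_h, shifted_o], dim=-1))  # (num_pairs, d_model)
```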
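Likewise, a hedged sketch of the implicit hard-positive queries: the top-scoring positions in a decoder cross-attention map are masked so that only partial visual cues remain. The mask ratio and the renormalization step are assumptions; per the alternate strategy in the abstract, one of the two hard-query types would be selected in each iteration alongside DETR's learnable queries for loss computation.

```python
# Sketch of masking the top scores in a decoder cross-attention map (assumed hook point).
import torch


def mask_top_attention(attn, mask_ratio=0.1):
    """attn: (num_queries, num_keys) softmax cross-attention weights.
    Zero out the highest-scoring keys per query and renormalize."""
    num_keys = attn.size(-1)
    k = max(1, int(mask_ratio * num_keys))
    topk = attn.topk(k, dim=-1).indices                   # positions carrying the strongest cues
    masked = attn.scatter(-1, topk, 0.0)                  # drop those cues
    return masked / masked.sum(dim=-1, keepdim=True).clamp_min(1e-6)


# usage on a dummy attention map
attn = torch.softmax(torch.randn(64, 100), dim=-1)
hard_attn = mask_top_attention(attn, mask_ratio=0.1)
```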
Related papers
- A Hitchhikers Guide to Fine-Grained Face Forgery Detection Using Common Sense Reasoning [9.786907179872815]
The potential of vision and language remains underexplored in face forgery detection.
There is a need for a methodology that converts face forgery detection to a Visual Question Answering (VQA) task.
To address this gap, we propose a multi-stage approach that diverges from the traditional binary decision paradigm.
arXiv Detail & Related papers (2024-10-01T08:16:40Z) - SEED: A Simple and Effective 3D DETR in Point Clouds [72.74016394325675]
We argue that the main challenges stem from the high sparsity and uneven distribution of point clouds.
We propose a simple and effective 3D DETR method (SEED) for detecting 3D objects from point clouds.
arXiv Detail & Related papers (2024-07-15T14:21:07Z) - Disentangled Pre-training for Human-Object Interaction Detection [22.653500926559833]
We propose an efficient disentangled pre-training method for HOI detection (DP-HOI).
DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers.
It significantly enhances the performance of existing HOI detection models on a broad range of rare categories.
arXiv Detail & Related papers (2024-04-02T08:21:16Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - DETR with Additional Global Aggregation for Cross-domain Weakly
Supervised Object Detection [34.14603473160207]
This paper presents a DETR-based method for cross-domain weakly supervised object detection (CDWSOD).
We think DETR has strong potential for CDWSOD due to an insight: the encoder and the decoder in DETR are both based on the attention mechanism.
The aggregation results, i.e., image-level predictions, can naturally exploit the weak supervision for domain alignment.
arXiv Detail & Related papers (2023-04-14T12:16:42Z) - Query-based Hard-Image Retrieval for Object Detection at Test Time [10.63460618121976]
We reformulate the problem of finding "hard" images as a query-based hard image retrieval task.
Our method is entirely post-hoc, does not require ground-truth annotations, and relies on an efficient Monte Carlo estimation.
We provide results on ranking and classification tasks using the widely used RetinaNet, Faster-RCNN, Mask-RCNN, and Cascade Mask-RCNN object detectors.
arXiv Detail & Related papers (2022-09-23T12:33:31Z) - Salient Object Ranking with Position-Preserved Attention [44.94722064885407]
We study the Salient Object Ranking (SOR) task, which aims to assign a ranking order to each detected object according to its visual saliency.
We propose the first end-to-end framework of the SOR task and solve it in a multi-task learning fashion.
We also introduce a Position-Preserved Attention (PPA) module tailored for the SOR branch.
arXiv Detail & Related papers (2021-06-09T13:00:05Z) - MRDet: A Multi-Head Network for Accurate Oriented Object Detection in
Aerial Images [51.227489316673484]
We propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors.
To obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network.
Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately.
arXiv Detail & Related papers (2020-12-24T06:36:48Z) - ConsNet: Learning Consistency Graph for Zero-Shot Human-Object
Interaction Detection [101.56529337489417]
We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of <human, action, object> in images.
We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs.
Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into visual-semantic joint embedding space and obtains detection results by measuring their similarities.
arXiv Detail & Related papers (2020-08-14T09:11:18Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z) - EHSOD: CAM-Guided End-to-end Hybrid-Supervised Object Detection with
Cascade Refinement [53.69674636044927]
We present EHSOD, an end-to-end hybrid-supervised object detection system.
It can be trained in one shot on both fully- and weakly-annotated data.
It achieves comparable results on multiple object detection benchmarks with only 30% fully-annotated data.
arXiv Detail & Related papers (2020-02-18T08:04:58Z)