QPIC: Query-Based Pairwise Human-Object Interaction Detection with
Image-Wide Contextual Information
- URL: http://arxiv.org/abs/2103.05399v1
- Date: Tue, 9 Mar 2021 12:42:54 GMT
- Title: QPIC: Query-Based Pairwise Human-Object Interaction Detection with
Image-Wide Contextual Information
- Authors: Masato Tamura, Hiroki Ohashi, Tomoaki Yoshinaga
- Abstract summary: We propose a simple, intuitive yet powerful method for human-object interaction (HOI) detection.
Existing CNN-based methods face the following three major drawbacks.
The proposed method successfully extracts contextually important features.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: We propose a simple, intuitive, yet powerful method for human-object
interaction (HOI) detection. HOIs are so diverse in their spatial distribution in an
image that existing CNN-based methods face three major drawbacks:
they cannot leverage image-wide features due to CNNs' locality; they rely on a
manually defined location of interest for feature aggregation, which
sometimes fails to cover contextually important regions; and they inevitably
mix up the features of multiple HOI instances that are located close together.
To overcome these drawbacks, we propose a transformer-based feature extractor,
in which an attention mechanism and query-based detection play key roles. The
attention mechanism is effective at aggregating contextually important
information image-wide, while the queries, each designed to capture at most
one human-object pair, avoid mixing up the features of multiple instances.
This transformer-based feature extractor produces such effective embeddings
that the subsequent detection heads can be fairly simple and intuitive.
Extensive analysis reveals that the proposed method successfully extracts
contextually important features and thus outperforms existing methods by
large margins (5.37 mAP on HICO-DET and 5.7 mAP on V-COCO). The source code
is available at https://github.com/hitachi-rd-cv/qpic.
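The division of labor the abstract describes, a transformer extractor producing one embedding per query followed by simple detection heads, can be sketched roughly as follows. All sizes, the use of single linear layers as heads, and the random weights are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen for illustration only.
num_queries, embed_dim = 4, 16      # decoder emits one embedding per query
num_obj_classes, num_actions = 3, 5

# Decoder output: each query is designed to capture at most one
# human-object pair, so its embedding feeds all heads for that pair.
query_embeddings = rng.normal(size=(num_queries, embed_dim))

def linear_head(x, out_dim, rng):
    """A single linear layer standing in for a small prediction head."""
    w = rng.normal(size=(x.shape[-1], out_dim))
    b = np.zeros(out_dim)
    return x @ w + b

# Simple, intuitive heads on top of the shared per-query embeddings:
human_boxes = linear_head(query_embeddings, 4, rng)    # (cx, cy, w, h)
object_boxes = linear_head(query_embeddings, 4, rng)   # (cx, cy, w, h)
object_logits = linear_head(query_embeddings, num_obj_classes, rng)
action_logits = linear_head(query_embeddings, num_actions, rng)

print(human_boxes.shape, object_boxes.shape,
      object_logits.shape, action_logits.shape)
```

Because every query carries one pair at most, no matching between humans and objects is needed after the decoder; each query row already binds a human box, an object box, and the action logits together.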
Related papers
- PoIFusion: Multi-Modal 3D Object Detection via Fusion at Points of Interest [65.48057241587398]
PoIFusion is a framework that fuses information from RGB images and LiDAR point clouds at points of interest (PoIs).
Our approach maintains the view of each modality and obtains multi-modal features by computation-friendly projection and computation.
We conducted extensive experiments on nuScenes and Argoverse2 datasets to evaluate our approach.
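The per-point fusion idea, projecting each point of interest onto the image plane and sampling the feature there, can be sketched with a pinhole camera model. The intrinsics, feature map, and points below are made-up values, and bilinear sampling is one plausible reading of "computation-friendly projection and computation", not PoIFusion's exact operator:

```python
import numpy as np

# Hypothetical camera intrinsics and a tiny single-channel feature map.
K = np.array([[100.0, 0.0, 32.0],
              [0.0, 100.0, 32.0],
              [0.0, 0.0, 1.0]])
feature_map = np.arange(64 * 64, dtype=np.float64).reshape(64, 64)

# Points of interest in camera coordinates (x, y, z), z > 0.
pois = np.array([[0.1, -0.2, 5.0],
                 [-0.3, 0.1, 8.0]])

def project_and_sample(points, K, fmap):
    """Project 3D PoIs to pixel coordinates, then bilinearly sample."""
    uvw = points @ K.T                 # pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]      # perspective divide -> (u, v)
    u, v = uv[:, 0], uv[:, 1]
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    du, dv = u - u0, v - v0
    h, w = fmap.shape
    u0 = np.clip(u0, 0, w - 2)
    v0 = np.clip(v0, 0, h - 2)
    # Bilinear interpolation over the four neighbouring cells.
    return ((1 - du) * (1 - dv) * fmap[v0, u0]
            + du * (1 - dv) * fmap[v0, u0 + 1]
            + (1 - du) * dv * fmap[v0 + 1, u0]
            + du * dv * fmap[v0 + 1, u0 + 1])

image_feats = project_and_sample(pois, K, feature_map)
print(image_feats.shape)  # one image feature per point of interest
```

The sampled image features would then be concatenated or added to the corresponding LiDAR features per PoI, keeping each modality in its native view until this final fusion step.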
arXiv Detail & Related papers (2024-03-14T09:28:12Z)
- Disentangled Interaction Representation for One-Stage Human-Object Interaction Detection [70.96299509159981]
Human-Object Interaction (HOI) detection is a core task for human-centric image understanding.
Recent one-stage methods adopt a transformer decoder to collect image-wide cues that are useful for interaction prediction.
Traditional two-stage methods benefit significantly from their ability to compose interaction features in a disentangled and explainable manner.
arXiv Detail & Related papers (2023-12-04T08:02:59Z)
- Enhancing HOI Detection with Contextual Cues from Large Vision-Language Models [56.257840490146]
ConCue is a novel approach for improving visual feature extraction in HOI detection.
We develop a transformer-based feature extraction module with a multi-tower architecture that integrates contextual cues into both instance and interaction detectors.
arXiv Detail & Related papers (2023-11-26T09:11:32Z)
- Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
- Correlation-Aware Deep Tracking [83.51092789908677]
We propose a novel target-dependent feature network inspired by the self-/cross-attention scheme.
Our network deeply embeds cross-image feature correlation in multiple layers of the feature network.
Our model can be flexibly pre-trained on abundant unpaired images, leading to notably faster convergence than the existing methods.
arXiv Detail & Related papers (2022-03-03T11:53:54Z)
- QAHOI: Query-Based Anchors for Human-Object Interaction Detection [29.548384966666013]
One-stage approaches have become a new trend for this task due to their high efficiency.
We propose a transformer-based method, QAHOI, which uses query-based anchors to predict all the elements of an HOI instance.
We find that a powerful backbone significantly improves accuracy for QAHOI, and QAHOI with a transformer-based backbone outperforms recent state-of-the-art methods by large margins on the HICO-DET benchmark.
arXiv Detail & Related papers (2021-12-16T05:52:23Z)
- Reformulating HOI Detection as Adaptive Set Prediction [25.44630995307787]
We reformulate HOI detection as an adaptive set prediction problem.
We propose an Adaptive Set-based one-stage framework (AS-Net) with parallel instance and interaction branches.
Our method outperforms previous state-of-the-art methods without any extra human pose and language features.
arXiv Detail & Related papers (2021-03-10T10:40:33Z)
- MultiResolution Attention Extractor for Small Object Detection [40.74232149130456]
Small objects are difficult to detect because of their low resolution and small size.
Inspired by the human vision "attention" mechanism, we exploit two feature extraction methods to mine the most useful information from small objects.
arXiv Detail & Related papers (2020-06-10T16:47:56Z)
- Learning Human-Object Interaction Detection using Interaction Points [140.0200950601552]
We propose a novel fully-convolutional approach that directly detects the interactions between human-object pairs.
Our network predicts interaction points, which directly localize and classify the interaction.
Experiments are performed on two popular benchmarks: V-COCO and HICO-DET.
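One simple way to realize interaction points, pairing each predicted point with the human-object pair whose box-center midpoint lies closest to it, can be sketched as follows. The centers and the nearest-midpoint matching rule are illustrative assumptions rather than the paper's exact grouping pipeline:

```python
import math

# Hypothetical detected box centers (x, y), for illustration only.
human_centers = [(10.0, 10.0), (50.0, 40.0)]
object_centers = [(30.0, 10.0), (50.0, 60.0)]

# A predicted interaction point, expected to lie roughly at the
# midpoint of the interacting human-object pair.
interaction_point = (20.0, 10.0)

def match_pair(ipoint, humans, objects):
    """Pick the (human, object) pair whose midpoint is nearest the point."""
    best, best_dist = None, math.inf
    for hi, (hx, hy) in enumerate(humans):
        for oi, (ox, oy) in enumerate(objects):
            mx, my = (hx + ox) / 2.0, (hy + oy) / 2.0
            dist = math.hypot(mx - ipoint[0], my - ipoint[1])
            if dist < best_dist:
                best, best_dist = (hi, oi), dist
    return best

print(match_pair(interaction_point, human_centers, object_centers))
```

Here the point (20, 10) sits exactly at the midpoint of the first human and the first object, so the grouping step pairs indices (0, 0); in a full detector this matching would also weigh detection confidences.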
arXiv Detail & Related papers (2020-03-31T08:42:06Z)
- Pixel-Semantic Revise of Position Learning A One-Stage Object Detector with A Shared Encoder-Decoder [5.371825910267909]
We observe that different methods detect objects adaptively.
Some state-of-the-art detectors combine different feature pyramids with many mechanisms to enhance multi-level semantic information.
This work addresses this with an anchor-free detector that uses a shared encoder-decoder with an attention mechanism.
arXiv Detail & Related papers (2020-01-04T08:55:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences arising from its use.