Exploring Structure-aware Transformer over Interaction Proposals for
Human-Object Interaction Detection
- URL: http://arxiv.org/abs/2206.06291v1
- Date: Mon, 13 Jun 2022 16:21:08 GMT
- Title: Exploring Structure-aware Transformer over Interaction Proposals for
Human-Object Interaction Detection
- Authors: Yong Zhang and Yingwei Pan and Ting Yao and Rui Huang and Tao Mei and
Chang-Wen Chen
- Abstract summary: We design a novel Transformer-style Human-Object Interaction (HOI) detector, i.e., Structure-aware Transformer over Interaction Proposals (STIP).
STIP decomposes HOI set prediction into two sequential phases: interaction proposal generation is performed first, followed by transforming the non-parametric interaction proposals into HOI predictions via a structure-aware Transformer.
The structure-aware Transformer upgrades the vanilla Transformer by additionally encoding the holistic semantic structure among interaction proposals as well as the local spatial structure of the human/object within each interaction proposal, so as to strengthen HOI predictions.
- Score: 119.93025368028083
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent high-performing Human-Object Interaction (HOI) detection techniques
have been strongly influenced by the Transformer-based object detector DETR.
Nevertheless, most of them directly map parametric interaction queries into a
set of HOI predictions through a vanilla Transformer in a one-stage manner,
leaving the rich inter- and intra-interaction structure under-exploited. In this
work, we design a novel Transformer-style HOI detector, Structure-aware
Transformer over Interaction Proposals (STIP). This design decomposes HOI set
prediction into two sequential phases: interaction proposal generation is
performed first, followed by transforming the non-parametric interaction
proposals into HOI predictions via a structure-aware Transformer. The
structure-aware Transformer upgrades the vanilla Transformer by additionally
encoding the holistic semantic structure among interaction proposals as well as
the local spatial structure of the human/object within each interaction
proposal, thereby strengthening HOI predictions. Extensive experiments on the
V-COCO and HICO-DET benchmarks demonstrate the effectiveness of STIP, with
superior results compared to state-of-the-art HOI detectors. Source code is
available at https://github.com/zyong812/STIP.
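For intuition, here is a minimal sketch of the two-phase decomposition the abstract describes: human-object pairs are first scored and selected as interaction proposals, then a Transformer reasons over the selected proposals before verb classification. All module names and shapes are assumptions, and a plain TransformerEncoder stands in for the structure-aware attention; this is not the authors' implementation (see the linked repository for that).

```python
# Hypothetical sketch of STIP's two-phase decoding; not the released model.
import torch
import torch.nn as nn

class STIPSketch(nn.Module):
    def __init__(self, dim=256, num_verbs=29, topk=32):
        super().__init__()
        self.topk = topk
        # Phase 1: score every human-object pair as an interaction proposal.
        self.interactiveness = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.pair_embed = nn.Linear(2 * dim, dim)
        # Phase 2: reason over the selected proposals. A plain
        # TransformerEncoder stands in for the structure-aware attention that
        # additionally encodes inter-proposal semantics and the human/object
        # layout inside each proposal.
        self.reasoner = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=2)
        self.verb_head = nn.Linear(dim, num_verbs)

    def forward(self, human_feats, object_feats):
        # human_feats: (B, H, dim), object_feats: (B, O, dim),
        # e.g. instance features from a frozen DETR detector.
        B, H, D = human_feats.shape
        O = object_feats.shape[1]
        pairs = torch.cat([human_feats.unsqueeze(2).expand(B, H, O, D),
                           object_feats.unsqueeze(1).expand(B, H, O, D)],
                          dim=-1).flatten(1, 2)           # (B, H*O, 2*dim)
        scores = self.interactiveness(pairs).squeeze(-1)  # (B, H*O)
        k = min(self.topk, scores.shape[1])
        idx = scores.topk(k, dim=1).indices               # (B, K)
        embedded = self.pair_embed(pairs)                 # (B, H*O, dim)
        proposals = torch.gather(
            embedded, 1, idx.unsqueeze(-1).expand(-1, -1, D))
        return self.verb_head(self.reasoner(proposals))   # (B, K, num_verbs)
```

Under these assumptions, `STIPSketch()(torch.randn(2, 4, 256), torch.randn(2, 8, 256))` yields verb logits for the 32 highest-scoring proposals; treating proposals as non-parametric inputs (rather than learned queries) is what lets phase 2 condition on concrete detections.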
Related papers
- Neural-Logic Human-Object Interaction Detection [67.4993347702353]
We present LogicHOI, a new HOI detector that leverages neural-logic reasoning and Transformers to infer feasible interactions between entities.
Specifically, we modify the self-attention mechanism in the vanilla Transformer, enabling it to reason over the ⟨human, action, object⟩ triplet and constitute novel interactions.
We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities.
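To make the "ground them into continuous space" step concrete, below is an illustrative sketch that grounds one first-order rule with a product t-norm and a fuzzy implication; the rule, probabilities, and tensor names are assumptions for exposition, not LogicHOI's actual formulation.

```python
# Illustrative grounding of a first-order rule into continuous space;
# the rule and names below are assumptions, not LogicHOI's exact loss.
import torch

def soft_and(p, q):
    return p * q                     # product t-norm for conjunction

def soft_implies(p, q):
    return 1.0 - p + p * q           # Reichenbach fuzzy implication

def logic_consistency_loss(p_interact, p_affordance, p_action):
    # Rule: interact(h, o) AND affords(o, a)  =>  action(h, a, o).
    # All inputs are probabilities in [0, 1] with broadcastable shapes.
    premise = soft_and(p_interact.unsqueeze(-1), p_affordance)
    truth = soft_implies(premise, p_action)
    return (1.0 - truth).mean()      # penalize violated groundings

p_interact = torch.rand(8)           # 8 human-object pairs
p_affordance = torch.rand(8, 29)     # 29 candidate actions per pair
p_action = torch.rand(8, 29)
loss = logic_consistency_loss(p_interact, p_affordance, p_action)
```

Because the t-norm and implication are differentiable, such a term can be added to the training loss to constrain learning, which is the general mechanism the summary alludes to.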
arXiv Detail & Related papers (2023-11-16T11:47:53Z)
- Object Detection with Transformers: A Review [11.255962936937744]
This paper provides a comprehensive review of 21 recently proposed advancements in the original DETR model.
We conduct a comparative analysis across various detection transformers, evaluating their performance and network architectures.
We hope that this study will ignite further interest among researchers in addressing the existing challenges and exploring the application of transformers in the object detection domain.
arXiv Detail & Related papers (2023-06-07T16:13:38Z)
- Full Stack Optimization of Transformer Inference: a Survey [58.55475772110702]
Transformer models achieve superior accuracy across a wide range of applications.
The amount of compute and bandwidth required for inference of recent Transformer models is growing at a significant rate.
There has been an increased focus on making Transformer models more efficient.
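As a rough illustration of why inference cost grows so quickly, here is a standard back-of-the-envelope FLOPs estimate for a single Transformer layer (counting 2 FLOPs per multiply-add); the formulas are textbook approximations, not figures from the survey.

```python
# Approximate FLOPs for one forward pass of a Transformer encoder layer.
def layer_flops(seq_len: int, d_model: int, d_ff: int) -> float:
    qkv_and_out = 4 * (2 * seq_len * d_model * d_model)  # Q, K, V, O projections
    attention = 2 * (2 * seq_len * seq_len * d_model)    # QK^T and AV products
    ffn = 2 * (2 * seq_len * d_model * d_ff)             # two feed-forward layers
    return qkv_and_out + attention + ffn

# Example: a BERT-base-like layer at sequence length 512.
print(f"{layer_flops(512, 768, 3072) / 1e9:.1f} GFLOPs per layer")
```

The attention term scales quadratically with sequence length, which is one reason inference compute grows quickly as models are applied to longer inputs.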
arXiv Detail & Related papers (2023-02-27T18:18:13Z)
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state-of-the-art of few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection [11.928724924319138]
We propose cross-path consistency learning (CPC) to improve HOI detection for transformers.
Our experiments demonstrate the effectiveness of our method, achieving significant improvements on V-COCO and HICO-DET.
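The summary leaves "cross-path consistency" abstract; below is a minimal sketch of one plausible reading, where two decoding paths predict the same verbs and are regularized toward agreement with a symmetric KL term. The paths, shapes, and the choice of KL are assumptions, not necessarily CPC's exact loss.

```python
# Sketch of a cross-path consistency term between two decoding paths that
# predict the same HOI targets; illustrative only, not CPC's exact objective.
import torch
import torch.nn.functional as F

def cross_path_consistency(logits_path_a, logits_path_b):
    # Symmetric KL between the verb distributions of the two decoding paths.
    log_a = F.log_softmax(logits_path_a, dim=-1)
    log_b = F.log_softmax(logits_path_b, dim=-1)
    kl_ab = F.kl_div(log_a, log_b, log_target=True, reduction="batchmean")
    kl_ba = F.kl_div(log_b, log_a, log_target=True, reduction="batchmean")
    return 0.5 * (kl_ab + kl_ba)

a = torch.randn(16, 29)   # e.g. path 1: direct HOI decoding
b = torch.randn(16, 29)   # e.g. path 2: a factorized decoding of the same pairs
loss = cross_path_consistency(a, b)
```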
arXiv Detail & Related papers (2022-04-11T02:45:00Z)
- What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions [26.87434934565539]
We propose a one-stage Semantic and Spatial Refined Transformer (SSRT) to solve the Human-Object Interaction detection task.
Two new modules help select the most relevant object-action pairs within an image and refine the queries' representation using rich semantic and spatial features.
These enhancements lead to state-of-the-art results on the two most popular HOI benchmarks: V-COCO and HICO-DET.
arXiv Detail & Related papers (2022-04-02T02:41:31Z)
- Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer [41.44769642537572]
The Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for HOIs.
We evaluate our method on the HICO-DET and V-COCO datasets, significantly outperforming state-of-the-art approaches.
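A minimal sketch of the unary/pairwise split described above: detection tokens are first refined individually (unary), then concatenated human-object pairs are refined jointly (pairwise) before verb scoring. Layer sizes, the human/object split, and the scoring head are illustrative assumptions.

```python
# Illustrative unary-then-pairwise refinement over frozen detections;
# sizes and the split of tokens into humans/objects are assumptions.
import torch
import torch.nn as nn

dim, num_verbs = 256, 29
unary_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
pair_layer = nn.TransformerEncoderLayer(2 * dim, nhead=8, batch_first=True)
verb_head = nn.Linear(2 * dim, num_verbs)

det_tokens = torch.randn(1, 12, dim)          # detections from a frozen detector
unary = unary_layer(det_tokens)               # unary (per-instance) refinement
h, o = unary[:, :4], unary[:, 4:]             # assume first 4 tokens are humans
pairs = torch.cat([h.unsqueeze(2).expand(-1, 4, 8, dim),
                   o.unsqueeze(1).expand(-1, 4, 8, dim)], -1).flatten(1, 2)
scores = verb_head(pair_layer(pairs))         # (1, 32, num_verbs)
```

Keeping the detector frozen and refining only these lightweight unary and pairwise tokens is what makes such two-stage designs efficient relative to end-to-end query decoding.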
arXiv Detail & Related papers (2021-12-03T10:52:06Z)
- ViDT: An Efficient and Effective Fully Transformer-based Object Detector [97.71746903042968]
Detection transformers are the first fully end-to-end learning systems for object detection, while vision transformers are the first fully transformer-based architectures for image classification.
In this paper, we integrate Vision and Detection Transformers (ViDT) to build an effective and efficient object detector.
arXiv Detail & Related papers (2021-10-08T06:32:05Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
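"Takes image patches as inputs" refers to the standard ViT-style tokenization; a minimal sketch with a per-patch saliency head follows. The head and all hyperparameters are assumptions, not VST's actual decoder.

```python
# ViT-style patch tokenization with a toy per-patch saliency head;
# illustrative of the input pipeline only, not VST's architecture.
import torch
import torch.nn as nn

patch, dim = 16, 384
to_tokens = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # patch embedding
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True), num_layers=4)
saliency_head = nn.Linear(dim, 1)

img = torch.randn(1, 3, 224, 224)
tokens = to_tokens(img).flatten(2).transpose(1, 2)   # (1, 196, dim)
sal = saliency_head(encoder(tokens)).sigmoid()       # per-patch saliency
sal_map = sal.transpose(1, 2).reshape(1, 1, 14, 14)  # coarse saliency map
```

Because every patch token attends to every other, global context propagates across the whole image in each layer, which is the property the summary emphasizes for saliency detection.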
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.