Efficient Two-Stage Detection of Human-Object Interactions with a Novel
Unary-Pairwise Transformer
- URL: http://arxiv.org/abs/2112.01838v1
- Date: Fri, 3 Dec 2021 10:52:06 GMT
- Authors: Frederic Z. Zhang, Dylan Campbell and Stephen Gould
- Abstract summary: Unary-Pairwise Transformer is a two-stage detector that exploits unary and pairwise representations for HOIs.
We evaluate our method on the HICO-DET and V-COCO datasets, and significantly outperform state-of-the-art approaches.
- Score: 41.44769642537572
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Recent developments in transformer models for visual data have led to
significant improvements in recognition and detection tasks. In particular,
using learnable queries in place of region proposals has given rise to a new
class of one-stage detection models, spearheaded by the Detection Transformer
(DETR). Variations on this one-stage approach have since dominated human-object
interaction (HOI) detection. However, the success of such one-stage HOI
detectors can largely be attributed to the representation power of
transformers. We discovered that when equipped with the same transformer, their
two-stage counterparts can be more performant and memory-efficient, while
taking a fraction of the time to train. In this work, we propose the
Unary-Pairwise Transformer, a two-stage detector that exploits unary and
pairwise representations for HOIs. We observe that the unary and pairwise parts
of our transformer network specialise, with the former preferentially
increasing the scores of positive examples and the latter decreasing the scores
of negative examples. We evaluate our method on the HICO-DET and V-COCO
datasets, and significantly outperform state-of-the-art approaches. At
inference time, our model with ResNet50 approaches real-time performance on a
single GPU.
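The abstract describes a division of labour between the two streams: the unary stream preferentially raises the scores of positive human-object pairs, while the pairwise stream suppresses negatives. As a rough illustration of that two-stream scoring idea, here is a minimal sketch in plain Python; the logit-sum fusion and the `gamma` exponent on the detector confidences are assumptions for illustration, not the exact formulation from the paper:

```python
import math


def sigmoid(x: float) -> float:
    """Standard logistic function, mapping a logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_pair_score(human_score: float, object_score: float,
                    unary_logit: float, pairwise_logit: float,
                    gamma: float = 2.0) -> float:
    """Toy fusion of detector confidences with two-stream interaction logits.

    The unary logit can push a pair's score up (positive examples), while
    the pairwise logit can pull it down (negative examples); summing the
    logits before the sigmoid lets either stream dominate. `gamma` is a
    hypothetical exponent that sharpens the detection-confidence prior.
    """
    interaction = sigmoid(unary_logit + pairwise_logit)
    return (human_score * object_score) ** gamma * interaction
```

In this sketch, a confident pairwise suppression (a strongly negative pairwise logit) drives the fused score toward zero even when the unary stream and the underlying detections are confident, mirroring the specialisation the authors report.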
Related papers
- DOAD: Decoupled One Stage Action Detection Network [77.14883592642782]
Localizing people and recognizing their actions from videos is a challenging task towards high-level video understanding.
Existing methods are mostly two-stage based, with one stage for person bounding box generation and the other stage for action recognition.
We present a decoupled one-stage network, dubbed DOAD, to improve efficiency for spatio-temporal action detection.
arXiv Detail & Related papers (2023-04-01T08:06:43Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- Few-Shot Object Detection with Fully Cross-Transformer [35.49840687007507]
Few-shot object detection (FSOD) aims to detect novel objects using very few training examples.
We propose a novel Fully Cross-Transformer based model (FCT) for FSOD by incorporating cross-transformer into both the feature backbone and detection head.
Our model can improve the few-shot similarity learning between the two branches by introducing the multi-level interactions.
arXiv Detail & Related papers (2022-03-28T18:28:51Z)
- Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows [57.00864538284686]
Iwin Transformer is a hierarchical Transformer which progressively performs token representation learning and token agglomeration within irregular windows.
The effectiveness and efficiency of Iwin Transformer are verified on the two standard HOI detection benchmark datasets.
arXiv Detail & Related papers (2022-03-20T12:04:50Z)
- Towards Data-Efficient Detection Transformers [77.43470797296906]
We show most detection transformers suffer from significant performance drops on small-size datasets.
We empirically analyze the factors that affect data efficiency, through a step-by-step transition from a data-efficient RCNN variant to the representative DETR.
We introduce a simple yet effective label augmentation method to provide richer supervision and improve data efficiency.
arXiv Detail & Related papers (2022-03-17T17:56:34Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
Our CvT-ASSD model achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- Visual Saliency Transformer [127.33678448761599]
We develop a novel unified model based on a pure transformer, the Visual Saliency Transformer (VST), for both RGB and RGB-D salient object detection (SOD).
It takes image patches as inputs and leverages the transformer to propagate global contexts among image patches.
Experimental results show that our model outperforms existing state-of-the-art results on both RGB and RGB-D SOD benchmark datasets.
arXiv Detail & Related papers (2021-04-25T08:24:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.