Efficient Decoder-free Object Detection with Transformers
- URL: http://arxiv.org/abs/2206.06829v3
- Date: Thu, 16 Jun 2022 01:53:08 GMT
- Title: Efficient Decoder-free Object Detection with Transformers
- Authors: Peixian Chen, Mengdan Zhang, Yunhang Shen, Kekai Sheng, Yuting Gao,
Xing Sun, Ke Li, Chunhua Shen
- Abstract summary: Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
- Score: 75.00499377197475
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision transformers (ViTs) are changing the landscape of object detection
approaches. A natural usage of ViTs in detection is to replace the CNN-based
backbone with a transformer-based backbone, which is straightforward and
effective, but at the price of a considerable computation burden at inference. A more
subtle usage is the DETR family, which eliminates the need for
many hand-designed components in object detection but introduces a decoder
demanding an extra-long time to converge. As a result, transformer-based object
detection cannot prevail in large-scale applications. To overcome these
issues, we propose a novel decoder-free fully transformer-based (DFFT) object
detector, achieving high efficiency in both training and inference stages, for
the first time. We simplify object detection into an encoder-only
single-level anchor-based dense prediction problem by centering around two
entry points: 1) Eliminate the training-inefficient decoder and leverage two
strong encoders to preserve the accuracy of single-level feature map
prediction; 2) Explore low-level semantic features for the detection task with
limited computational resources. In particular, we design a novel lightweight
detection-oriented transformer backbone that efficiently captures low-level
features with rich semantics based on a well-conceived ablation study.
Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL
outperforms DETR by 2.5% AP with a 28% reduction in computation cost and more
than 10x fewer training epochs. Compared with the cutting-edge anchor-based
detector RetinaNet, DFFT_SMALL obtains over a 5.5% AP gain while cutting
computation cost by 70%.
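Since the abstract describes the architecture only in prose, the following is a minimal PyTorch sketch of what an encoder-only, single-level, anchor-based dense-prediction detector can look like. The module names, the channel and anchor counts, and the plain transformer encoder standing in for the detection-oriented backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an encoder-only, single-level, anchor-based
# dense-prediction detector in the spirit of DFFT. Module names, sizes,
# and the plain nn.TransformerEncoder backbone are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class EncoderOnlyDetector(nn.Module):
    def __init__(self, num_classes=80, num_anchors=9, dim=256):
        super().__init__()
        # Stand-in for the detection-oriented transformer backbone:
        # a patch embedding followed by a stack of encoder layers.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Dense anchor-based heads on one feature level: no decoder
        # and no learned object queries.
        self.cls_head = nn.Conv2d(dim, num_anchors * num_classes, 3, padding=1)
        self.reg_head = nn.Conv2d(dim, num_anchors * 4, 3, padding=1)

    def forward(self, images):                    # images: (B, 3, H, W)
        feat = self.patch_embed(images)           # (B, dim, H/16, W/16)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, dim)
        tokens = self.encoder(tokens)             # global self-attention
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        # Per-location, per-anchor class scores and box offsets.
        return self.cls_head(feat), self.reg_head(feat)

model = EncoderOnlyDetector()
cls_logits, box_deltas = model(torch.randn(1, 3, 512, 512))
print(cls_logits.shape, box_deltas.shape)  # (1, 720, 32, 32) (1, 36, 32, 32)
```

The point of the sketch is the shape of the pipeline: a single feature level flows straight from the encoder to dense heads, with no decoder and no object queries, which is where the training-efficiency claim comes from.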
Related papers
- Simplifying Two-Stage Detectors for On-Device Inference in Remote Sensing [0.7305342793164903]
We propose a model simplification method for two-stage object detectors.
Our method reduces computation costs by up to 61.2% with an accuracy loss within 2.1% on the DOTAv1.5 dataset.
arXiv Detail & Related papers (2024-04-11T00:45:10Z)
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state of the art in few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- SALISA: Saliency-based Input Sampling for Efficient Video Object Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
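(A simplified sketch of this saliency-guided sampling idea appears after this list.)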
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
Our model CvT-ASSD achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer (O2DETR) based on an end-to-end network.
We design a simple but highly efficient encoder for the Transformer by replacing the attention mechanism with depthwise separable convolution; a toy sketch of this swap appears after this list.
Our O2DETR can serve as a new benchmark in the field of oriented object detection, achieving up to 3.85 mAP improvement over Faster R-CNN and RetinaNet.
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
- SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector.
It forms a single shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Though structurally simple, it achieves state-of-the-art results and a real-time speed of 20 FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)
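The SALISA entry above names its technique, non-uniform saliency-based input sampling for video object detection, without detailing it. As a simplified stand-in for that idea, the sketch below spends resolution where saliency is high: a cheap low-resolution view of the whole frame plus a full-resolution crop around the saliency peak. The two-view approximation, the saliency source, and all sizes are assumptions for illustration, not the paper's method.

```python
# Toy illustration of saliency-guided input sampling in the spirit of the
# SALISA summary above. The two-view crop policy, the saliency source, and
# the sizes are assumptions, not the paper's actual resampling scheme.
import torch
import torch.nn.functional as F

def saliency_guided_inputs(frame, saliency, low_res=(256, 256), crop=128):
    """frame: (3, H, W); saliency: (H, W) in [0, 1], e.g. derived from
    detections in the previous frame. Returns a cheap low-resolution view
    of the whole frame plus a full-resolution crop centred on the saliency
    peak, where small objects are easiest to lose."""
    _, H, W = frame.shape
    # Whole frame at reduced resolution: cheap global context.
    coarse = F.interpolate(frame.unsqueeze(0), size=low_res,
                           mode='bilinear', align_corners=False)[0]
    # Full-resolution patch around the most salient location.
    idx = int(saliency.flatten().argmax())
    cy, cx = idx // W, idx % W
    y0 = max(0, min(cy - crop // 2, H - crop))
    x0 = max(0, min(cx - crop // 2, W - crop))
    fine = frame[:, y0:y0 + crop, x0:x0 + crop]
    return coarse, fine

coarse, fine = saliency_guided_inputs(torch.rand(3, 720, 1280),
                                      torch.rand(720, 1280))
print(coarse.shape, fine.shape)  # (3, 256, 256) (3, 128, 128)
```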
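The O2DETR entry above replaces the Transformer encoder's attention with depthwise separable convolution. Below is a toy sketch of that kind of swap on a 2-D feature map; the block structure, normalization, and sizes are assumptions for illustration, not the paper's encoder. The depthwise convolution mixes information spatially per channel at cost linear in the number of locations, and the pointwise convolution then mixes across channels, together standing in for quadratic-cost self-attention.

```python
# Toy sketch of swapping self-attention for depthwise separable convolution,
# as the O2DETR summary above describes. Block structure and sizes are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class DepthwiseSeparableMixer(nn.Module):
    """Token mixing via a depthwise conv (per-channel spatial filtering,
    groups=dim) followed by a pointwise 1x1 conv (cross-channel mixing),
    used in place of a self-attention layer."""
    def __init__(self, dim=256, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (B, dim, H, W) feature map
        # Residual connection, as a transformer block would have.
        return x + self.act(self.norm(self.pointwise(self.depthwise(x))))

mixer = DepthwiseSeparableMixer()
y = mixer(torch.randn(2, 256, 64, 64))
print(y.shape)  # (2, 256, 64, 64): same shape in and out, like attention
```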
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.