Efficient Decoder-free Object Detection with Transformers
- URL: http://arxiv.org/abs/2206.06829v3
- Date: Thu, 16 Jun 2022 01:53:08 GMT
- Title: Efficient Decoder-free Object Detection with Transformers
- Authors: Peixian Chen, Mengdan Zhang, Yunhang Shen, Kekai Sheng, Yuting Gao,
Xing Sun, Ke Li, Chunhua Shen
- Abstract summary: Vision transformers (ViTs) are changing the landscape of object detection approaches.
We propose a decoder-free fully transformer-based (DFFT) object detector.
DFFT_SMALL achieves high efficiency in both training and inference stages.
- Score: 75.00499377197475
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Vision transformers (ViTs) are changing the landscape of object detection
approaches. A natural usage of ViTs in detection is to replace the CNN-based
backbone with a transformer-based backbone, which is straightforward and
effective, but at the price of a considerable computation burden at inference. A more
subtle usage is the DETR family, which eliminates the need for
many hand-designed components in object detection but introduces a decoder
demanding an extra-long time to converge. As a result, transformer-based object
detection cannot prevail in large-scale applications. To overcome these
issues, we propose a novel decoder-free fully transformer-based (DFFT) object
detector, achieving high efficiency in both training and inference stages, for
the first time. We simplify object detection into an encoder-only
single-level anchor-based dense prediction problem by centering around two
entry points: 1) Eliminate the training-inefficient decoder and leverage two
strong encoders to preserve the accuracy of single-level feature map
prediction; 2) Explore low-level semantic features for the detection task with
limited computational resources. In particular, we design a novel lightweight
detection-oriented transformer backbone that efficiently captures low-level
features with rich semantics based on a well-conceived ablation study.
Extensive experiments on the MS COCO benchmark demonstrate that DFFT_SMALL
outperforms DETR by 2.5% AP with a 28% reduction in computation cost and more
than 10x fewer training epochs. Compared with the cutting-edge anchor-based
detector RetinaNet, DFFT_SMALL obtains over a 5.5% AP gain while cutting
computation cost by 70%.
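Since the abstract describes the architecture only in prose, the following is a minimal PyTorch sketch of what an encoder-only, single-level, anchor-based dense-prediction detector can look like. The module names, the channel and anchor counts, and the plain transformer encoder standing in for the detection-oriented backbone are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an encoder-only, single-level, anchor-based
# dense-prediction detector in the spirit of DFFT. Module names, sizes,
# and the plain nn.TransformerEncoder backbone are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn

class EncoderOnlyDetector(nn.Module):
    def __init__(self, num_classes=80, num_anchors=9, dim=256):
        super().__init__()
        # Stand-in for the detection-oriented transformer backbone:
        # a patch embedding followed by a stack of encoder layers.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=16, stride=16)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Dense anchor-based heads on one feature level: no decoder
        # and no learned object queries.
        self.cls_head = nn.Conv2d(dim, num_anchors * num_classes, 3, padding=1)
        self.reg_head = nn.Conv2d(dim, num_anchors * 4, 3, padding=1)

    def forward(self, images):                    # images: (B, 3, H, W)
        feat = self.patch_embed(images)           # (B, dim, H/16, W/16)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)  # (B, h*w, dim)
        tokens = self.encoder(tokens)             # global self-attention
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        # Per-location, per-anchor class scores and box offsets.
        return self.cls_head(feat), self.reg_head(feat)

model = EncoderOnlyDetector()
cls_logits, box_deltas = model(torch.randn(1, 3, 512, 512))
print(cls_logits.shape, box_deltas.shape)  # (1, 720, 32, 32) (1, 36, 32, 32)
```

The point of the sketch is the shape of the pipeline: a single feature level flows straight from the encoder to dense heads, with no decoder and no object queries, which is where the training-efficiency claim comes from.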
Related papers
- Simplifying Two-Stage Detectors for On-Device Inference in Remote Sensing [0.7305342793164903]
We propose a model simplification method for two-stage object detectors.
Our method reduces computation costs by up to 61.2% with an accuracy loss within 2.1% on the DOTAv1.5 dataset.
arXiv Detail & Related papers (2024-04-11T00:45:10Z)
- Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection [78.2325219839805]
imTED improves the state of the art in few-shot object detection by up to 7.6% AP.
Experiments on the MS COCO dataset demonstrate that imTED consistently outperforms its counterparts by 2.8%.
arXiv Detail & Related papers (2022-05-19T15:11:20Z)
- An Extendable, Efficient and Effective Transformer-based Object Detector [95.06044204961009]
We integrate Vision and Detection Transformers (ViDT) to construct an effective and efficient object detector.
ViDT introduces a reconfigured attention module to extend the recent Swin Transformer to be a standalone object detector.
We extend it to ViDT+ to support joint-task learning for object detection and instance segmentation.
arXiv Detail & Related papers (2022-04-17T09:27:45Z)
- SALISA: Saliency-based Input Sampling for Efficient Video Object Detection [58.22508131162269]
We propose SALISA, a novel non-uniform SALiency-based Input SAmpling technique for video object detection.
We show that SALISA significantly improves the detection of small objects.
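(A simplified sketch of this saliency-guided sampling idea appears after this list.)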
arXiv Detail & Related papers (2022-04-05T17:59:51Z)
- CvT-ASSD: Convolutional vision-Transformer Based Attentive Single Shot MultiBox Detector [15.656374849760734]
We present a novel object detection architecture, named Convolutional vision Transformer Based Attentive Single Shot MultiBox Detector (CvT-ASSD).
Our model CvT-ASSD achieves good system efficiency and performance when pretrained on large-scale detection datasets such as PASCAL VOC and MS COCO.
arXiv Detail & Related papers (2021-10-24T06:45:33Z)
- Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer (O2DETR) based on an end-to-end network.
We design a simple but highly efficient encoder for the Transformer by replacing the attention mechanism with depthwise separable convolution; a toy sketch of this swap appears after this list.
Our O2DETR can serve as a new benchmark in the field of oriented object detection, achieving up to 3.85 mAP improvement over Faster R-CNN and RetinaNet.
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
- SADet: Learning An Efficient and Accurate Pedestrian Detector [68.66857832440897]
This paper proposes a series of systematic optimization strategies for the detection pipeline of a one-stage detector.
It forms a single shot anchor-based detector (SADet) for efficient and accurate pedestrian detection.
Though structurally simple, it achieves state-of-the-art results and a real-time speed of 20 FPS for VGA-resolution images.
arXiv Detail & Related papers (2020-07-26T12:32:38Z)
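The SALISA entry above names its technique, non-uniform saliency-based input sampling for video object detection, without detailing it. As a simplified stand-in for that idea, the sketch below spends resolution where saliency is high: a cheap low-resolution view of the whole frame plus a full-resolution crop around the saliency peak. The two-view approximation, the saliency source, and all sizes are assumptions for illustration, not the paper's method.

```python
# Toy illustration of saliency-guided input sampling in the spirit of the
# SALISA summary above. The two-view crop policy, the saliency source, and
# the sizes are assumptions, not the paper's actual resampling scheme.
import torch
import torch.nn.functional as F

def saliency_guided_inputs(frame, saliency, low_res=(256, 256), crop=128):
    """frame: (3, H, W); saliency: (H, W) in [0, 1], e.g. derived from
    detections in the previous frame. Returns a cheap low-resolution view
    of the whole frame plus a full-resolution crop centred on the saliency
    peak, where small objects are easiest to lose."""
    _, H, W = frame.shape
    # Whole frame at reduced resolution: cheap global context.
    coarse = F.interpolate(frame.unsqueeze(0), size=low_res,
                           mode='bilinear', align_corners=False)[0]
    # Full-resolution patch around the most salient location.
    idx = int(saliency.flatten().argmax())
    cy, cx = idx // W, idx % W
    y0 = max(0, min(cy - crop // 2, H - crop))
    x0 = max(0, min(cx - crop // 2, W - crop))
    fine = frame[:, y0:y0 + crop, x0:x0 + crop]
    return coarse, fine

coarse, fine = saliency_guided_inputs(torch.rand(3, 720, 1280),
                                      torch.rand(720, 1280))
print(coarse.shape, fine.shape)  # (3, 256, 256) (3, 128, 128)
```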
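The O2DETR entry above replaces the Transformer encoder's attention with depthwise separable convolution. Below is a toy sketch of that kind of swap on a 2-D feature map; the block structure, normalization, and sizes are assumptions for illustration, not the paper's encoder. The depthwise convolution mixes information spatially per channel at cost linear in the number of locations, and the pointwise convolution then mixes across channels, together standing in for quadratic-cost self-attention.

```python
# Toy sketch of swapping self-attention for depthwise separable convolution,
# as the O2DETR summary above describes. Block structure and sizes are
# assumptions for illustration, not the paper's implementation.
import torch
import torch.nn as nn

class DepthwiseSeparableMixer(nn.Module):
    """Token mixing via a depthwise conv (per-channel spatial filtering,
    groups=dim) followed by a pointwise 1x1 conv (cross-channel mixing),
    used in place of a self-attention layer."""
    def __init__(self, dim=256, kernel_size=7):
        super().__init__()
        self.depthwise = nn.Conv2d(dim, dim, kernel_size,
                                   padding=kernel_size // 2, groups=dim)
        self.pointwise = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm = nn.BatchNorm2d(dim)
        self.act = nn.GELU()

    def forward(self, x):                 # x: (B, dim, H, W) feature map
        # Residual connection, as a transformer block would have.
        return x + self.act(self.norm(self.pointwise(self.depthwise(x))))

mixer = DepthwiseSeparableMixer()
y = mixer(torch.randn(2, 256, 64, 64))
print(y.shape)  # (2, 256, 64, 64): same shape in and out, like attention
```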
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it presents and is not responsible for any consequences.