Related papers: DETR++: Taming Your Multi-Scale Detection Transformer

DETR++: Taming Your Multi-Scale Detection Transformer

URL: http://arxiv.org/abs/2206.02977v1
Date: Tue, 7 Jun 2022 02:38:31 GMT
Title: DETR++: Taming Your Multi-Scale Detection Transformer
Authors: Chi Zhang, Lijuan Liu, Xiaoxue Zang, Frederick Liu, Hao Zhang, Xinying Song, Jindong Chen
Abstract summary: We introduce the Transformer-based detection method, i.e., DETR. Due to the quadratic complexity in the self-attention mechanism in the Transformer, DETR is never able to incorporate multi-scale features. We propose DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction.
Score: 22.522422934209807
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Convolutional Neural Networks (CNN) have dominated the field of detection ever since the success of AlexNet in ImageNet classification [12]. With the sweeping reform of Transformers [27] in natural language processing, Carion et al. [2] introduce the Transformer-based detection method, i.e., DETR. However, due to the quadratic complexity in the self-attention mechanism in the Transformer, DETR is never able to incorporate multi-scale features as performed in existing CNN-based detectors, leading to inferior results in small object detection. To mitigate this issue and further improve performance of DETR, in this work, we investigate different methods to incorporate multi-scale features and find that a Bi-directional Feature Pyramid (BiFPN) works best with DETR in further raising the detection precision. With this discovery, we propose DETR++, a new architecture that improves detection results by 1.9% AP on MS COCO 2017, 11.5% AP on RICO icon detection, and 9.1% AP on RICO layout extraction over existing baselines.

Related papers

A DeNoising FPN With Transformer R-CNN for Tiny Object Detection [25.892598910922004]
We propose a new framework, DeNoising FPN with Trans R-CNN (DNTR), to improve the performance of tiny object detection. DNTR consists of an easy plug-in design, DeNoising FPN (DN-FPN), and an effective Transformer-based detector, Trans R-CNN. We replace the obsolete R-CNN detector with a novel Trans R-CNN detector to focus on the representation of tiny objects with self-attention.
arXiv Detail & Related papers (2024-06-09T12:18:15Z)
DETR Doesn't Need Multi-Scale or Locality Design [69.56292005230185]
This paper presents an improved DETR detector that maintains a "plain" nature. It uses a single-scale feature map and global cross-attention calculations without specific locality constraints. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints.
arXiv Detail & Related papers (2023-08-03T17:59:04Z)
Hierarchical Point Attention for Indoor 3D Object Detection [111.04397308495618]
This work proposes two novel attention operations as generic hierarchical designs for point-based transformer detectors. First, we propose Multi-Scale Attention (MS-A) that builds multi-scale tokens from a single-scale input feature to enable more fine-grained feature learning. Second, we propose Size-Adaptive Local Attention (Local-A) with adaptive attention regions for localized feature aggregation within bounding box proposals.
arXiv Detail & Related papers (2023-01-06T18:52:12Z)
CNN-transformer mixed model for object detection [3.5897534810405403]
In this paper, I propose a convolutional module with a transformer. It aims to improve the recognition accuracy of the model by fusing the detailed features extracted by CNN with the global features extracted by a transformer. After 100 rounds of training on the Pascal VOC dataset, the accuracy of the results reached 81%, which is 4.6 better than the faster RCNN[4] using resnet101[5] as the backbone.
arXiv Detail & Related papers (2022-12-13T16:35:35Z)
Efficient Decoder-free Object Detection with Transformers [75.00499377197475]
Vision transformers (ViTs) are changing the landscape of object detection approaches. We propose a decoder-free fully transformer-based (DFFT) object detector. DFFT_SMALL achieves high efficiency in both training and inference stages.
arXiv Detail & Related papers (2022-06-14T13:22:19Z)
DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection [44.94166578314837]
We propose a pure Transformer-based SOD framework, namely Depth-supervised hierarchical feature Fusion TRansformer (DFTR) We extensively evaluate the proposed DFTR on ten benchmarking datasets. Experimental results show that our DFTR consistently outperforms the existing state-of-the-art methods for both RGB and RGB-D SOD tasks.
arXiv Detail & Related papers (2022-03-12T12:59:12Z)
Simple Training Strategies and Model Scaling for Object Detection [38.27709720726833]
We benchmark improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors. The vanilla detectors are improved by 7.7% in accuracy while being 30% faster in speed. Our largest Cascade RCNN-RS models achieve 52.9% AP with a ResNet152-FPN backbone and 53.6% with a SpineNet143L backbone.
arXiv Detail & Related papers (2021-06-30T18:41:47Z)
Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer ($bf O2DETR$) based on an end-to-end network. We design a simple but highly efficient encoder for Transformer by replacing the attention mechanism with depthwise separable convolution. Our $rm O2DETR$ can be another new benchmark in the field of oriented object detection, which achieves up to 3.85 mAP improvement over Faster R-CNN and RetinaNet.
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
DA-DETR: Domain Adaptive Detection Transformer with Information Fusion [53.25930448542148]
DA-DETR is a domain adaptive object detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain. We introduce a novel CNN-Transformer Blender (CTBlender) that fuses the CNN features and Transformer features ingeniously for effective feature alignment and knowledge transfer across domains. CTBlender employs the Transformer features to modulate the CNN features across multiple scales where the high-level semantic information and the low-level spatial information are fused for accurate object identification and localization.
arXiv Detail & Related papers (2021-03-31T13:55:56Z)
Rethinking Transformer-based Set Prediction for Object Detection [57.7208561353529]
Experimental results show that the proposed methods not only converge much faster than the original DETR, but also significantly outperform DETR and other baselines in terms of detection accuracy.
arXiv Detail & Related papers (2020-11-21T21:59:42Z)

This list is automatically generated from the titles and abstracts of the papers in this site.