Fast Convergence of DETR with Spatially Modulated Co-Attention
- URL: http://arxiv.org/abs/2108.02404v1
- Date: Thu, 5 Aug 2021 06:53:19 GMT
- Title: Fast Convergence of DETR with Spatially Modulated Co-Attention
- Authors: Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, Hongsheng Li
- Abstract summary: We propose a simple yet effective scheme for improving the Detection Transformer framework, namely Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder.
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
- Score: 83.19863907905666
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The recently proposed Detection Transformer (DETR) model successfully applies
Transformer to object detection and achieves performance comparable to
two-stage object detection frameworks such as Faster R-CNN. However, DETR
suffers from slow convergence: training DETR from scratch needs 500 epochs
to achieve high accuracy. To accelerate its convergence, we propose a simple
yet effective scheme for improving the DETR framework, namely Spatially
Modulated Co-Attention (SMCA) mechanism. The core idea of SMCA is to conduct
location-aware co-attention in DETR by constraining co-attention responses to
be high near initially estimated bounding box locations. Our proposed SMCA
increases DETR's convergence speed by replacing the original co-attention
mechanism in the decoder while keeping other operations in DETR unchanged.
Furthermore, by integrating multi-head and scale-selection attention designs
into SMCA, our fully-fledged SMCA can achieve better performance compared to
DETR with a dilated convolution-based backbone (45.6 mAP at 108 epochs vs. 43.3
mAP at 500 epochs). We perform extensive ablation studies on the COCO dataset to
validate SMCA. Code is released at https://github.com/gaopengcuhk/SMCA-DETR .
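For a concrete picture of the mechanism described in the abstract, the following is a minimal, single-head sketch of how a Gaussian-like spatial prior predicted from each object query can modulate decoder co-attention. It is an illustrative reconstruction from the abstract, not the authors' implementation: the layer names, the `beta` bandwidth hyperparameter, and the single-head simplification are assumptions, and the released code at the GitHub link above is the authoritative reference.

```python
# Minimal sketch of SMCA-style spatially modulated decoder co-attention.
# Illustrative only: reconstructs the idea of "constraining co-attention
# responses to be high near initially estimated bounding box locations".
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SpatiallyModulatedCoAttention(nn.Module):
    def __init__(self, d_model=256, beta=1.0):
        super().__init__()
        self.d_model = d_model
        self.beta = beta  # bandwidth of the spatial prior (name assumed here)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        # Each object query predicts a normalized center (cx, cy) and a
        # scale (sw, sh) for its Gaussian-like spatial weight map.
        self.center_scale = nn.Linear(d_model, 4)

    def forward(self, queries, memory, h, w):
        # queries: (B, Nq, d); memory: flattened encoder features (B, h*w, d)
        q = self.q_proj(queries)
        k = self.k_proj(memory)
        v = self.v_proj(memory)

        # Predict per-query spatial priors: centers in [0, 1], positive scales
        # (the exp parameterization of the scales is an illustrative choice).
        cs = self.center_scale(queries)
        cx, cy = cs[..., 0].sigmoid(), cs[..., 1].sigmoid()   # (B, Nq)
        sw, sh = cs[..., 2:4].exp().unbind(-1)                # (B, Nq)

        # Gaussian-like weight map over the h x w feature grid, in log space.
        ys, xs = torch.meshgrid(
            torch.linspace(0, 1, h, device=queries.device),
            torch.linspace(0, 1, w, device=queries.device),
            indexing="ij",
        )
        xs = xs.reshape(1, 1, h * w)
        ys = ys.reshape(1, 1, h * w)
        log_g = -(
            (xs - cx.unsqueeze(-1)) ** 2 / (self.beta * sw.unsqueeze(-1) ** 2)
            + (ys - cy.unsqueeze(-1)) ** 2 / (self.beta * sh.unsqueeze(-1) ** 2)
        )  # (B, Nq, h*w)

        # Standard dot-product co-attention, modulated by adding the log of
        # the spatial prior to the attention logits before the softmax.
        logits = q @ k.transpose(-2, -1) / math.sqrt(self.d_model)
        attn = F.softmax(logits + log_g, dim=-1)
        return attn @ v  # (B, Nq, d)
```

The fully-fledged SMCA described in the abstract additionally uses multi-head spatial modulation and scale-selection attention over multi-scale encoder features, which this single-head sketch omits.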
Related papers
- Detection Transformer with Stable Matching [48.963171068785435]
We show that the most important design is to use positional metrics, and only positional metrics, to supervise the classification scores of positive examples.
Under this principle, we propose two simple yet effective modifications that integrate positional metrics into DETR's classification loss and matching cost.
We achieve 50.4 and 51.5 AP on the COCO detection benchmark with a ResNet-50 backbone under 12-epoch and 24-epoch training settings.
arXiv Detail & Related papers (2023-04-10T17:55:37Z) - Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion [95.7732308775325]
The DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection.
DETR suffers from slow training convergence, which hinders its applicability to various detection tasks.
We design Semantic-Aligned-Matching DETR++ to accelerate DETR's convergence and improve detection performance.
arXiv Detail & Related papers (2022-07-28T15:34:29Z) - Accelerating DETR Convergence via Semantic-Aligned Matching [50.3633635846255]
This paper presents SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy.
It explicitly searches salient points with the most discriminative features for semantic-aligned matching, which further speeds up the convergence and boosts detection accuracy as well.
arXiv Detail & Related papers (2022-03-14T06:50:51Z) - Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain under the same 50-epoch training setting.
arXiv Detail & Related papers (2021-12-09T00:29:19Z) - Conditional DETR for Fast Training Convergence [76.95358216461524]
We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings for localizing the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
arXiv Detail & Related papers (2021-08-13T10:07:46Z)