Fast Convergence of DETR with Spatially Modulated Co-Attention
- URL: http://arxiv.org/abs/2101.07448v1
- Date: Tue, 19 Jan 2021 03:52:44 GMT
- Title: Fast Convergence of DETR with Spatially Modulated Co-Attention
- Authors: Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, Hongsheng Li
- Abstract summary: We propose a simple yet effective scheme for improving the Detection Transformer framework, namely Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder.
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
- Score: 83.19863907905666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The recently proposed Detection Transformer (DETR) model successfully applies
Transformer to object detection and achieves performance comparable to
two-stage object detection frameworks such as Faster R-CNN. However, DETR
suffers from slow convergence: training DETR \cite{carion2020end} from
scratch requires 500 epochs to reach high accuracy. To accelerate its
convergence, we propose a simple yet effective scheme for improving the DETR
framework, namely Spatially Modulated Co-Attention (SMCA) mechanism. The core
idea of SMCA is to conduct regression-aware co-attention in DETR by
constraining co-attention responses to be high near initially estimated
bounding box locations. Our proposed SMCA increases DETR's convergence speed by
replacing the original co-attention mechanism in the decoder while keeping
other operations in DETR unchanged. Furthermore, by integrating multi-head and
scale-selection attention designs into SMCA, our fully-fledged SMCA can achieve
better performance compared to DETR with a dilated convolution-based backbone
(45.6 mAP at 108 epochs vs. 43.3 mAP at 500 epochs). We perform extensive
ablation studies on COCO dataset to validate the effectiveness of the proposed
SMCA.
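The abstract gives enough detail to sketch the core idea: regress an initial box estimate from each object query, place a Gaussian-like weight map around that estimate, and use the map to bias the decoder's co-attention toward the predicted region. The snippet below is a minimal, single-head sketch of that spatial modulation only; the function name, the per-query center/scale parameterization, and the additive log-space bias are illustrative assumptions rather than the paper's reference implementation, and the multi-head and scale-selection attention designs are omitted.

```python
# Minimal sketch of spatially modulated co-attention (SMCA): co-attention
# logits are biased so responses stay high near an initially estimated
# bounding box location. The parameterization here is an assumption for
# illustration, not the paper's reference implementation.
import torch
import torch.nn.functional as F


def smca_co_attention(query, key, value, centers, scales, feat_h, feat_w):
    """query:    (Q, D) decoder object queries
    key/value:   (H*W, D) flattened encoder features
    centers:     (Q, 2) initially estimated box centers (cx, cy) in [0, 1]
    scales:      (Q, 2) predicted spatial extents (sx, sy), > 0
    """
    d = query.shape[-1]
    # Standard co-attention (cross-attention) logits between object queries
    # and encoder features.
    logits = query @ key.t() / d ** 0.5                         # (Q, H*W)

    # Gaussian-like spatial prior centered at each query's estimated box.
    ys = torch.linspace(0, 1, feat_h, device=query.device)
    xs = torch.linspace(0, 1, feat_w, device=query.device)
    gy, gx = torch.meshgrid(ys, xs, indexing="ij")               # (H, W)
    grid = torch.stack([gx, gy], dim=-1).view(-1, 2)             # (H*W, 2)
    diff = grid[None] - centers[:, None]                         # (Q, H*W, 2)
    spatial_bias = -(diff ** 2 / (2 * scales[:, None] ** 2)).sum(-1)

    # Spatial modulation: adding the log-space prior before the softmax
    # multiplies the attention map by the Gaussian weight map, pushing
    # responses toward the initially estimated bounding box.
    attn = F.softmax(logits + spatial_bias, dim=-1)
    return attn @ value                                          # (Q, D)
```

Because the bias enters before the softmax, it acts multiplicatively on the attention weights, which is what constrains co-attention responses to be high near the initial box estimate while leaving the rest of the DETR decoder unchanged.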
Related papers
- Relation DETR: Exploring Explicit Position Relation Prior for Object Detection [26.03892270020559]
We present a scheme for enhancing the convergence and performance of DETR (DEtection TRansformer).
Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement.
Experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-07-16T13:17:07Z) - Align-DETR: Improving DETR with Simple IoU-aware BCE loss [32.13866392998818]
We propose a metric, recall of best-regressed samples, to quantitatively evaluate the misalignment problem.
The proposed loss, IA-BCE, guides the training of DETR to build a strong correlation between classification score and localization precision.
To overcome the dramatic decrease in sample quality induced by the sparsity of queries, we introduce a prime sample weighting mechanism.
arXiv Detail & Related papers (2023-04-15T10:24:51Z) - Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion [95.7732308775325]
The proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection.
DETR suffers from slow training convergence, which hinders its applicability to various detection tasks.
We design Semantic-Aligned-Matching DETR++ to accelerate DETR's convergence and improve detection performance.
arXiv Detail & Related papers (2022-07-28T15:34:29Z) - Accelerating DETR Convergence via Semantic-Aligned Matching [50.3633635846255]
This paper presents SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy.
It explicitly searches for salient points with the most discriminative features for semantic-aligned matching, which further speeds up convergence and boosts detection accuracy.
arXiv Detail & Related papers (2022-03-14T06:50:51Z) - Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to a 7% relative gain under the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z) - Conditional DETR for Fast Training Convergence [76.95358216461524]
We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that the cross-attention in DETR relies heavily on the content embeddings to localize the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
arXiv Detail & Related papers (2021-08-13T10:07:46Z)