Miti-DETR: Object Detection based on Transformers with Mitigatory
Self-Attention Convergence
- URL: http://arxiv.org/abs/2112.13310v1
- Date: Sun, 26 Dec 2021 03:23:59 GMT
- Authors: Wenchi Ma, Tianxiao Zhang, Guanghui Wang
- Abstract summary: We propose a transformer architecture with a mitigatory self-attention mechanism.
Miti-DETR adds the input of each attention layer to that layer's output, so that the "non-attention" information participates in attention propagation.
Miti-DETR significantly improves average detection precision and convergence speed over existing DETR-based models.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Object Detection with Transformers (DETR) and related works reach or even
surpass the highly-optimized Faster-RCNN baseline with self-attention network
architectures. Inspired by the evidence that pure self-attention possesses a
strong inductive bias that causes the transformer to lose expressive power as
network depth grows, we propose a transformer architecture with a mitigatory
self-attention mechanism, applying direct mapping connections in the
transformer architecture to mitigate rank collapse, counteract the loss of
feature expressiveness, and enhance model performance. We apply this proposal
to object detection and develop a model named Miti-DETR. Miti-DETR adds the
input of each attention layer to that layer's output, so that the
"non-attention" information participates in every attention propagation. The
resulting residual self-attention network addresses two critical issues: (1)
it prevents the self-attention network from degenerating toward rank-1 to the
greatest possible extent; and (2) it further diversifies the path distribution
of parameter updates, so that attention is easier to learn. Miti-DETR
significantly improves average detection precision and convergence speed over
existing DETR-based models on the challenging COCO object detection dataset.
Moreover, the proposed transformer with the residual self-attention network
can be easily generalized or plugged into other related task models without
specific customization.
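The residual ("mitigatory") connection described above can be sketched in a few lines. The following is a hypothetical NumPy illustration of the idea, not the authors' implementation; the single-head formulation and weight shapes are our own assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(x, Wq, Wk, Wv):
    # standard single-head scaled dot-product self-attention
    # x: (tokens, dim); Wq, Wk, Wv: (dim, dim)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def miti_attention_layer(x, Wq, Wk, Wv):
    # mitigatory variant: the layer input is added to the attention
    # output, so "non-attention" information keeps propagating and the
    # stacked network is pushed away from rank-1 degeneration
    return x + attention(x, Wq, Wk, Wv)
```

Stacking `miti_attention_layer` preserves a direct path from the input of every layer to the output, which is the mechanism the abstract credits with mitigating rank collapse.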
Related papers
- Rethinking Attention: Exploring Shallow Feed-Forward Neural Networks as an Alternative to Attention Layers in Transformers [5.356051655680145]
This work presents an analysis of the effectiveness of using standard shallow feed-forward networks to mimic the behavior of the attention mechanism in the original Transformer model.
We substitute key elements of the attention mechanism in the Transformer with simple feed-forward networks, trained using the original components via knowledge distillation.
Our experiments, conducted on the IWSLT 2017 dataset, reveal the capacity of these "attentionless Transformers" to rival the performance of the original architecture.
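As a rough illustration of the substitution described above (a sketch under our own assumptions, not the paper's actual architecture), a shallow feed-forward network over a flattened, fixed-length token sequence can stand in for an attention block:

```python
import numpy as np

def ffn_attention_substitute(x, W1, b1, W2, b2):
    """Hypothetical stand-in for an attention block: a shallow ReLU
    feed-forward net that maps the whole flattened token sequence to
    per-token outputs (a fixed sequence length is assumed)."""
    h = np.maximum(x.reshape(-1) @ W1 + b1, 0.0)   # hidden ReLU layer
    return (h @ W2 + b2).reshape(x.shape)          # back to (tokens, dim)
```

In a distillation setup such a network would be trained to mimic the outputs of the original attention layer on the same inputs.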
arXiv Detail & Related papers (2023-11-17T16:58:52Z)
- Investigating the Robustness and Properties of Detection Transformers (DETR) Toward Difficult Images [1.5727605363545245]
Transformer-based object detectors (DETR) have demonstrated strong performance across machine vision tasks.
The critical issue to be addressed is how this model architecture handles different image nuisances.
We study this issue by measuring the performance of DETR in different experiments and benchmarking the network.
arXiv Detail & Related papers (2023-10-12T23:38:52Z)
- Convolution and Attention Mixer for Synthetic Aperture Radar Image Change Detection [41.38587746899477]
Synthetic aperture radar (SAR) image change detection is a critical task and has received increasing attention in the remote sensing community.
Existing SAR change detection methods are mainly based on convolutional neural networks (CNNs).
We propose a convolution and attention mixer (CAMixer) to incorporate global attention.
arXiv Detail & Related papers (2023-09-21T12:28:23Z)
- SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers [18.803007408124156]
We propose SeqCo-DETR, a Sequence Consistency-based self-supervised method for object DEtection with TRansformers.
Our method achieves state-of-the-art results on MS COCO (45.8 AP) and PASCAL VOC (64.1 AP), demonstrating the effectiveness of our approach.
arXiv Detail & Related papers (2023-03-15T09:36:58Z)
- Joint Spatial-Temporal and Appearance Modeling with Transformer for Multiple Object Tracking [59.79252390626194]
We propose a novel solution named TransSTAM, which leverages Transformer to model both the appearance features of each object and the spatial-temporal relationships among objects.
The proposed method is evaluated on multiple public benchmarks including MOT16, MOT17, and MOT20, and it achieves a clear performance improvement in both IDF1 and HOTA.
arXiv Detail & Related papers (2022-05-31T01:19:18Z)
- DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation [50.08080424613603]
Long-range correlation is essential for accurate monocular depth estimation.
We propose to leverage the Transformer to model this global context with an effective attention mechanism.
Our proposed model, termed DepthFormer, surpasses state-of-the-art monocular depth estimation methods with prominent margins.
arXiv Detail & Related papers (2022-03-27T05:03:56Z)
- Oriented Object Detection with Transformer [51.634913687632604]
We implement Oriented Object DEtection with TRansformer (O2DETR) based on an end-to-end network.
We design a simple but highly efficient encoder for Transformer by replacing the attention mechanism with depthwise separable convolution.
Our O2DETR can serve as a new benchmark in the field of oriented object detection, achieving up to a 3.85 mAP improvement over Faster R-CNN and RetinaNet.
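The encoder trick mentioned above, replacing attention with depthwise separable convolution, can be illustrated in 1-D. This is a generic NumPy sketch of depthwise separable convolution, not O2DETR's actual encoder, and the shapes are assumptions:

```python
import numpy as np

def depthwise_separable_conv1d(x, depth_k, point_w):
    # x: (channels, length); depth_k: (channels, k), one filter per channel;
    # point_w: (channels_out, channels), the 1x1 pointwise mixing weights
    C, L = x.shape
    k = depth_k.shape[1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))          # "same" padding (odd k)
    depth_out = np.empty_like(x)
    for c in range(C):
        # depthwise step: each channel is filtered independently;
        # kernel is reversed so np.convolve computes cross-correlation
        depth_out[c] = np.convolve(xp[c], depth_k[c][::-1], mode="valid")[:L]
    # pointwise step: a 1x1 convolution mixes information across channels
    return point_w @ depth_out
```

Compared with full convolution, the depthwise and pointwise factorization cuts the parameter count, which is why it makes for a lightweight encoder substitute.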
arXiv Detail & Related papers (2021-06-06T14:57:17Z)
- DA-DETR: Domain Adaptive Detection Transformer with Information Fusion [53.25930448542148]
DA-DETR is a domain adaptive object detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain.
We introduce a novel CNN-Transformer Blender (CTBlender) that fuses the CNN features and Transformer features ingeniously for effective feature alignment and knowledge transfer across domains.
CTBlender employs the Transformer features to modulate the CNN features across multiple scales where the high-level semantic information and the low-level spatial information are fused for accurate object identification and localization.
arXiv Detail & Related papers (2021-03-31T13:55:56Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
- End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss that forces unique predictions via bipartite matching, and a transformer encoder-decoder architecture.
arXiv Detail & Related papers (2020-05-26T17:06:38Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.