DETR Doesn't Need Multi-Scale or Locality Design
- URL: http://arxiv.org/abs/2308.01904v1
- Date: Thu, 3 Aug 2023 17:59:04 GMT
- Title: DETR Doesn't Need Multi-Scale or Locality Design
- Authors: Yutong Lin, Yuhui Yuan, Zheng Zhang, Chen Li, Nanning Zheng, Han Hu
- Abstract summary: This paper presents an improved DETR detector that maintains a "plain" nature.
It uses a single-scale feature map and global cross-attention calculations without specific locality constraints.
We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints.
- Score: 69.56292005230185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper presents an improved DETR detector that maintains a "plain"
nature: using a single-scale feature map and global cross-attention
calculations without specific locality constraints, in contrast to previous
leading DETR-based detectors that reintroduce architectural inductive biases of
multi-scale and locality into the decoder. We show that two simple technologies
are surprisingly effective within a plain design to compensate for the lack of
multi-scale feature maps and locality constraints. The first is a box-to-pixel
relative position bias (BoxRPB) term added to the cross-attention formulation,
which effectively guides each query to attend to the corresponding object region while
also providing encoding flexibility. The second is masked image modeling
(MIM)-based backbone pre-training, which helps learn representations with
fine-grained localization ability and proves crucial for remedying the dependency
on multi-scale feature maps. By incorporating these technologies and recent
advancements in training and problem formulation, the improved "plain" DETR
showed exceptional improvements over the original DETR detector. By leveraging
the Objects365 dataset for pre-training, it achieved 63.9 mAP using a
Swin-L backbone, which is highly competitive with state-of-the-art detectors
that all heavily rely on multi-scale feature maps and region-based feature
extraction. Code is available at https://github.com/impiga/Plain-DETR .
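For a concrete picture of the BoxRPB idea, the sketch below adds a box-to-pixel relative position bias to the cross-attention logits of a single decoder layer. It is a minimal illustration under assumptions not stated in the abstract: normalized (x1, y1, x2, y2) boxes, a flattened single-scale feature map with known pixel coordinates, and one small MLP over the four box-to-pixel offsets; the official Plain-DETR code linked above may factorize the bias differently (e.g. along the x and y axes).

```python
import torch
import torch.nn as nn

class BoxRPBCrossAttention(nn.Module):
    """Single-scale cross-attention with a box-to-pixel relative position bias (sketch)."""

    def __init__(self, dim, num_heads):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Hypothetical MLP mapping the four box-to-pixel offsets to one bias per head.
        self.rpb_mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, num_heads))

    def forward(self, queries, boxes, feats, coords):
        # queries: (B, Q, C)   boxes: (B, Q, 4) as normalized (x1, y1, x2, y2)
        # feats:   (B, N, C)   coords: (N, 2) normalized pixel centers (x, y)
        B, Q, C = queries.shape
        # Offsets from every pixel to the box's left/top and right/bottom edges.
        delta = torch.cat([
            coords[None, None] - boxes[:, :, None, :2],   # (B, Q, N, 2)
            coords[None, None] - boxes[:, :, None, 2:],   # (B, Q, N, 2)
        ], dim=-1)                                         # (B, Q, N, 4)
        bias = self.rpb_mlp(delta).permute(0, 3, 1, 2)     # (B, heads, Q, N)

        def split(x):  # (B, L, C) -> (B, heads, L, C // heads)
            return x.view(B, -1, self.num_heads, C // self.num_heads).transpose(1, 2)

        q, k, v = split(self.q_proj(queries)), split(self.k_proj(feats)), split(self.v_proj(feats))
        # The bias steers each query's attention toward (and around) its own box.
        attn = (q @ k.transpose(-2, -1)) * self.scale + bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, Q, C)
        return out
```

In a full decoder, `boxes` would be each query's current box estimate, so the bias is recomputed layer by layer as the boxes are refined.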
Related papers
- Deep Homography Estimation for Visual Place Recognition [49.235432979736395]
We propose a transformer-based deep homography estimation (DHE) network.
It takes the dense feature map extracted by a backbone network as input and fits a homography for fast and learnable geometric verification.
Experiments on benchmark datasets show that our method can outperform several state-of-the-art methods.
arXiv Detail & Related papers (2024-02-25T13:22:17Z)
- Unleash the Potential of Image Branch for Cross-modal 3D Object Detection [67.94357336206136]
We present a new cross-modal 3D object detector, namely UPIDet, which aims to unleash the potential of the image branch from two aspects.
First, UPIDet introduces a new 2D auxiliary task called normalized local coordinate map estimation.
Second, we discover that the representational capability of the point cloud backbone can be enhanced through the gradients backpropagated from the training objectives of the image branch.
arXiv Detail & Related papers (2023-01-22T08:26:58Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer using only image-level labels.
We propose a novel transformer-based framework, termed LCTR, which aims to enhance the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
- Focus on Local: Detecting Lane Marker from Bottom Up via Key Point [10.617793053931964]
We propose a novel lane marker detection solution, FOLOLane, that focuses on modeling local patterns and achieving prediction of global structures.
Specifically, the CNN models low-complexity local patterns with two separate heads: the first predicts the existence of key points, and the second refines their locations within a local range and associates key points belonging to the same lane line.
arXiv Detail & Related papers (2021-05-28T08:59:14Z)
- PGL: Prior-Guided Local Self-supervised Learning for 3D Medical Image Segmentation [87.50205728818601]
We propose a Prior-Guided Local (PGL) self-supervised model that learns the region-wise local consistency in the latent feature space.
Our PGL model learns the distinctive representations of local regions, and hence is able to retain structural information.
arXiv Detail & Related papers (2020-11-25T11:03:11Z)
- Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture.
We aim to learn pedestrian representations based on object center and scale rather than direct bounding box predictions.
Results show our method's effectiveness in detecting small-scale pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
- ASLFeat: Learning Local Features of Accurate Shape and Localization [42.70030492742363]
We present ASLFeat, with three lightweight yet effective modifications to mitigate the above issues.
First, we resort to deformable convolutional networks to densely estimate and apply local transformations.
Second, we take advantage of the inherent feature hierarchy to restore spatial resolution and low-level details for accurate keypoint localization.
arXiv Detail & Related papers (2020-03-23T04:03:03Z)
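The deformable-convolution step mentioned for ASLFeat can be sketched with torchvision's DeformConv2d: a plain convolution predicts per-tap sampling offsets from the features, and the deformable convolution samples at those shifted locations so the effective receptive field adapts to local shape. This is a minimal sketch under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableBlock(nn.Module):
    """Feature block whose sampling grid is predicted from the input (illustrative)."""

    def __init__(self, channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # Two offsets (dx, dy) per kernel tap, predicted from the input features.
        self.offset_pred = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        offsets = self.offset_pred(x)          # (B, 2*K*K, H, W)
        return self.deform_conv(x, offsets)    # convolution with locally adapted sampling

# Usage on a dummy feature map:
feat = torch.randn(1, 64, 32, 32)
out = DeformableBlock(64)(feat)
print(out.shape)  # torch.Size([1, 64, 32, 32])
```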
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this information and is not responsible for any consequences of its use.