Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss
- URL: http://arxiv.org/abs/2304.07527v2
- Date: Mon, 23 Dec 2024 11:30:51 GMT
- Title: Align-DETR: Enhancing End-to-end Object Detection with Aligned Loss
- Authors: Zhi Cai, Songtao Liu, Guodong Wang, Zheng Ge, Xiangyu Zhang, Di Huang
- Abstract summary: This paper identifies two key forms of misalignment within the model.
We introduce a novel loss function, termed Align Loss, to resolve the discrepancy between the two tasks.
Our method achieves a 49.3% (+0.6) AP on the H-DETR baseline with the ResNet-50 backbone.
- Score: 35.11300328598727
- Abstract: DETR has set up a simple end-to-end pipeline for object detection by formulating this task as a set prediction problem, showing promising potential. Despite its notable advancements, this paper identifies two key forms of misalignment within the model: classification-regression misalignment and cross-layer target misalignment. Both issues impede DETR's convergence and degrade its overall performance. To tackle both issues simultaneously, we introduce a novel loss function, termed Align Loss, designed to resolve the discrepancy between the two tasks. Align Loss guides the optimization of DETR through a joint quality metric, strengthening the connection between classification and regression. Furthermore, it incorporates an exponential down-weighting term to facilitate a smooth transition from positive to negative samples. Align-DETR also employs many-to-one matching for supervision of intermediate layers, akin to the design of H-DETR, which enhances robustness against instability. We conducted extensive experiments, yielding highly competitive results. Notably, our method achieves a 49.3% (+0.6) AP on the H-DETR baseline with the ResNet-50 backbone. It also sets a new state-of-the-art performance, reaching 50.5% AP in the 1x setting and 51.7% AP in the 2x setting, surpassing several strong competitors. Our code is available at https://github.com/FelixCaae/AlignDETR.
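To make the abstract's loss design concrete, here is a minimal PyTorch-style sketch of a joint quality target with an exponential down-weighting term. It is only a reading of the abstract, not the released implementation: the TOOD-style metric t = s^alpha * u^(1-alpha), the parameter names (alpha, tau), and the rank-based weighting are all assumptions; see the linked repository for the actual code.

```python
import torch
import torch.nn.functional as F

def align_loss_sketch(cls_logits, ious, pos_mask, ranks, alpha=0.25, tau=1.5):
    """Hedged sketch of a joint-quality classification loss (single-class case).

    cls_logits: (N,) classification logits of candidate queries
    ious:       (N,) IoU of each query's box with its assigned ground truth
                (0 for negatives)
    pos_mask:   (N,) bool, True for matched (positive) queries
    ranks:      (N,) 0-based rank of each positive among the candidates matched
                to the same ground truth (relevant under many-to-one matching)
    """
    s = cls_logits.sigmoid()
    # Joint quality metric coupling classification confidence and localization,
    # in the style of TOOD's task-aligned metric (an assumption here).
    t = (s ** alpha) * (ious ** (1.0 - alpha))
    # Exponential down-weighting: extra (lower-ranked) positives get smaller
    # targets, smoothing the transition from positive to negative samples.
    t = t * torch.exp(-ranks.float() / tau)
    target = torch.where(pos_mask, t, torch.zeros_like(t))
    # BCE against the soft target; detach so it acts as a fixed label.
    loss = F.binary_cross_entropy_with_logits(cls_logits, target.detach(),
                                              reduction="none")
    return loss.sum() / pos_mask.sum().clamp(min=1)
```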
Related papers
- Relation DETR: Exploring Explicit Position Relation Prior for Object Detection [26.03892270020559]
We present a scheme for enhancing the convergence and performance of DETR (DEtection TRansformer).
Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement.
Experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-07-16T13:17:07Z)
- Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement [19.277560848076984]
Two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects.
We propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries.
The proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, +4.4% AP on three challenging task-specific detection datasets.
arXiv Detail & Related papers (2024-03-24T13:01:57Z)
- Theoretically Achieving Continuous Representation of Oriented Bounding Boxes [64.15627958879053]
This paper endeavors to completely solve the issue of discontinuity in Oriented Bounding Box representation.
We propose a novel representation method called Continuous OBB (COBB) which can be readily integrated into existing detectors.
For fairness and transparency of experiments, we have developed a modularized benchmark for oriented object detection (OOD) evaluation, based on JDet, the detection toolbox of the open-source deep learning framework Jittor.
arXiv Detail & Related papers (2024-02-29T09:27:40Z)
- End-to-End Lane detection with One-to-Several Transformer [6.79236957488334]
O2SFormer converges 12.5x faster than DETR for the ResNet18 backbone.
O2SFormer with ResNet50 backbone achieves 77.83% F1 score on CULane dataset, outperforming existing Transformer-based and CNN-based detectors.
arXiv Detail & Related papers (2023-05-01T06:07:11Z)
- Detection Transformer with Stable Matching [48.963171068785435]
We show that the most important design is to use and only use positional metrics to supervise classification scores of positive examples.
Under this principle, we propose two simple yet effective modifications by integrating positional metrics into DETR's classification loss and matching cost (see the sketch after this list).
We achieve 50.4 and 51.5 AP on the COCO detection benchmark with a ResNet-50 backbone under the 12-epoch and 24-epoch training settings.
arXiv Detail & Related papers (2023-04-10T17:55:37Z)
- DETRs with Hybrid Matching [21.63116788914251]
One-to-one set matching is a key design for DETR to establish its end-to-end capability.
We propose a hybrid matching scheme that combines the original one-to-one matching branch with an auxiliary one-to-many matching branch during training.
arXiv Detail & Related papers (2022-07-26T17:52:14Z)
- Accelerating DETR Convergence via Semantic-Aligned Matching [50.3633635846255]
This paper presents SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy.
It explicitly searches for salient points with the most discriminative features for semantic-aligned matching, which further speeds up convergence and boosts detection accuracy.
arXiv Detail & Related papers (2022-03-14T06:50:51Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is built on many conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
- Higher Performance Visual Tracking with Dual-Modal Localization [106.91097443275035]
Visual Object Tracking (VOT) has synchronous needs for both robustness and accuracy.
We propose a dual-modal framework for target localization, consisting of robust localization that suppresses distractors via ONR and accurate localization that attends precisely to the target center via OFC.
arXiv Detail & Related papers (2021-03-18T08:47:56Z)
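As a side note on the Detection Transformer with Stable Matching entry above, a minimal sketch of its stated principle, supervising the classification scores of positives with a positional metric only, might look as follows; the single-class setup and the use of IoU as the soft target are assumptions drawn from the summary, not that paper's code.

```python
import torch
import torch.nn.functional as F

def position_supervised_cls_loss(cls_logits, ious, pos_mask):
    # Positives are supervised purely by a positional metric: the classification
    # score is pushed toward the IoU between the predicted and matched boxes.
    # Negatives keep the usual zero target.
    target = torch.where(pos_mask, ious, torch.zeros_like(ious))
    return F.binary_cross_entropy_with_logits(cls_logits, target.detach(),
                                              reduction="mean")
```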