RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision
- URL: http://arxiv.org/abs/2409.08475v1
- Date: Fri, 13 Sep 2024 02:02:07 GMT
- Title: RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision
- Authors: Shuo Wang, Chunlong Xia, Feng Lv, Yifeng Shi,
- Abstract summary: We propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3.
To address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation.
RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series.
- Score: 7.721101317599364
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: RT-DETR is the first real-time end-to-end transformer-based object detector. Its efficiency comes from the framework design and the Hungarian matching. However, compared to dense supervision detectors like the YOLO series, the Hungarian matching provides much sparser supervision, leading to insufficient model training and making optimal results difficult to achieve. To address these issues, we propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3. First, we introduce a CNN-based auxiliary branch that provides dense supervision and collaborates with the original decoder to enhance the encoder feature representation. Second, to address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation. This strategy diversifies label assignment for positive samples across multiple query groups, thereby enriching positive supervision. Additionally, we introduce a shared-weight decoder branch for dense positive supervision to ensure that more high-quality queries match each ground truth. Notably, all of the aforementioned modules are training-only. We conduct extensive experiments on COCO val2017 to demonstrate the effectiveness of our approach. RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series. For example, RT-DETRv3-R18 achieves 48.1% AP (+1.6%/+1.4% compared to RT-DETR-R18/RT-DETRv2-R18) while maintaining the same latency. Meanwhile, it requires only half the epochs to attain comparable performance. Furthermore, RT-DETRv3-R101 attains 54.6% AP, outperforming YOLOv10-X. Code will be released soon.
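The contrast the abstract draws between sparse one-to-one supervision and dense supervision can be illustrated with a toy example. The sketch below is not the paper's assigner: it assumes a cost of 1 - IoU only (DETR-style matching costs normally also include classification and box-distance terms), finds the minimum-cost one-to-one assignment by brute force (the result the Hungarian algorithm computes efficiently), and uses a plain IoU threshold as a stand-in for a dense, one-to-many assigner. All function names and the example boxes are hypothetical.

```python
from itertools import permutations


def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def one_to_one_match(preds, gts):
    """Minimum-cost one-to-one assignment with cost = 1 - IoU (assumed cost).

    Brute force over permutations, so only suitable for tiny inputs;
    requires len(preds) >= len(gts). Returns {gt_index: pred_index}.
    """
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(1.0 - iou(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return dict(enumerate(best))


def one_to_many_match(preds, gts, iou_thresh=0.5):
    """Simplified dense assignment: every prediction whose IoU with a
    ground truth reaches the threshold counts as a positive for it."""
    return {
        g: [p for p, box in enumerate(preds) if iou(box, gt) >= iou_thresh]
        for g, gt in enumerate(gts)
    }


gts = [(0, 0, 10, 10), (20, 20, 30, 30)]
preds = [
    (0, 0, 10, 10),    # exact hit on ground truth 0
    (1, 1, 10, 10),    # close to ground truth 0
    (0, 0, 9, 10),     # close to ground truth 0
    (20, 20, 30, 31),  # close to ground truth 1
    (50, 50, 60, 60),  # background
]

o2o = one_to_one_match(preds, gts)
o2m = one_to_many_match(preds, gts)
n_pos_o2o = len(o2o)
n_pos_o2m = sum(len(v) for v in o2m.values())
print(o2o, n_pos_o2o)
print(o2m, n_pos_o2m)
```

In this toy setup the one-to-one assignment yields one positive per ground truth, while the threshold rule also marks the two near-duplicate predictions as positives, which is the sense in which dense assignment supplies more supervision signal per image.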
Related papers
- RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer [2.1186155813156926]
RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR.
To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales.
To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator.
arXiv Detail & Related papers (2024-07-24T10:20:19Z)
- DETRs Beat YOLOs on Real-time Object Detection [5.426236055184119]
YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy.
Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative that eliminates NMS.
In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector.
arXiv Detail & Related papers (2023-04-17T08:30:02Z)
- Towards End-to-end Semi-supervised Learning for One-stage Object Detection [88.56917845580594]
This paper focuses on the semi-supervised learning for the advanced and popular one-stage detection network YOLOv5.
We propose a novel teacher-student learning recipe called OneTeacher with two innovative designs, namely Multi-view Pseudo-label Refinement (MPR) and Decoupled Semi-supervised Optimization (DSO).
In particular, MPR improves the quality of pseudo-labels via augmented-view refinement and global-view filtering, and DSO handles the joint optimization conflicts via structure tweaks and task-specific pseudo-labeling.
arXiv Detail & Related papers (2023-02-22T11:35:40Z)
- RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmentation.
It achieves better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z)
- Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion [95.7732308775325]
The proposed DEtection TRansformer (DETR) has established a fully end-to-end paradigm for object detection.
DETR suffers from slow training convergence, which hinders its applicability to various detection tasks.
We design Semantic-Aligned-Matching DETR++ to accelerate DETR's convergence and improve detection performance.
arXiv Detail & Related papers (2022-07-28T15:34:29Z)
- Accelerating DETR Convergence via Semantic-Aligned Matching [50.3633635846255]
This paper presents SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy.
It explicitly searches salient points with the most discriminative features for semantic-aligned matching, which further speeds up the convergence and boosts detection accuracy as well.
arXiv Detail & Related papers (2022-03-14T06:50:51Z)
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
- Disentangle Your Dense Object Detector [82.22771433419727]
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding.
However, the current training pipeline for dense detectors is compromised by many conjunctions that may not hold.
We propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art detectors.
arXiv Detail & Related papers (2021-07-07T00:52:16Z)
This list is automatically generated from the titles and abstracts of the papers on this site.