DETRs Beat YOLOs on Real-time Object Detection
- URL: http://arxiv.org/abs/2304.08069v3
- Date: Wed, 3 Apr 2024 11:46:48 GMT
- Title: DETRs Beat YOLOs on Real-time Object Detection
- Authors: Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen
- Abstract summary: YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy.
Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative that eliminates NMS.
In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector.
- Score: 5.426236055184119
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However, we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative that eliminates NMS. Nevertheless, the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), to the best of our knowledge the first real-time end-to-end object detector that addresses the above dilemma. We build RT-DETR in two steps, drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed, followed by maintaining speed while improving accuracy. Specifically, we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then, we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder, thereby improving accuracy. In addition, RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on a T4 GPU, outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore, RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and is about 21 times faster in FPS. After pre-training with Objects365, RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR.
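To make the decoupling of intra-scale interaction and cross-scale fusion described above more concrete, here is a minimal PyTorch sketch; it is not the authors' implementation, and the module names, channel sizes, and conv-based fusion are illustrative assumptions. Self-attention runs only on the deepest feature map (intra-scale interaction), while cheap convolutions mix information across scales (cross-scale fusion).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridEncoderSketch(nn.Module):
    """Illustrative sketch of decoupled intra-scale interaction and
    cross-scale fusion (layer choices are assumptions, not RT-DETR's)."""

    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        # Intra-scale interaction: self-attention on ONE scale only
        # (the deepest, lowest-resolution map), which keeps the cost low.
        self.intra_scale = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, dim_feedforward=1024, batch_first=True)
        # Cross-scale fusion: lightweight 1x1 convolutions after concatenation.
        self.fuse = nn.ModuleList([nn.Conv2d(2 * dim, dim, 1) for _ in range(2)])

    def forward(self, feats):
        # feats: [s3, s4, s5], each (B, dim, H_i, W_i) with strides 8 / 16 / 32.
        s3, s4, s5 = feats
        b, c, h, w = s5.shape
        # 1) Intra-scale interaction on s5 only (flattened to a token sequence).
        tokens = s5.flatten(2).transpose(1, 2)                 # (B, H*W, dim)
        s5 = self.intra_scale(tokens).transpose(1, 2).reshape(b, c, h, w)
        # 2) Top-down cross-scale fusion: upsample, concatenate, 1x1 conv.
        s4 = self.fuse[0](torch.cat(
            [s4, F.interpolate(s5, size=s4.shape[-2:], mode="nearest")], dim=1))
        s3 = self.fuse[1](torch.cat(
            [s3, F.interpolate(s4, size=s3.shape[-2:], mode="nearest")], dim=1))
        return [s3, s4, s5]
```

Avoiding full self-attention over every scale is where the hybrid encoder's speed advantage comes from; the flexible speed tuning mentioned in the abstract then amounts to running only the first k of the trained decoder layers at inference time.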
Related papers
- RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision [7.721101317599364]
We propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3.
To address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation.
RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series.
arXiv Detail & Related papers (2024-09-13T02:02:07Z)
- RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer [2.1186155813156926]
RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR.
To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales.
To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator.
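As a rough, hypothetical illustration of what replacing `grid_sample` with a discrete sampling operator could look like (this is not RT-DETRv2's actual operator), the sketch below rounds sampling locations to the nearest pixel and gathers by index, avoiding an operator that some deployment runtimes support poorly:

```python
import torch
import torch.nn.functional as F

def discrete_sample(feat, points):
    """Nearest-pixel sampling as an illustrative stand-in for grid_sample.

    feat:   (B, C, H, W) feature map
    points: (B, N, 2) float sampling locations in absolute (x, y) pixel coords
    returns (B, C, N) sampled features
    """
    B, C, H, W = feat.shape
    x = points[..., 0].round().long().clamp(0, W - 1)           # (B, N)
    y = points[..., 1].round().long().clamp(0, H - 1)
    idx = y * W + x                                              # flat spatial index
    flat = feat.flatten(2)                                       # (B, C, H*W)
    return flat.gather(2, idx.unsqueeze(1).expand(-1, C, -1))

def bilinear_sample(feat, points):
    """Bilinear reference using grid_sample with coords normalized to [-1, 1]."""
    B, C, H, W = feat.shape
    grid = points.clone()
    grid[..., 0] = 2 * points[..., 0] / (W - 1) - 1
    grid[..., 1] = 2 * points[..., 1] / (H - 1) - 1
    out = F.grid_sample(feat, grid.unsqueeze(2), align_corners=True)   # (B, C, N, 1)
    return out.squeeze(-1)
```

The discrete variant trades a little sampling precision for plain integer indexing, which is the practicality angle the summary points at.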
arXiv Detail & Related papers (2024-07-24T10:20:19Z)
- LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
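The "simple stack" lends itself to a compact sketch; the configuration below (layer counts, dimensions, 80-class head) is a placeholder, not LW-DETR's actual recipe:

```python
import torch.nn as nn

class LWDETRSketch(nn.Module):
    """Illustrative ViT encoder -> projector -> shallow DETR decoder stack."""

    def __init__(self, dim=256, num_queries=100, decoder_layers=3):
        super().__init__()
        # Plain ViT-style encoder over patch tokens (placeholder config).
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=384, nhead=6, batch_first=True),
            num_layers=6)
        # Projector maps encoder tokens to the decoder's hidden size.
        self.projector = nn.Linear(384, dim)
        # Shallow DETR decoder driven by learned object queries.
        self.queries = nn.Embedding(num_queries, dim)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True),
            num_layers=decoder_layers)
        self.class_head = nn.Linear(dim, 80)    # class logits (placeholder count)
        self.box_head = nn.Linear(dim, 4)       # normalized (cx, cy, w, h)

    def forward(self, patch_tokens):             # (B, N_patches, 384)
        memory = self.projector(self.encoder(patch_tokens))
        queries = self.queries.weight.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        hs = self.decoder(queries, memory)        # (B, num_queries, dim)
        return self.class_head(hs), self.box_head(hs).sigmoid()
```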
arXiv Detail & Related papers (2024-06-05T17:07:24Z)
- RCS-YOLO: A Fast and High-Accuracy Object Detector for Brain Tumor Detection [7.798672884591179]
We propose RCS-YOLO, a novel YOLO architecture based on channel shuffle.
Experimental results on the brain tumor dataset Br35H show that the proposed model surpasses YOLOv6, YOLOv7, and YOLOv8 in speed and accuracy.
Our proposed RCS-YOLO achieves state-of-the-art performance on the brain tumor detection task.
arXiv Detail & Related papers (2023-07-31T05:38:17Z)
- EdgeYOLO: An Edge-Real-Time Object Detector [69.41688769991482]
This paper proposes an efficient, low-complexity and anchor-free object detector based on the state-of-the-art YOLO framework.
We develop an enhanced data augmentation method to effectively suppress overfitting during training, and design a hybrid random loss function to improve the detection accuracy of small objects.
Our baseline model reaches 50.6% AP50:95 and 69.8% AP50 on the MS COCO 2017 dataset and 26.4% AP50:95 and 44.8% AP50 on the VisDrone 2019-DET dataset, while meeting real-time requirements (FPS >= 30) on an Nvidia edge-computing device.
arXiv Detail & Related papers (2023-02-15T06:05:14Z)
- A lightweight and accurate YOLO-like network for small target detection in Aerial Imagery [94.78943497436492]
We present YOLO-S, a simple, fast and efficient network for small target detection.
YOLO-S exploits a small Darknet20-based feature extractor together with skip connections via both bypass and concatenation (a toy sketch follows below).
YOLO-S has 87% fewer parameters and almost half the FLOPs of YOLOv3, making deployment practical for low-power industrial applications.
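The toy PyTorch block below shows the two kinds of skip connection mentioned above: an additive bypass inside the block and a concatenation skip across it. Channel sizes and activations are illustrative, not YOLO-S's actual layout:

```python
import torch
import torch.nn as nn

class BypassConcatBlock(nn.Module):
    """Toy block combining an additive bypass with a concatenation skip."""

    def __init__(self, channels=128):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.act = nn.LeakyReLU(0.1)

    def forward(self, x):
        y = self.act(self.conv1(x))
        y = self.act(self.conv2(y) + x)       # bypass: element-wise addition
        return torch.cat([x, y], dim=1)        # concatenation skip (2x channels out)
```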
arXiv Detail & Related papers (2022-04-05T16:29:49Z)
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, REGO employs a multi-stage recurrent processing structure that helps DETR's attention gradually focus on foreground objects (a loose sketch follows below).
REGO consistently boosts the performance of different DETR detectors by up to a 7% relative gain under the same 50-epoch training setting.
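Below is a loose sketch of a multi-stage "glimpse" refinement loop; the RoI-based glimpse extraction and attention are hypothetical simplifications intended only to make the recurrent structure concrete, not REGO's actual design:

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class GlimpseRefinerSketch(nn.Module):
    """Each stage crops features around the previous boxes ("glimpses") and
    uses them to refine decoder hidden states and box predictions."""

    def __init__(self, dim=256, stages=3):
        super().__init__()
        self.stages = stages
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.box_delta = nn.Linear(dim, 4)

    def forward(self, feat, hidden, boxes):
        # feat:   (B, dim, H, W) encoder feature map
        # hidden: (B, N, dim) decoder output embeddings
        # boxes:  (B, N, 4) boxes in absolute (x1, y1, x2, y2) feature-map coords
        B, N, _ = hidden.shape
        batch_idx = torch.arange(B, device=feat.device, dtype=feat.dtype)
        batch_idx = batch_idx.repeat_interleave(N).unsqueeze(1)           # (B*N, 1)
        for _ in range(self.stages):
            rois = torch.cat([batch_idx, boxes.reshape(-1, 4)], dim=1)    # (B*N, 5)
            glimpse = roi_align(feat, rois, output_size=(7, 7))           # (B*N, dim, 7, 7)
            glimpse = glimpse.flatten(2).mean(-1).reshape(B, N, -1)       # pooled glimpses
            hidden, _ = self.attn(hidden, glimpse, glimpse)               # attend to glimpses
            boxes = boxes + self.box_delta(hidden)                        # residual box update
        return hidden, boxes
```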
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA speeds up DETR's convergence by replacing the original co-attention mechanism in the decoder with a spatially modulated variant (sketched below).
Our fully-fledged SMCA can achieve better performance compared to DETR with a dilated convolution-based backbone.
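To make "spatially modulated co-attention" less abstract, the following sketch shows the core idea as commonly described for SMCA, with an illustrative parameterization: decoder cross-attention logits are biased by a Gaussian-like prior centered at each query's predicted location, so far-away positions are suppressed before the softmax.

```python
import torch

def spatially_modulated_logits(attn_logits, centers, scales, coords):
    """Add a Gaussian-like spatial prior to decoder cross-attention logits.

    attn_logits: (B, N_queries, H*W) raw co-attention logits
    centers:     (B, N_queries, 2) predicted (x, y) centers, normalized to [0, 1]
    scales:      (B, N_queries, 2) predicted spatial extents (sigma_x, sigma_y)
    coords:      (H*W, 2) normalized (x, y) coordinates of the feature map
    """
    # Squared distance from each query's center to every spatial location.
    diff = coords[None, None, :, :] - centers[:, :, None, :]              # (B, N, HW, 2)
    prior = -(diff ** 2 / (2 * scales[:, :, None, :] ** 2)).sum(-1)       # (B, N, HW)
    # Adding a log-space prior down-weights locations far from the center.
    return attn_logits + prior
```

The modulated logits then pass through the usual softmax, so attention concentrates near each query's estimated object location, which is the mechanism behind the faster convergence claimed above.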
arXiv Detail & Related papers (2021-01-19T03:52:44Z)
- FastEmit: Low-latency Streaming ASR with Sequence-level Emission Regularization [78.46088089185156]
Streaming automatic speech recognition (ASR) aims to emit each hypothesized word as quickly and accurately as possible.
Existing approaches penalize emission delay by manipulating per-token or per-frame probability prediction in sequence transducer models.
We propose a sequence-level emission regularization method, named FastEmit, that applies latency regularization directly on per-sequence probability in training transducer models.
arXiv Detail & Related papers (2020-10-21T17:05:01Z)
This list is automatically generated from the titles and abstracts of the papers on this site.