DEIM: DETR with Improved Matching for Fast Convergence
- URL: http://arxiv.org/abs/2412.04234v1
- Date: Thu, 05 Dec 2024 15:10:13 GMT
- Title: DEIM: DETR with Improved Matching for Fast Convergence
- Authors: Shihua Huang, Zhichao Lu, Xiaodong Cun, Yongjun Yu, Xiao Zhou, Xi Shen
- Abstract summary: We introduce DEIM, a training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR).
To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy.
While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance.
- Score: 28.24665757155962
- Abstract: We introduce DEIM, an innovative and efficient training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR). To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy. This approach increases the number of positive samples per image by incorporating additional targets, using standard data augmentation techniques. While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance. To address this, we propose the Matchability-Aware Loss (MAL), a novel loss function that optimizes matches across various quality levels, enhancing the effectiveness of Dense O2O. Extensive experiments on the COCO dataset validate the efficacy of DEIM. When integrated with RT-DETR and D-FINE, it consistently boosts performance while reducing training time by 50%. Notably, paired with RT-DETRv2, DEIM achieves 53.2% AP in a single day of training on an NVIDIA 4090 GPU. Additionally, DEIM-trained real-time models outperform leading real-time object detectors, with DEIM-D-FINE-L and DEIM-D-FINE-X achieving 54.7% and 56.5% AP at 124 and 78 FPS on an NVIDIA T4 GPU, respectively, without the need for additional data. We believe DEIM sets a new baseline for advancements in real-time object detection. Our code and pre-trained models are available at https://github.com/ShihuaHuang95/DEIM.
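Dense O2O keeps DETR's one-to-one matching rule but raises the number of targets per image through standard augmentations such as mosaic, so more queries receive positive supervision at each step. Below is a minimal, illustrative sketch of that idea; the `dataset` interface, the box format, and the output size are assumptions for illustration, not the DEIM implementation.

```python
import random
from PIL import Image

def mosaic_sample(dataset, out_size=640):
    """Compose four images into a 2x2 mosaic so one training sample
    carries roughly 4x as many ground-truth boxes. Under one-to-one
    (O2O) matching, more targets per image means more positive
    queries supervised per step -- the idea behind Dense O2O."""
    half = out_size // 2
    canvas = Image.new("RGB", (out_size, out_size))
    merged_boxes = []
    for (ox, oy) in [(0, 0), (half, 0), (0, half), (half, half)]:
        # dataset[i] is assumed to yield (PIL.Image, boxes), with boxes
        # as (x1, y1, x2, y2, label) tuples in absolute pixels.
        img, boxes = dataset[random.randrange(len(dataset))]
        sx, sy = half / img.width, half / img.height
        canvas.paste(img.resize((half, half)), (ox, oy))
        for (x1, y1, x2, y2, label) in boxes:
            merged_boxes.append(
                (x1 * sx + ox, y1 * sy + oy, x2 * sx + ox, y2 * sy + oy, label)
            )
    return canvas, merged_boxes
```

With four source images per sample, the O2O matcher assigns roughly four times as many positive queries without changing the matching rule itself.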
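Dense O2O also admits many low-quality matches, which MAL addresses by weighting the classification supervision by match quality (IoU). The sketch below shows an IoU-weighted, focal-style loss in the spirit of MAL; the exact formulation, the `gamma` value, and the tensor layout are assumptions, so treat it as a sketch rather than the paper's definition.

```python
import torch

def matchability_aware_loss(pred_logits, iou, pos_mask, gamma=1.5):
    """IoU-weighted, focal-style classification loss in the spirit of
    DEIM's MAL (the exact formulation is given in the paper).

    pred_logits: (N,) raw class logits, one per query
    iou:         (N,) IoU q of each prediction with its matched target
                 (0 for unmatched queries)
    pos_mask:    (N,) bool, True where a query is matched to a target
    """
    p = pred_logits.sigmoid()
    q = iou.clamp(0, 1)
    # Positives: scale the log-likelihood by q**gamma so poorly
    # localized matches (q -> 0) contribute little supervision.
    pos_loss = -(q ** gamma) * torch.log(p.clamp_min(1e-8))
    # Negatives: standard focal-style down-weighting of easy negatives.
    neg_loss = -(p ** gamma) * torch.log((1 - p).clamp_min(1e-8))
    return torch.where(pos_mask, pos_loss, neg_loss).mean()
```

Unlike a plain focal loss, the positive term tracks the match quality q, so supervision strengthens with localization quality instead of treating every matched query as an equally valid positive.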
Related papers
- YOLOv4: A Breakthrough in Real-Time Object Detection [0.0]
YOLOv4 achieves superior detection in diverse scenarios, attaining 43.5% AP at 65 frames per second on a Tesla V100.
arXiv Detail & Related papers (2025-02-06T15:45:18Z)
- D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement [37.78880948551719]
D-FINE is a powerful real-time object detector that achieves outstanding localization precision.
D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD).
When pretrained on Objects365, D-FINE-L / X attain 57.1% / 59.3% AP, surpassing all existing real-time detectors.
arXiv Detail & Related papers (2024-10-17T17:57:01Z)
- DEYOv3: DETR with YOLO for Real-time Object Detection [0.0]
We propose a new training method called step-by-step training.
In the first stage, the one-to-many pre-trained YOLO detector is used to initialize the end-to-end detector.
In the second stage, the backbone and encoder are consistent with the DETR-like model, but only the detector needs to be trained from scratch.
arXiv Detail & Related papers (2023-09-21T07:49:07Z)
- Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors [117.61449210940955]
We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level.
We introduce an approach to weight tokens based on motion gradients, thus shifting the focus from the static background scene to the foreground objects.
We generate synthetic abnormal events to augment the training videos, and task the masked AE model to reconstruct the original frames.
arXiv Detail & Related papers (2023-06-21T06:18:05Z)
- DETRs Beat YOLOs on Real-time Object Detection [5.426236055184119]
The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy.
Recently, end-to-end Transformer-based detectors (DETRs) have provided an alternative that eliminates the need for NMS.
In this paper, we propose the Real-Time DEtection TRansformer (RT-DETR), the first real-time end-to-end object detector.
arXiv Detail & Related papers (2023-04-17T08:30:02Z)
- EdgeYOLO: An Edge-Real-Time Object Detector [69.41688769991482]
This paper proposes an efficient, low-complexity and anchor-free object detector based on the state-of-the-art YOLO framework.
We develop an enhanced data augmentation method to effectively suppress overfitting during training, and design a hybrid random loss function to improve the detection accuracy of small objects.
Our baseline model reaches 50.6% AP50:95 and 69.8% AP50 on the MS COCO 2017 dataset, and 26.4% AP50:95 and 44.8% AP50 on the VisDrone 2019-DET dataset, while meeting real-time requirements (FPS >= 30) on the Nvidia Jetson AGX Xavier edge-computing device.
arXiv Detail & Related papers (2023-02-15T06:05:14Z)
- Distributed Adversarial Training to Robustify Deep Neural Networks at Scale [100.19539096465101]
Current deep neural networks (DNNs) are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can alter or manipulate the classification output.
To defend against such attacks, adversarial training (AT) has been shown to be an effective approach for improving model robustness.
We propose a large-batch adversarial training framework implemented over multiple machines.
arXiv Detail & Related papers (2022-06-13T15:39:43Z)
- EAutoDet: Efficient Architecture Search for Object Detection [110.99532343155073]
The EAutoDet framework can discover practical backbone and FPN architectures for object detection in 1.4 GPU-days.
We propose a kernel reusing technique by sharing the weights of candidate operations on one edge and consolidating them into one convolution.
In particular, the discovered architectures surpass state-of-the-art object detection NAS methods, achieving 40.1 mAP at 120 FPS and 49.2 mAP at 41.3 FPS on the COCO test-dev set.
arXiv Detail & Related papers (2022-03-21T05:56:12Z)
- Fast Convergence of DETR with Spatially Modulated Co-Attention [83.19863907905666]
We propose a simple yet effective scheme for improving the Detection Transformer framework, namely Spatially Modulated Co-Attention (SMCA) mechanism.
Our proposed SMCA increases DETR's convergence speed by replacing the original co-attention mechanism in the decoder with a spatially modulated one.
The fully-fledged SMCA achieves better performance than DETR with a dilated convolution-based backbone.
arXiv Detail & Related papers (2021-01-19T03:52:44Z)
- Dynamic R-CNN: Towards High Quality Object Detection via Dynamic Training [70.2914594796002]
We propose Dynamic R-CNN to adjust the label assignment criteria and the shape of the regression loss function during training.
Our method improves upon the ResNet-50-FPN baseline by 1.9% AP and 5.5% AP$_{90}$ on the MS COCO dataset with no extra overhead.
arXiv Detail & Related papers (2020-04-13T15:20:25Z)
This list is automatically generated from the titles and abstracts of the papers on this site.