DEYOv3: DETR with YOLO for Real-time Object Detection
- URL: http://arxiv.org/abs/2309.11851v2
- Date: Fri, 22 Sep 2023 15:25:30 GMT
- Title: DEYOv3: DETR with YOLO for Real-time Object Detection
- Authors: Haodong Ouyang
- Abstract summary: We propose a new training method called step-by-step training.
In the first stage, the one-to-many pre-trained YOLO detector is used to initialize the end-to-end detector.
In the second stage, the backbone and encoder are consistent with the DETR-like model, but only the detector needs to be trained from scratch.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, end-to-end object detectors have gained significant attention from
the research community due to their outstanding performance. However, DETR
typically relies on supervised pretraining of the backbone on ImageNet, which
limits the practical application of DETR and the design of the backbone,
affecting the model's potential generalization ability. In this paper, we
propose a new training method called step-by-step training. Specifically, in
the first stage, the one-to-many pre-trained YOLO detector is used to
initialize the end-to-end detector. In the second stage, the backbone and
encoder are consistent with the DETR-like model, but only the detector needs to
be trained from scratch. With this training method, the object detector does
not need an additional dataset (ImageNet) to train the backbone, which makes
the design of the backbone more flexible and dramatically reduces the training
cost of the detector, benefiting its practical application. At the same time,
step-by-step training achieves higher accuracy than the traditional training
method of DETR-like models. With the aid of this novel training method, we
propose a brand-new end-to-end real-time object detection model called DEYOv3.
DEYOv3-N achieves 41.1% AP on COCO val2017 and 270 FPS on a T4 GPU,
while DEYOv3-L achieves 51.3% AP and 102 FPS. Without the use of additional
training data, DEYOv3 surpasses all existing real-time object detectors in
terms of both speed and accuracy. Notably, for the N, S, and M scales,
training on the COCO dataset can be completed on a single 24 GB RTX 3090 GPU.
Code will be released at
https://github.com/ouyanghaodong/DEYOv3.
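As a concrete illustration, below is a minimal PyTorch sketch of the two-stage scheme, assuming placeholder modules (Backbone, Encoder, Decoder) rather than DEYOv3's actual architecture: stage one initializes the backbone and encoder from a YOLO detector pre-trained with one-to-many matching, and stage two freezes them and trains only the decoder from scratch.

```python
# A minimal sketch of the step-by-step (two-stage) training scheme, assuming
# placeholder modules rather than DEYOv3's real architecture.
import torch
import torch.nn as nn

class Backbone(nn.Module):                     # stand-in for a YOLO-style backbone
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 64, 3, stride=2, padding=1)
    def forward(self, x):
        return self.conv(x)

class Encoder(nn.Module):                      # stand-in for the neck/encoder
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(64, 64, 3, padding=1)
    def forward(self, x):
        return self.conv(x)

class Decoder(nn.Module):                      # DETR-like decoder, trained from scratch
    def __init__(self, num_queries=100, dim=64, num_classes=80):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes)
    def forward(self, feats):
        b = feats.shape[0]
        mem = feats.flatten(2).transpose(1, 2)          # (B, HW, C)
        q = self.queries.weight.unsqueeze(0).expand(b, -1, -1)
        out, _ = self.attn(q, mem, mem)
        return self.cls_head(out)                       # (B, queries, classes)

backbone, encoder, decoder = Backbone(), Encoder(), Decoder()

# Stage 1 (done beforehand): backbone and encoder weights come from a YOLO
# detector trained with one-to-many matching on the detection data itself,
# so no ImageNet pretraining is required, e.g. (hypothetical checkpoint):
# backbone.load_state_dict(torch.load("yolo_backbone.pt"))

# Stage 2: freeze the pre-trained parts and train only the decoder.
for p in list(backbone.parameters()) + list(encoder.parameters()):
    p.requires_grad = False
optimizer = torch.optim.AdamW(decoder.parameters(), lr=1e-4)

x = torch.randn(2, 3, 64, 64)                  # dummy image batch
logits = decoder(encoder(backbone(x)))         # (2, 100, 80)
loss = logits.sum()                            # placeholder for the real detection loss
loss.backward()
optimizer.step()
```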
Related papers
- Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection [52.66283064389691]
State-of-the-art 3D object detectors are often trained on massive labeled datasets.
Recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels.
We propose a shelf-supervised approach for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data.
arXiv Detail & Related papers (2024-06-14T15:21:57Z)
- DEYO: DETR with YOLO for End-to-End Object Detection [0.0]
We introduce DETR with YOLO (DEYO), the first real-time end-to-end object detection model that utilizes an encoder with a purely convolutional structure.
In the first stage of training, we employ a classic detector, pre-trained with a one-to-many matching strategy, to initialize the backbone and neck of the end-to-end detector.
In the second stage of training, we freeze the backbone and neck of the end-to-end detector and train the decoder from scratch.
arXiv Detail & Related papers (2024-02-26T07:48:19Z)
- 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking [15.330384668966806]
State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as Kalman Filter.
We propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture.
Our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively.
arXiv Detail & Related papers (2023-08-12T19:19:58Z) - YOLOBench: Benchmarking Efficient Object Detectors on Embedded Systems [0.0873811641236639]
We present YOLOBench, a benchmark comprised of 550+ YOLO-based object detection models on 4 different datasets and 4 different embedded hardware platforms.
We collect accuracy and latency numbers for a variety of YOLO-based one-stage detectors at different model scales by performing a fair, controlled comparison of these detectors with a fixed training environment.
We evaluate training-free accuracy estimators used in neural architecture search on YOLOBench and demonstrate that, while most state-of-the-art zero-cost accuracy estimators are outperformed by a simple baseline like MAC count, some of them can be effectively used to...
arXiv Detail & Related papers (2023-07-26T01:51:10Z)
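As a toy illustration of that MAC-count baseline (not YOLOBench's actual tooling), the sketch below ranks placeholder candidate models by Conv2d multiply-accumulate counts gathered with forward hooks:

```python
# Toy illustration of "MAC count as a zero-cost proxy": rank candidate models
# by multiply-accumulates instead of training them. The tiny models here are
# placeholders, not YOLOBench's actual search space.
import torch
import torch.nn as nn

def count_conv_macs(model: nn.Module, input_shape=(1, 3, 64, 64)) -> int:
    """Count multiply-accumulates in Conv2d layers via forward hooks."""
    macs = 0
    def hook(module, inputs, output):
        nonlocal macs
        k = module.kernel_size[0] * module.kernel_size[1]
        macs += output.numel() * module.in_channels * k // module.groups
    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, nn.Conv2d)]
    with torch.no_grad():
        model(torch.zeros(input_shape))
    for h in handles:
        h.remove()
    return macs

candidates = {                      # hypothetical candidate detectors
    "small": nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU()),
    "large": nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
                           nn.Conv2d(64, 64, 3, padding=1)),
}
ranked = sorted(candidates, key=lambda n: count_conv_macs(candidates[n]))
print(ranked)                       # models ordered by MACs, cheapest first
```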
- Weakly Supervised Monocular 3D Object Detection using Multi-View Projection and Direction Consistency [78.76508318592552]
Monocular 3D object detection has become a mainstream approach in autonomous driving due to its ease of application.
Most current methods still rely on 3D point cloud data for labeling the ground truths used in the training phase.
We propose a new weakly supervised monocular 3D object detection method, which can train the model with only 2D labels marked on images.
arXiv Detail & Related papers (2023-03-15T15:14:00Z)
- Generalized Few-Shot 3D Object Detection of LiDAR Point Cloud for Autonomous Driving [91.39625612027386]
We propose a novel task, called generalized few-shot 3D object detection, where we have a large amount of training data for common (base) objects, but only a few data for rare (novel) classes.
Specifically, we analyze in-depth differences between images and point clouds, and then present a practical principle for the few-shot setting in the 3D LiDAR dataset.
To solve this task, we propose an incremental fine-tuning method to extend existing 3D detection models to recognize both common and rare objects.
arXiv Detail & Related papers (2023-02-08T07:11:36Z)
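A hedged sketch of that incremental fine-tuning idea, with placeholder shapes and a plain linear classifier standing in for a 3D detection head: base-class weights are copied into an extended head, and only the few novel-class samples drive the update.

```python
# Toy sketch of incremental fine-tuning for a generalized few-shot setting.
# Names and shapes are placeholders, not the paper's actual detector.
import torch
import torch.nn as nn

feat_dim, n_base, n_novel = 128, 10, 3
base_head = nn.Linear(feat_dim, n_base)            # trained on abundant base data

# Extend the classifier with rows for the novel classes.
extended = nn.Linear(feat_dim, n_base + n_novel)
with torch.no_grad():
    extended.weight[:n_base] = base_head.weight    # copy base-class weights
    extended.bias[:n_base] = base_head.bias

# Fine-tune on the few novel samples; a small lr helps preserve base classes.
optimizer = torch.optim.SGD(extended.parameters(), lr=1e-3)
feats = torch.randn(4, feat_dim)                   # features of few-shot samples
labels = torch.randint(n_base, n_base + n_novel, (4,))
loss = nn.functional.cross_entropy(extended(feats), labels)
loss.backward()
optimizer.step()
```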
- MonoPCNS: Monocular 3D Object Detection via Point Cloud Network Simulation [16.237400933896886]
Existing leading methods tend to estimate the depth of the input image first, and detect the 3D object based on point cloud.
MonoPCNS is proposed to make the monocular detector simulate the feature learning behavior of a point-cloud-based detector during training.
Our method consistently improves the performance of different monocular detectors by a large margin without changing their network architectures.
arXiv Detail & Related papers (2022-08-19T16:57:11Z)
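Roughly, the simulation idea can be pictured as an auxiliary feature-mimicking loss; the modules and the BEV input below are assumptions for illustration, not MonoPCNS's actual design.

```python
# Rough sketch of feature simulation: push a monocular detector's intermediate
# features toward those of a frozen point-cloud-based teacher during training.
import torch
import torch.nn as nn

mono_backbone = nn.Conv2d(3, 64, 3, padding=1)     # stand-in monocular branch
pc_backbone = nn.Conv2d(1, 64, 3, padding=1)       # stand-in point-cloud branch (e.g. BEV)
for p in pc_backbone.parameters():                 # teacher is frozen
    p.requires_grad = False

image = torch.randn(2, 3, 32, 32)
bev = torch.randn(2, 1, 32, 32)                    # pseudo BEV input for the teacher

mono_feat = mono_backbone(image)
with torch.no_grad():
    pc_feat = pc_backbone(bev)

# Auxiliary simulation loss, added to the usual detection loss in practice.
sim_loss = nn.functional.mse_loss(mono_feat, pc_feat)
sim_loss.backward()   # gradients flow only into the monocular branch
```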
- Delving into the Pre-training Paradigm of Monocular 3D Object Detection [10.07932482761621]
We study the pre-training paradigm for monocular 3D object detection (M3OD).
We propose several strategies to further improve this baseline, which mainly include target guided semi-dense depth estimation, keypoint-aware 2D object detection, and class-level loss adjustment.
Combining all the developed techniques, the obtained pre-training framework produces pre-trained backbones that improve M3OD performance significantly on the KITTI-3D and nuScenes benchmarks.
arXiv Detail & Related papers (2022-06-08T03:01:13Z)
- A Lightweight and Detector-free 3D Single Object Tracker on Point Clouds [50.54083964183614]
It is non-trivial to perform accurate target-specific detection since the point cloud of objects in raw LiDAR scans is usually sparse and incomplete.
We propose DMT, a Detector-free Motion prediction based 3D Tracking network that totally removes the usage of complicated 3D detectors.
arXiv Detail & Related papers (2022-03-08T17:49:07Z)
- Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
- 2nd Place Scheme on Action Recognition Track of ECCV 2020 VIPriors Challenges: An Efficient Optical Flow Stream Guided Framework [57.847010327319964]
We propose a data-efficient framework that can train the model from scratch on small datasets.
Specifically, by introducing a 3D central difference convolution operation, we propose a novel C3D neural network-based two-stream framework.
We show that our method achieves promising results even without a model pre-trained on large-scale datasets.
arXiv Detail & Related papers (2020-08-10T09:50:28Z)
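As a rough sketch of the operation that entry builds on, below is a 3D central difference convolution following the common CDC formulation (vanilla convolution minus a theta-weighted center term); the challenge entry's exact variant may differ.

```python
# Hedged sketch of a 3D central difference convolution (CDC), following the
# usual implementation: y = conv(x) - theta * (kernel-sum applied as a 1x1x1
# conv), which is equivalent to convolving central differences.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CentralDifferenceConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta
    def forward(self, x):
        out = self.conv(x)                                  # vanilla 3D convolution
        # Central-difference term: sum the kernel over its spatial extent and
        # apply it as a 1x1x1 convolution on the center voxel.
        kernel_sum = self.conv.weight.sum(dim=(2, 3, 4))    # (out_ch, in_ch)
        center = F.conv3d(x, kernel_sum[:, :, None, None, None])
        return out - self.theta * center

x = torch.randn(1, 3, 8, 32, 32)         # (B, C, T, H, W) video clip
y = CentralDifferenceConv3d(3, 16)(x)
print(y.shape)                           # torch.Size([1, 16, 8, 32, 32])
```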