DEYO: DETR with YOLO for End-to-End Object Detection
- URL: http://arxiv.org/abs/2402.16370v1
- Date: Mon, 26 Feb 2024 07:48:19 GMT
- Title: DEYO: DETR with YOLO for End-to-End Object Detection
- Authors: Haodong Ouyang
- Abstract summary: We introduce DETR with YOLO (DEYO), the first real-time end-to-end object detection model that utilizes an encoder with a purely convolutional structure.
In the first stage of training, we employ a classic detector, pre-trained with a one-to-many matching strategy, to initialize the backbone and neck of the end-to-end detector.
In the second stage of training, we freeze the backbone and neck of the end-to-end detector and train the decoder from scratch.
- Score: 0.0
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The training paradigm of DETRs is heavily contingent upon pre-training their
backbone on the ImageNet dataset. However, the limited supervisory signals
provided by the image classification task and one-to-one matching strategy
result in an inadequately pre-trained neck for DETRs. Additionally, the
instability of matching in the early stages of training engenders
inconsistencies in the optimization objectives of DETRs. To address these
issues, we have devised an innovative training methodology termed step-by-step
training. Specifically, in the first stage of training, we employ a classic
detector, pre-trained with a one-to-many matching strategy, to initialize the
backbone and neck of the end-to-end detector. In the second stage of training,
we freeze the backbone and neck of the end-to-end detector and train the
decoder from scratch. Through the application of step-by-step
training, we have introduced the first real-time end-to-end object detection
model that utilizes an encoder with a purely convolutional structure, DETR with YOLO
(DEYO). Without reliance on any supplementary training data, DEYO surpasses all
existing real-time object detectors in both speed and accuracy. Moreover, the
comprehensive DEYO series can complete its second-phase training on the COCO
dataset using a single 8GB RTX 4060 GPU, significantly reducing the training
expenditure. Source code and pre-trained models are available at
https://github.com/ouyanghaodong/DEYO.
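
The two-stage procedure described above can be sketched in plain Python. This is a minimal illustration of the control flow only, not DEYO's actual implementation; all class, function, and attribute names here are hypothetical stand-ins for the real framework code.

```python
# Sketch of DEYO's "step-by-step training": stage 1 initializes the
# backbone and neck from a one-to-many pre-trained classic detector
# (e.g. YOLO); stage 2 freezes them and trains only the decoder from
# scratch. Names below are illustrative, not from the DEYO codebase.

class Module:
    """Toy stand-in for a network component with trainable parameters."""
    def __init__(self, name):
        self.name = name
        self.trainable = True          # analogous to requires_grad
        self.weights = "random-init"

def stage_one_init(backbone, neck):
    # Stage 1: a classic detector pre-trained with one-to-many matching
    # supplies richer supervisory signals, so its backbone and neck
    # weights initialize the end-to-end detector.
    backbone.weights = "yolo-pretrained"
    neck.weights = "yolo-pretrained"

def stage_two_freeze(backbone, neck, decoder):
    # Stage 2: freeze backbone and neck; the decoder alone is trained
    # from scratch under the one-to-one matching objective.
    backbone.trainable = False
    neck.trainable = False
    decoder.trainable = True
    decoder.weights = "random-init"
    # Only trainable modules would be handed to the optimizer.
    return [m for m in (backbone, neck, decoder) if m.trainable]

backbone, neck, decoder = Module("backbone"), Module("neck"), Module("decoder")
stage_one_init(backbone, neck)
trainable = stage_two_freeze(backbone, neck, decoder)
print([m.name for m in trainable])  # → ['decoder']
```

Because the frozen backbone and neck produce no gradients, only the decoder's parameters occupy optimizer state, which is consistent with the paper's claim that second-phase training fits on a single 8GB GPU.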
Related papers
- Truncated Consistency Models [57.50243901368328]
Training consistency models requires learning to map all intermediate points along PF ODE trajectories to their corresponding endpoints.
We empirically find that this training paradigm limits the one-step generation performance of consistency models.
We propose a new parameterization of the consistency function and a two-stage training procedure that prevents the truncated-time training from collapsing to a trivial solution.
arXiv Detail & Related papers (2024-10-18T22:38:08Z)
- DEYOv3: DETR with YOLO for Real-time Object Detection [0.0]
We propose a new training method called step-by-step training.
In the first stage, the one-to-many pre-trained YOLO detector is used to initialize the end-to-end detector.
In the second stage, the backbone and encoder are consistent with the DETR-like model, but only the detector needs to be trained from scratch.
arXiv Detail & Related papers (2023-09-21T07:49:07Z)
- AlignDet: Aligning Pre-training and Fine-tuning in Object Detection [38.256555424079664]
AlignDet is a unified pre-training framework that can be adapted to various existing detectors to alleviate the discrepancies.
It can achieve significant improvements across diverse protocols, such as detection algorithm, model backbone, data setting, and training schedule.
arXiv Detail & Related papers (2023-07-20T17:55:14Z)
- Focusing on what to decode and what to train: Efficient Training with HOI Split Decoders and Specific Target Guided DeNoising [17.268302302974607]
Recent one-stage transformer-based methods achieve notable gains in the Human-Object Interaction (HOI) detection task by leveraging the detection capabilities of DETR.
We propose a novel one-stage framework (SOV) which consists of a subject decoder, an object decoder, and a verb decoder.
We propose a novel Specific Target Guided (STG) DeNoising training strategy, which leverages learnable object and verb label embeddings to guide the training and accelerate the training convergence.
arXiv Detail & Related papers (2023-07-05T13:42:31Z)
- Architecture, Dataset and Model-Scale Agnostic Data-free Meta-Learning [119.70303730341938]
We propose ePisode cUrriculum inveRsion (ECI) during data-free meta training and invErsion calibRation following inner loop (ICFIL) during meta testing.
ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model.
We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner.
arXiv Detail & Related papers (2023-03-20T15:10:41Z)
- Intersection of Parallels as an Early Stopping Criterion [64.8387564654474]
We propose a method to spot an early stopping point in the training iterations without the need for a validation set.
For a wide range of learning rates, our method, called Cosine-Distance Criterion (CDC), leads to better generalization on average than all the methods that we compare against.
arXiv Detail & Related papers (2022-08-19T19:42:41Z)
- Effective and Efficient Training for Sequential Recommendation using Recency Sampling [91.02268704681124]
We propose a novel Recency-based Sampling of Sequences training objective.
We show that models enhanced with our method achieve performance exceeding or very close to state-of-the-art BERT4Rec.
arXiv Detail & Related papers (2022-07-06T13:06:31Z)
- Activation to Saliency: Forming High-Quality Labels for Unsupervised Salient Object Detection [54.92703325989853]
We propose a two-stage Activation-to-Saliency (A2S) framework that effectively generates high-quality saliency cues.
No human annotations are involved in our framework during the whole training process.
Our framework achieves significant performance improvements over existing USOD methods.
arXiv Detail & Related papers (2021-12-07T11:54:06Z)
- DETReg: Unsupervised Pretraining with Region Priors for Object Detection [103.93533951746612]
DETReg is a new self-supervised method that pretrains the entire object detection network.
During pretraining, DETReg predicts object localizations to match the localizations from an unsupervised region proposal generator.
It simultaneously aligns the corresponding feature embeddings with embeddings from a self-supervised image encoder.
arXiv Detail & Related papers (2021-06-08T17:39:14Z)
- UP-DETR: Unsupervised Pre-training for Object Detection with Transformers [11.251593386108189]
We propose a novel pretext task named random query patch detection in Unsupervised Pre-training DETR (UP-DETR).
Specifically, we randomly crop patches from the given image and then feed them as queries to the decoder.
UP-DETR significantly boosts the performance of DETR with faster convergence and higher average precision on object detection, one-shot detection and panoptic segmentation.
arXiv Detail & Related papers (2020-11-18T05:16:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.