Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
- URL: http://arxiv.org/abs/2602.21010v1
- Date: Tue, 24 Feb 2026 15:29:55 GMT
- Title: Le-DETR: Revisiting Real-Time Detection Transformer with Efficient Encoder Design
- Authors: Jiannan Huang, Aditya Kane, Fengzhe Zhou, Yunchao Wei, Humphrey Shi,
- Abstract summary: We present Le-DETR (Low-cost and Efficient DEtection TRansformer). It achieves a new SOTA in real-time detection using only the ImageNet-1K and COCO 2017 training datasets. It surpasses YOLOv12-L/X by +0.6/-0.1 mAP while achieving similar speed and a +20% speedup.
- Score: 72.55935017828891
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time object detection is crucial for real-world applications, as it demands high accuracy at low latency. While Detection Transformers (DETR) have demonstrated significant performance improvements, current real-time DETR models are difficult to reproduce from scratch due to excessive backbone pre-training overheads, constraining research by hindering the exploration of novel backbone architectures. In this paper, we show that with generally good design, it is possible to achieve high performance with low pre-training cost. After a thorough study of backbone architectures, we propose EfficientNAT at various scales, which incorporates modern efficient convolution and local attention mechanisms. Moreover, we re-design the hybrid encoder with local attention, significantly enhancing both performance and inference speed. Based on these advancements, we present Le-DETR (Low-cost and Efficient DEtection TRansformer), which achieves a new SOTA in real-time detection using only the ImageNet-1K and COCO 2017 training datasets, saving about 80% of the pre-training images compared with previous methods. We demonstrate that, when well designed, real-time DETR models can achieve strong performance without complex and computationally expensive pre-training. Extensive experiments show that Le-DETR-M/L/X achieves 52.9/54.3/55.1 mAP on COCO val2017 at 4.45/5.01/6.68 ms on an RTX 4090. It surpasses YOLOv12-L/X by +0.6/-0.1 mAP while achieving similar speed and a +20% speedup. Compared with DEIM-D-FINE, Le-DETR-M achieves +0.2 mAP with slightly faster inference, and it surpasses DEIM-D-FINE-L by +0.4 mAP with only 0.4 ms of additional latency. Code and weights will be open-sourced.
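As an illustration of the local-attention idea behind the re-designed encoder (a toy NumPy sketch of non-overlapping window self-attention, not the actual Le-DETR implementation), attention computed independently inside each window scales linearly rather than quadratically with sequence length:

```python
import numpy as np

def local_window_attention(x, window=4):
    """Toy non-overlapping window self-attention over a 1-D token sequence.

    x: (seq_len, dim) array; seq_len must be divisible by `window`.
    Attention is computed independently inside each window, so the cost is
    linear in sequence length rather than quadratic.
    """
    seq_len, dim = x.shape
    assert seq_len % window == 0
    out = np.empty_like(x)
    for start in range(0, seq_len, window):
        w = x[start:start + window]                   # (window, dim)
        scores = w @ w.T / np.sqrt(dim)               # (window, window)
        scores -= scores.max(axis=-1, keepdims=True)  # numeric stability
        attn = np.exp(scores)
        attn /= attn.sum(axis=-1, keepdims=True)
        out[start:start + window] = attn @ w
    return out

tokens = np.random.default_rng(0).normal(size=(16, 8))
y = local_window_attention(tokens, window=4)
print(y.shape)  # (16, 8)
```

Real implementations add learned query/key/value projections and multiple heads; this sketch only shows why restricting attention to local windows keeps the encoder fast at high resolution.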
Related papers
- Lightweight Test-Time Adaptation for EMG-Based Gesture Recognition [2.414036142474149]
We propose a lightweight framework for Test-Time Adaptation (TTA) using a Temporal Convolutional Network (TCN) backbone.
We introduce three deployment-ready strategies: (i) causal adaptive batch normalization for real-time statistical alignment; (ii) Gaussian Mixture Model (GMM) alignment with experience replay to prevent forgetting; and (iii) meta-learning for rapid, few-shot calibration.
Our results show that experience-replay updates yield superior stability under limited data, while meta-learning achieves competitive performance in one- and two-shot regimes.
arXiv Detail & Related papers (2026-01-07T18:48:31Z) - YOLO-Master: MOE-Accelerated with Specialized Transformers for Enhanced Real-time Detection [26.013463778761317]
YOLO-Master is a novel YOLO-like framework that introduces instance-conditional adaptive computation for real-time object detection.
Our model achieves 42.4% AP with 1.62 ms latency, outperforming YOLOv13-N by +0.8% mAP with 17.8% faster inference.
arXiv Detail & Related papers (2025-12-29T07:54:49Z) - RTS-Mono: A Real-Time Self-Supervised Monocular Depth Estimation Method for Real-World Deployment [10.19871006168469]
RTS-Mono is a lightweight and efficient encoder-decoder architecture.
It achieves state-of-the-art (SoTA) performance at both high and low resolutions.
It can perform real-time inference on an Nvidia Jetson Orin at 49 FPS.
arXiv Detail & Related papers (2025-11-18T03:47:04Z) - PPMStereo: Pick-and-Play Memory Construction for Consistent Dynamic Stereo Matching [51.98089287914147]
Inspired by the two-stage decision-making process in humans, we propose a Pick-and-Play Memory (PPM) construction module for dynamic stereo matching, dubbed PPMStereo.
arXiv Detail & Related papers (2025-10-23T03:52:39Z) - Flexiffusion: Training-Free Segment-Wise Neural Architecture Search for Efficient Diffusion Models [50.260693393896716]
Diffusion models (DMs) are powerful generative models capable of producing high-fidelity images but constrained by high computational costs.
We propose Flexiffusion, a training-free NAS framework that jointly optimizes generation schedules and model architectures without modifying pre-trained parameters.
Our work pioneers a resource-efficient paradigm for searching high-speed DMs without sacrificing quality.
arXiv Detail & Related papers (2025-06-03T06:02:50Z) - Output Scaling: YingLong-Delayed Chain of Thought in a Large Pretrained Time Series Forecasting Model [55.25659103706409]
This framework achieves state-of-the-art performance for our designed foundation model, YingLong.
YingLong is a non-causal, bidirectional-attention, encoder-only transformer trained through masked token recovery.
We release four foundation models ranging from 6M to 300M parameters, demonstrating superior results in zero-shot tasks.
arXiv Detail & Related papers (2025-05-20T14:31:06Z) - PiT: Progressive Diffusion Transformer [50.46345527963736]
Diffusion Transformers (DiTs) achieve remarkable performance in image generation via the transformer architecture.
We find that DiTs do not rely as heavily on global information as previously believed.
We propose a series of Pseudo Progressive Diffusion Transformers (PiT).
arXiv Detail & Related papers (2025-05-19T15:02:33Z) - TSPulse: Dual Space Tiny Pre-Trained Models for Rapid Time-Series Analysis [12.034816114258803]
TSPulse is an ultra-compact time-series pre-trained model with only 1M parameters.
It performs strongly across classification, anomaly detection, imputation, and retrieval tasks.
These results are achieved with just 1M parameters (10-100X smaller than existing SOTA models).
arXiv Detail & Related papers (2025-05-19T12:18:53Z) - Ultra-Resolution Adaptation with Ease [62.56434979517156]
We propose a set of key guidelines for ultra-resolution adaptation, termed URAE.
We show that tuning minor components of the weight matrices outperforms widely-used low-rank adapters when synthetic data are unavailable.
Experiments validate that URAE achieves 2K-generation performance comparable to state-of-the-art closed-source models like FLUX1.1 [Pro] Ultra with only 3K samples and 2K iterations.
arXiv Detail & Related papers (2025-03-20T16:44:43Z) - DEIM: DETR with Improved Matching for Fast Convergence [28.24665757155962]
We introduce DEIM, a training framework designed to accelerate convergence in real-time object detection with Transformer-based architectures (DETR).
To mitigate the sparse supervision inherent in one-to-one (O2O) matching in DETR models, DEIM employs a Dense O2O matching strategy.
While Dense O2O matching speeds up convergence, it also introduces numerous low-quality matches that could affect performance.
arXiv Detail & Related papers (2024-12-05T15:10:13Z) - RMT: Retentive Networks Meet Vision Transformers [55.76528783956601]
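A toy sketch of the supervision gap that Dense O2O targets (a brute-force stand-in for the Hungarian matcher, not DEIM's actual implementation): with one-to-one matching, only as many queries as ground-truth objects receive positive supervision, so adding targets increases the number of supervised queries while matching stays one-to-one.

```python
from itertools import permutations

def one_to_one_match(cost):
    """Brute-force optimal one-to-one assignment (rows = queries, cols = targets).

    Returns the minimal-cost (query, target) pairs. Fine for the tiny matrices
    of this toy; real DETR-style detectors use the Hungarian algorithm.
    """
    n_queries, n_targets = len(cost), len(cost[0])
    best_total, best_pairs = float("inf"), None
    for perm in permutations(range(n_queries), n_targets):
        total = sum(cost[q][t] for t, q in enumerate(perm))
        if total < best_total:
            best_total, best_pairs = total, [(q, t) for t, q in enumerate(perm)]
    return sorted(best_pairs)

# 4 queries, 2 ground-truth objects: plain O2O supervises only 2 queries.
cost = [[0.1, 0.9],
        [0.8, 0.2],
        [0.3, 0.7],
        [0.9, 0.4]]
print(len(one_to_one_match(cost)))        # 2 positive queries

# Dense O2O idea: add more targets (e.g. via mosaic-style augmentation),
# crudely simulated here by duplicating each target column, so more
# queries receive positive supervision while matching stays one-to-one.
dense_cost = [row * 2 for row in cost]
print(len(one_to_one_match(dense_cost)))  # 4 positive queries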
The Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years.
Self-attention lacks explicit spatial priors and bears quadratic computational complexity.
We propose RMT, a strong vision backbone with explicit spatial priors for general purposes.
arXiv Detail & Related papers (2023-09-20T00:57:48Z) - Adaptive Sparsity Level during Training for Efficient Time Series Forecasting with Transformers [20.23085795744602]
We propose Adaptive Sparsity Level (PALS) to automatically seek a decent balance between loss and sparsity.
PALS draws inspiration from sparse training and during-training methods.
It introduces the novel "expand" mechanism in training sparse neural networks, allowing the model to dynamically shrink, expand, or remain stable to find a proper sparsity level.
arXiv Detail & Related papers (2023-05-28T06:57:27Z) - Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z) - Conditional DETR for Fast Training Convergence [76.95358216461524]
We present a conditional cross-attention mechanism for fast DETR training.
Our approach is motivated by the observation that cross-attention in DETR relies heavily on the content embeddings to localize the four extremities.
We show that conditional DETR converges 6.7x faster for the backbones R50 and R101 and 10x faster for stronger backbones DC5-R50 and DC5-R101.
arXiv Detail & Related papers (2021-08-13T10:07:46Z)
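The conditional cross-attention decomposition above can be sketched in NumPy (a simplified toy, not the paper's implementation): the attention score splits into a content term and a spatial term that are computed separately and summed, so localization can lean on the spatial part rather than on content embeddings alone.

```python
import numpy as np

def conditional_cross_attention(q_content, q_spatial, k_content, k_pos, v):
    """Toy conditional cross-attention: content-content and spatial-spatial
    dot products are computed separately and summed, instead of mixing
    content and position inside a single dot product.
    """
    dim = q_content.shape[-1]
    scores = (q_content @ k_content.T + q_spatial @ k_pos.T) / np.sqrt(2 * dim)
    scores -= scores.max(axis=-1, keepdims=True)  # numeric stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v

rng = np.random.default_rng(0)
n_queries, n_keys, dim = 3, 10, 8
out = conditional_cross_attention(
    rng.normal(size=(n_queries, dim)),   # decoder content embeddings
    rng.normal(size=(n_queries, dim)),   # conditional spatial queries
    rng.normal(size=(n_keys, dim)),      # encoder content features
    rng.normal(size=(n_keys, dim)),      # encoder positional embeddings
    rng.normal(size=(n_keys, dim)))      # values
print(out.shape)  # (3, 8)
```

In the actual method, the spatial queries are predicted from the decoder embeddings rather than being free parameters; the sketch only shows the additive content/spatial score split.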
This list is automatically generated from the titles and abstracts of the papers in this site.