RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
- URL: http://arxiv.org/abs/2510.25257v1
- Date: Wed, 29 Oct 2025 08:13:17 GMT
- Title: RT-DETRv4: Painlessly Furthering Real-Time Object Detection with Vision Foundation Models
- Authors: Zijun Liao, Yian Zhao, Xin Shan, Yu Yan, Chang Liu, Lei Lu, Xiangyang Ji, Jie Chen
- Abstract summary: We propose a cost-effective and highly adaptable distillation framework to enhance lightweight object detectors. Our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
- Score: 48.91205564876609
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Real-time object detection has achieved substantial progress through meticulously designed architectures and optimization strategies. However, the pursuit of high-speed inference via lightweight network designs often leads to degraded feature representation, which hinders further performance improvements and practical on-device deployment. In this paper, we propose a cost-effective and highly adaptable distillation framework that harnesses the rapidly evolving capabilities of Vision Foundation Models (VFMs) to enhance lightweight object detectors. Given the significant architectural and learning objective disparities between VFMs and resource-constrained detectors, achieving stable and task-aligned semantic transfer is challenging. To address this, on one hand, we introduce a Deep Semantic Injector (DSI) module that facilitates the integration of high-level representations from VFMs into the deep layers of the detector. On the other hand, we devise a Gradient-guided Adaptive Modulation (GAM) strategy, which dynamically adjusts the intensity of semantic transfer based on gradient norm ratios. Without increasing deployment and inference overhead, our approach painlessly delivers striking and consistent performance gains across diverse DETR-based models, underscoring its practical utility for real-time detection. Our new model family, RT-DETRv4, achieves state-of-the-art results on COCO, attaining AP scores of 49.7/53.5/55.4/57.0 at corresponding speeds of 273/169/124/78 FPS.
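The abstract says GAM adjusts the intensity of semantic transfer based on gradient norm ratios, but it does not give the update rule. The following is a minimal sketch of one way such a modulation could look, not the authors' formulation. The function `modulate_weight`, its parameters (`target_ratio`, `momentum`, `min_w`, `max_w`), and the smoothing scheme are all hypothetical and chosen only for illustration. Gradients are plain lists of floats so the example runs with the standard library alone.

```python
import math


def l2_norm(grad):
    """L2 norm of a flat list of gradient values."""
    return math.sqrt(sum(g * g for g in grad))


def modulate_weight(weight, det_grad, distill_grad, target_ratio=0.5,
                    momentum=0.9, min_w=0.0, max_w=10.0, eps=1e-12):
    """Return an updated distillation-loss weight (hypothetical rule).

    The quantity being steered is
        ratio = weight * ||distill_grad|| / ||det_grad||
    and the weight is moved toward the value that makes ratio == target_ratio.
    """
    det_norm = l2_norm(det_grad)
    distill_norm = l2_norm(distill_grad)
    # Weight that would hit the target ratio exactly for these gradients.
    ideal = target_ratio * det_norm / (distill_norm + eps)
    # Exponential smoothing avoids abrupt jumps between training steps.
    new_w = momentum * weight + (1.0 - momentum) * ideal
    # Clamp so a vanishing distillation gradient cannot blow up the weight.
    return min(max(new_w, min_w), max_w)


# Usage: with fixed gradients the weight settles near the ideal value.
w = 1.0
for _ in range(50):
    w = modulate_weight(w, det_grad=[3.0, 4.0], distill_grad=[6.0, 8.0])
```

In this toy run the detection gradient has norm 5 and the distillation gradient has norm 10, so the weight drifts toward 0.25, the value at which the weighted distillation gradient is half as large as the detection gradient. In a real training loop the two gradient lists would be replaced by gradients of each loss with respect to shared detector features.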
Related papers
- StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models [98.72926158261937]
We propose a training-free token pruning framework for Visual AutoRegressive models. We employ a lightweight high-pass filter to capture local texture details, while leveraging Principal Component Analysis (PCA) to preserve global structural information. To maintain valid next-scale prediction under sparse tokens, we introduce a nearest neighbor feature propagation strategy.
arXiv Detail & Related papers (2026-03-02T11:35:05Z)
- SLGNet: Synergizing Structural Priors and Language-Guided Modulation for Multimodal Object Detection [28.779870703756668]
We propose SLGNet, a framework that synergizes hierarchical structural priors and language-guided modulation within a frozen Vision Transformer (ViT)-based foundation model. SLGNet achieves an mAP of 66.1, while reducing trainable parameters by approximately 87% compared to traditional full fine-tuning.
arXiv Detail & Related papers (2026-01-05T16:31:41Z)
- PAGen: Phase-guided Amplitude Generation for Domain-adaptive Object Detection [15.55359477953804]
Unsupervised domain adaptation (UDA) greatly facilitates the deployment of neural networks across diverse environments. We present a simple yet effective UDA method that learns to adapt image styles in the frequency domain to reduce the discrepancy between source and target domains.
arXiv Detail & Related papers (2025-11-27T02:22:37Z)
- DualGazeNet: A Biologically Inspired Dual-Gaze Query Network for Salient Object Detection [52.32976488996896]
We introduce DualGazeNet, a pure Transformer framework for salient object detection. Experiments on five RGB benchmarks show that DualGazeNet consistently surpasses 25 state-of-the-art CNN- and Transformer-based methods.
arXiv Detail & Related papers (2025-11-24T08:08:22Z)
- Source-Free Object Detection with Detection Transformer [59.33653163035064]
Source-Free Object Detection (SFOD) enables knowledge transfer from a source domain to an unsupervised target domain for object detection without access to source data. Most existing SFOD approaches are either confined to conventional object detection (OD) models like Faster R-CNN or designed as general solutions without tailored adaptations for novel OD architectures, especially Detection Transformer (DETR). In this paper, we introduce Feature Reweighting ANd Contrastive Learning NetworK (FRANCK), a novel SFOD framework specifically designed to perform query-centric feature enhancement for DETRs.
arXiv Detail & Related papers (2025-10-13T07:35:04Z)
- FMC-DETR: Frequency-Decoupled Multi-Domain Coordination for Aerial-View Object Detection [18.023418423273082]
We propose FMC-DETR, a novel framework with frequency-decoupled fusion for aerial-view object detection. First, we introduce the Wavelet Kolmogorov-Arnold Transformer (WeKat) backbone, which applies cascaded wavelet transforms to enhance global low-frequency context perception. Next, a lightweight Cross-stage Partial Fusion (CPF) module reduces redundancy and improves multi-scale feature interaction. Finally, we introduce the Multi-Domain Feature Coordination (MDFC) module, which unifies spatial, frequency, and structural priors to balance detail preservation and global enhancement.
arXiv Detail & Related papers (2025-09-27T02:28:22Z)
- GCRPNet: Graph-Enhanced Contextual and Regional Perception Network for Salient Object Detection in Optical Remote Sensing Images [68.33481681452675]
We propose a graph-enhanced contextual and regional perception network (GCRPNet). It builds upon the Mamba architecture to simultaneously capture long-range dependencies and enhance regional feature representation. It performs adaptive patch scanning on feature maps processed via multi-scale convolutions, thereby capturing rich local region information.
arXiv Detail & Related papers (2025-08-14T11:31:43Z)
- YOLOatr : Deep Learning Based Automatic Target Detection and Localization in Thermal Infrared Imagery [0.0]
We propose a modified anchor-based single-stage detector, called YOLOatr, with optimal modifications to the detection heads, feature fusion in the neck, and a custom augmentation profile. We evaluate the performance of our proposed model on a comprehensive DSIAC MWIR dataset for real-time ATR over both correlated and decorrelated testing protocols.
arXiv Detail & Related papers (2025-07-15T12:41:01Z)
- Fine-Tuning Florence2 for Enhanced Object Detection in Un-constructed Environments: Vision-Language Model Approach [0.0]
We fine-tuned the Florence2 model for object detection tasks in non-constructed, complex environments. The optimized Florence2 models exhibited significant improvements in object detection accuracy.
arXiv Detail & Related papers (2025-03-06T19:31:51Z)
- Progressive Self-Guided Loss for Salient Object Detection [102.35488902433896]
We present a progressive self-guided loss function to facilitate deep learning-based salient object detection in images. Our framework takes advantage of adaptively aggregated multi-scale features to locate and detect salient objects effectively.
arXiv Detail & Related papers (2021-01-07T07:33:38Z)