RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer
- URL: http://arxiv.org/abs/2407.17140v1
- Date: Wed, 24 Jul 2024 10:20:19 GMT
- Title: RT-DETRv2: Improved Baseline with Bag-of-Freebies for Real-Time Detection Transformer
- Authors: Wenyu Lv, Yian Zhao, Qinyao Chang, Kui Huang, Guanzhong Wang, Yi Liu,
- Abstract summary: RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR.
To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales.
To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator.
- Score: 2.1186155813156926
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In this report, we present RT-DETRv2, an improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 builds upon the previous state-of-the-art real-time detector, RT-DETR, and opens up a set of bag-of-freebies for flexibility and practicality, as well as optimizing the training strategy to achieve enhanced performance. To improve the flexibility, we suggest setting a distinct number of sampling points for features at different scales in the deformable attention to achieve selective multi-scale feature extraction by the decoder. To enhance practicality, we propose an optional discrete sampling operator to replace the grid_sample operator that is specific to RT-DETR compared to YOLOs. This removes the deployment constraints typically associated with DETRs. For the training strategy, we propose dynamic data augmentation and scale-adaptive hyperparameters customization to improve performance without loss of speed. Source code and pre-trained models will be available at https://github.com/lyuwenyu/RT-DETR.
Related papers
- RemoteVAR: Autoregressive Visual Modeling for Remote Sensing Change Detection [52.32112533846212]
Remote sensing change detection is central to applications such as environmental monitoring and disaster assessment.<n>Visual autoregressive models have recently shown impressive image generation capability, but their adoption for pixel-level discriminative tasks remains limited due to weak controllability, suboptimal dense prediction performance and exposure bias.<n>We introduce a new VAR-based change detection framework that addresses these limitations by conditioning autoregressive prediction on multi-resolution fused bi-temporal features via cross-attention, and by employing an autoregressive training strategy designed specifically for change map prediction.
arXiv Detail & Related papers (2026-01-17T03:50:00Z) - RAPID^3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformer [86.57077884971478]
Diffusion Transformers (DiTs) excel at visual generation yet remain hampered by slow sampling.<n>We introduce RAPID3: Tri-Level Reinforced Acceleration Policies for Diffusion Transformers.<n>It delivers image-wise acceleration with zero updates to the base generator.<n>It achieves nearly 3x faster sampling with competitive generation quality.
arXiv Detail & Related papers (2025-09-26T13:20:52Z) - TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs [67.55973229034319]
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks.<n>We show that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-09-22T17:30:15Z) - RTTC: Reward-Guided Collaborative Test-Time Compute [0.9208007322096533]
Test-Time Compute (TTC) has emerged as a powerful paradigm for enhancing the performance of Large Language Models (LLMs) at inference.<n>We introduce Reward-Guided Test-Time Compute (RTTC), a novel framework that adaptively selects the most effective TTC strategy for each query.<n>RTTC operates in a distributed server-client architecture, retrieving relevant samples from a remote knowledge base and applying RAG or lightweight fine-tuning on client devices only when necessary.
arXiv Detail & Related papers (2025-08-07T21:18:52Z) - SkipVAR: Accelerating Visual Autoregressive Modeling via Adaptive Frequency-Aware Skipping [30.85025293160079]
High-frequency components, or later steps, in the generation process contribute disproportionately to inference latency.<n>We identify two primary sources of inefficiency: step redundancy and unconditional branch redundancy.<n>We propose an automatic step-skipping strategy that selectively omits unnecessary generation steps to improve efficiency.
arXiv Detail & Related papers (2025-06-10T15:35:29Z) - AdaptSR: Low-Rank Adaptation for Efficient and Scalable Real-World Super-Resolution [50.584551250242235]
AdaptSR is a low-rank adaptation framework that efficiently repurposes bi-cubic-trained SR models for real-world tasks.
Our experiments demonstrate that AdaptSR outperforms GAN and diffusion-based SR methods by up to 4 dB in PSNR and 2% in perceptual scores on real SR benchmarks.
arXiv Detail & Related papers (2025-03-10T18:03:18Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by Kronecker product to Aggregate Low Rank Experts.
Thanks to the artful design, ALoRE maintains negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation [12.511829774226113]
We propose an ultra-lightweight (1M) visual-inertial odometry (VIO) network capable of test-time adaptation (TTA) based on visual-inertial consistency.
It achieves 36X smaller network size than state-of-the-art with a minute increase in error -- 1% on the KITTI dataset.
arXiv Detail & Related papers (2024-09-19T22:24:14Z) - RT-DETRv3: Real-time End-to-End Object Detection with Hierarchical Dense Positive Supervision [7.721101317599364]
We propose a hierarchical dense positive supervision method based on RT-DETR, named RT-DETRv3.
To address insufficient decoder training, we propose a novel learning strategy involving self-attention perturbation.
RT-DETRv3 significantly outperforms existing real-time detectors, including the RT-DETR series and the YOLO series.
arXiv Detail & Related papers (2024-09-13T02:02:07Z) - Cascaded Temporal Updating Network for Efficient Video Super-Resolution [47.63267159007611]
Key components in recurrent-based VSR networks significantly impact model efficiency.
We propose a cascaded temporal updating network (CTUN) for efficient VSR.
CTUN achieves a favorable trade-off between efficiency and performance compared to existing methods.
arXiv Detail & Related papers (2024-08-26T12:59:32Z) - PRformer: Pyramidal Recurrent Transformer for Multivariate Time Series Forecasting [82.03373838627606]
Self-attention mechanism in Transformer architecture requires positional embeddings to encode temporal order in time series prediction.
We argue that this reliance on positional embeddings restricts the Transformer's ability to effectively represent temporal sequences.
We present a model integrating PRE with a standard Transformer encoder, demonstrating state-of-the-art performance on various real-world datasets.
arXiv Detail & Related papers (2024-08-20T01:56:07Z) - REP: Resource-Efficient Prompting for On-device Continual Learning [23.92661395403251]
On-device continual learning (CL) requires the co-optimization of model accuracy and resource efficiency to be practical.
It is commonly believed that CNN-based CL excels in resource efficiency, whereas ViT-based CL is superior in model performance.
We introduce REP, which improves resource efficiency specifically targeting prompt-based rehearsal-free methods.
arXiv Detail & Related papers (2024-06-07T09:17:33Z) - LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection [63.780355815743135]
We present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection.
The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder.
arXiv Detail & Related papers (2024-06-05T17:07:24Z) - VTR: An Optimized Vision Transformer for SAR ATR Acceleration on FPGA [2.8595179027282907]
Vision Transformers (ViTs) are the current state-of-the-art in various computer vision applications.
We develop a lightweight ViT model that can be trained directly on small datasets without any pre-training.
We evaluate our proposed model, that we call VTR (ViT for SAR ATR) on three widely used SAR datasets.
arXiv Detail & Related papers (2024-04-06T06:49:55Z) - Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z) - Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF [6.135925201075925]
We propose the dynamic PlenOctree DOT, which adaptively refines the sample distribution to adjust to changing scene complexity.
Compared with POT, our DOT outperforms it by enhancing visual quality, reducing over $55.15$/$68.84%$ parameters, and providing 1.7/1.9 times FPS for NeRF-synthetic and Tanks $&$ Temples, respectively.
arXiv Detail & Related papers (2023-07-28T06:21:42Z) - RTFormer: Efficient Design for Real-Time Semantic Segmentation with
Transformer [63.25665813125223]
We propose RTFormer, an efficient dual-resolution transformer for real-time semantic segmenation.
It achieves better trade-off between performance and efficiency than CNN-based models.
Experiments on mainstream benchmarks demonstrate the effectiveness of our proposed RTFormer.
arXiv Detail & Related papers (2022-10-13T16:03:53Z) - Recurrent Glimpse-based Decoder for Detection with Transformer [85.64521612986456]
We introduce a novel REcurrent Glimpse-based decOder (REGO) in this paper.
In particular, the REGO employs a multi-stage recurrent processing structure to help the attention of DETR gradually focus on foreground objects.
REGO consistently boosts the performance of different DETR detectors by up to 7% relative gain at the same setting of 50 training epochs.
arXiv Detail & Related papers (2021-12-09T00:29:19Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.