Route-DETR: Pairwise Query Routing in Transformers for Object Detection
- URL: http://arxiv.org/abs/2512.13876v1
- Date: Mon, 15 Dec 2025 20:26:58 GMT
- Title: Route-DETR: Pairwise Query Routing in Transformers for Object Detection
- Authors: Ye Zhang, Qi Chen, Wenyou Huang, Rui Liu, Zhengjian Kang,
- Abstract summary: Detection Transformer (DETR) offers an end-to-end solution for object detection.<n>DETR suffers from inefficient query competition where multiple queries converge to similar positions.<n>We present Route-DETR, which addresses these issues through adaptive pairwise routing in decoder self-attention layers.
- Score: 11.46025964297103
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Detection Transformer (DETR) offers an end-to-end solution for object detection by eliminating hand-crafted components like non-maximum suppression. However, DETR suffers from inefficient query competition where multiple queries converge to similar positions, leading to redundant computations. We present Route-DETR, which addresses these issues through adaptive pairwise routing in decoder self-attention layers. Our key insight is distinguishing between competing queries (targeting the same object) versus complementary queries (targeting different objects) using inter-query similarity, confidence scores, and geometry. We introduce dual routing mechanisms: suppressor routes that modulate attention between competing queries to reduce duplication, and delegator routes that encourage exploration of different regions. These are implemented via learnable low-rank attention biases enabling asymmetric query interactions. A dual-branch training strategy incorporates routing biases only during training while preserving standard attention for inference, ensuring no additional computational cost. Experiments on COCO and Cityscapes demonstrate consistent improvements across multiple DETR baselines, achieving +1.7% mAP gain over DINO on ResNet-50 and reaching 57.6% mAP on Swin-L, surpassing prior state-of-the-art models.
Related papers
- TopoCurate:Modeling Interaction Topology for Tool-Use Agent Training [53.93696896939915]
Training tool-use agents typically rely on Supervised Fine-Tuning (SFT) on successful trajectories and Reinforcement Learning (RL) on pass-rate-selected tasks.<n>We propose TopoCurate, an interaction-aware framework that projects multi-trial rollouts from the same task into a unified semantic quotient topology.<n>TopoCurate achieves consistent gains of 4.2% (SFT) and 6.9% (RL) over state-of-the-art baselines.
arXiv Detail & Related papers (2026-03-02T10:38:54Z) - TCAndon-Router: Adaptive Reasoning Router for Multi-Agent Collaboration [0.9564467981235256]
Multi-Agent Systems (MAS) have become a powerful paradigm for building high performance intelligent applications.<n>Within these systems, the router responsible for determining which expert agents should handle a given query plays a crucial role in overall performance.<n>To address these challenges, we propose TCAndon-TCAR: an adaptive reasoning router for multi-agent collaboration.<n>Experiments on public datasets and real enterprise data demonstrate that TCAR significantly improves routing accuracy, reduces routing conflicts, and remains robust in ambiguous scenarios.
arXiv Detail & Related papers (2026-01-08T03:17:33Z) - GCR: Geometry-Consistent Routing for Task-Agnostic Continual Anomaly Detection [5.925242470217676]
We propose a lightweight mixture-of-experts framework for stabilizing continual anomaly detection through geometry-consistent routing.<n>GCR routes each test image directly in a shared patch-embedding space by minimizing an accumulated nearest-prototype distance to category-specific prototype banks.<n> Experiments on MVTec AD and VisA show that geometry-consistent routing substantially improves routing stability and mitigates continual performance collapse.
arXiv Detail & Related papers (2026-01-05T07:33:50Z) - Robust Nearest Neighbour Retrieval Using Targeted Manifold Manipulation [0.0]
Nearest-neighbour retrieval is central to classification and explainable-AI pipelines.<n>We propose Targeted Manifold Manipulation-Nearest Neighbour (TMM-NN), which reconceptualises retrieval by assessing how readily each sample can be nudged into a designated region of the feature manifold.<n>TMM-NN implements this through a lightweight, query-specific trigger patch.
arXiv Detail & Related papers (2025-11-09T07:37:05Z) - Fractional Correspondence Framework in Detection Transformer [13.388933240897492]
The Detection Transformer (DETR) has significantly simplified the matching process in object detection tasks.<n>This algorithm facilitates optimal one-to-one matching of predicted bounding boxes to ground-truth annotations during training.<n>We propose a flexible matching strategy that captures the cost of aligning predictions with ground truths to find the most accurate correspondences.
arXiv Detail & Related papers (2025-03-06T05:29:20Z) - AdvQDet: Detecting Query-Based Adversarial Attacks with Adversarial Contrastive Prompt Tuning [93.77763753231338]
Adversarial Contrastive Prompt Tuning (ACPT) is proposed to fine-tune the CLIP image encoder to extract similar embeddings for any two intermediate adversarial queries.
We show that ACPT can detect 7 state-of-the-art query-based attacks with $>99%$ detection rate within 5 shots.
We also show that ACPT is robust to 3 types of adaptive attacks.
arXiv Detail & Related papers (2024-08-04T09:53:50Z) - Semi-DETR: Semi-Supervised Object Detection with Detection Transformers [105.45018934087076]
We analyze the DETR-based framework on semi-supervised object detection (SSOD)
We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector.
Our method outperforms all state-of-the-art methods by clear margins.
arXiv Detail & Related papers (2023-07-16T16:32:14Z) - Detection Transformer with Stable Matching [48.963171068785435]
We show that the most important design is to use and only use positional metrics to supervise classification scores of positive examples.
Under the principle, we propose two simple yet effective modifications by integrating positional metrics to DETR's classification loss and matching cost.
We achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50 backbones under 12 epochs and 24 epochs training settings.
arXiv Detail & Related papers (2023-04-10T17:55:37Z) - ReAct: Temporal Action Detection with Relational Queries [84.76646044604055]
This work aims at advancing temporal action detection (TAD) using an encoder-decoder framework with action queries.
We first propose a relational attention mechanism in the decoder, which guides the attention among queries based on their relations.
Lastly, we propose to predict the localization quality of each action query at inference in order to distinguish high-quality queries.
arXiv Detail & Related papers (2022-07-14T17:46:37Z) - Accelerating DETR Convergence via Semantic-Aligned Matching [50.3633635846255]
This paper presents SAM-DETR, a Semantic-Aligned-Matching DETR that greatly accelerates DETR's convergence without sacrificing its accuracy.
It explicitly searches salient points with the most discriminative features for semantic-aligned matching, which further speeds up the convergence and boosts detection accuracy as well.
arXiv Detail & Related papers (2022-03-14T06:50:51Z) - Pruning Self-attentions into Convolutional Layers in Single Path [89.55361659622305]
Vision Transformers (ViTs) have achieved impressive performance over various computer vision tasks.
We propose Single-Path Vision Transformer pruning (SPViT) to efficiently and automatically compress the pre-trained ViTs.
Our SPViT can trim 52.0% FLOPs for DeiT-B and get an impressive 0.6% top-1 accuracy gain simultaneously.
arXiv Detail & Related papers (2021-11-23T11:35:54Z) - RelationTrack: Relation-aware Multiple Object Tracking with Decoupled
Representation [3.356734463419838]
Existing online multiple object tracking (MOT) algorithms often consist of two subtasks, detection and re-identification (ReID)
In order to enhance the inference speed and reduce the complexity, current methods commonly integrate these double subtasks into a unified framework.
We devise a module named Global Context Disentangling (GCD) that decouples the learned representation into detection-specific and ReID-specific embeddings.
To resolve this restriction, we develop a module, referred to as Guided Transformer (GTE), by combining the powerful reasoning ability of Transformer encoder and deformable attention.
arXiv Detail & Related papers (2021-05-10T13:00:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.