DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection
- URL: http://arxiv.org/abs/2507.19807v1
- Date: Sat, 26 Jul 2025 05:40:04 GMT
- Title: DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection
- Authors: Guiping Cao, Xiangyuan Lan, Wenjian Huang, Jianguo Zhang, Dongmei Jiang, Yaowei Wang,
- Abstract summary: We propose DS-Det, a more efficient transformer detector capable of detecting a flexible number of objects in images.<n>Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling.<n>We also propose a simplified decoder framework through attention disentangled learning.
- Score: 39.56089737473775
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Popular transformer detectors have achieved promising performance through query-based learning using attention mechanisms. However, the roles of existing decoder query types (e.g., content query and positional query) are still underexplored. These queries are generally predefined with a fixed number (fixed-query), which limits their flexibility. We find that the learning of these fixed-query is impaired by Recurrent Opposing inTeractions (ROT) between two attention operations: Self-Attention (query-to-query) and Cross-Attention (query-to-encoder), thereby degrading decoder efficiency. Furthermore, "query ambiguity" arises when shared-weight decoder layers are processed with both one-to-one and one-to-many label assignments during training, violating DETR's one-to-one matching principle. To address these challenges, we propose DS-Det, a more efficient detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling, transforming the fixed-query into flexible. Furthermore, we propose a simplified decoder framework through attention disentangled learning: locating boxes with Cross-Attention (one-to-many process), deduplicating predictions with Self-Attention (one-to-one process), addressing "query ambiguity" and "ROT" issues directly, and enhancing decoder efficiency. We further introduce a unified PoCoo loss that leverages box size priors to prioritize query learning on hard samples such as small objects. Extensive experiments across five different backbone models on COCO2017 and WiderPerson datasets demonstrate the general effectiveness and superiority of DS-Det. The source codes are available at https://github.com/Med-Process/DS-Det/.
Related papers
- Dynamic Object Queries for Transformer-based Incremental Object Detection [45.41291377837515]
Incremental object detection aims to sequentially learn new classes, while maintaining the capability to locate and identify old ones.
Prior methodologies mainly tackle the forgetting issue through knowledge distillation and exemplar replay.
We propose DyQ-DETR, which incrementally expands the model representation ability to achieve stability-plasticity tradeoffs.
arXiv Detail & Related papers (2024-07-31T15:29:34Z) - RQFormer: Rotated Query Transformer for End-to-End Oriented Object Detection [26.37802649901314]
Oriented object detection presents a challenging task due to the presence of object instances with multiple orientations, varying scales, and dense distributions.<n>We propose an end-to-end oriented detector called the Rotated Query Transformer, which integrates two key technologies.<n>Experiments conducted on four remote sensing datasets and one scene text dataset demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2023-11-29T13:43:17Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - Unsupervised and self-adaptative techniques for cross-domain person
re-identification [82.54691433502335]
Person Re-Identification (ReID) across non-overlapping cameras is a challenging task.
Unsupervised Domain Adaptation (UDA) is a promising alternative, as it performs feature-learning adaptation from a model trained on a source to a target domain without identity-label annotation.
In this paper, we propose a novel UDA-based ReID method that takes advantage of triplets of samples created by a new offline strategy.
arXiv Detail & Related papers (2021-03-21T23:58:39Z) - Decoupled and Memory-Reinforced Networks: Towards Effective Feature
Learning for One-Step Person Search [65.51181219410763]
One-step methods have been developed to handle pedestrian detection and identification sub-tasks using a single network.
There are two major challenges in the current one-step approaches.
We propose a decoupled and memory-reinforced network (DMRNet) to overcome these problems.
arXiv Detail & Related papers (2021-02-22T06:19:45Z) - AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection [8.39479809973967]
Few-shot object detection (FSOD) aims at learning a detector that can fast adapt to previously unseen objects with scarce examples.
Existing methods solve this problem by performing subtasks of classification and localization utilizing a shared component.
We present that a general few-shot detector should consider the explicit decomposition of two subtasks, as well as leveraging information from both of them to enhance feature representations.
arXiv Detail & Related papers (2020-11-30T10:21:32Z) - End-to-End Object Detection with Transformers [88.06357745922716]
We present a new method that views object detection as a direct set prediction problem.
Our approach streamlines the detection pipeline, effectively removing the need for many hand-designed components.
The main ingredients of the new framework, called DEtection TRansformer or DETR, are a set-based global loss.
arXiv Detail & Related papers (2020-05-26T17:06:38Z) - FairMOT: On the Fairness of Detection and Re-Identification in Multiple
Object Tracking [92.48078680697311]
Multi-object tracking (MOT) is an important problem in computer vision.
We present a simple yet effective approach termed as FairMOT based on the anchor-free object detection architecture CenterNet.
The approach achieves high accuracy for both detection and tracking.
arXiv Detail & Related papers (2020-04-04T08:18:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.