Related papers: RQFormer: Rotated Query Transformer for End-to-End Oriented Object Detection

URL: http://arxiv.org/abs/2311.17629v4
Date: Mon, 16 Dec 2024 14:56:52 GMT
Title: RQFormer: Rotated Query Transformer for End-to-End Oriented Object Detection
Authors: Jiaqi Zhao, Zeyu Ding, Yong Zhou, Hancheng Zhu, Wenliang Du, Rui Yao, Abdulmotaleb El Saddik
Abstract summary: Oriented object detection presents a challenging task due to the presence of object instances with multiple orientations, varying scales, and dense distributions. We propose an end-to-end oriented detector called the Rotated Query Transformer, which integrates two key technologies. Experiments conducted on four remote sensing datasets and one scene text dataset demonstrate the effectiveness of our method.
Score: 26.37802649901314
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Oriented object detection presents a challenging task due to the presence of object instances with multiple orientations, varying scales, and dense distributions. Recently, end-to-end detectors have made significant strides by employing attention mechanisms and refining a fixed number of queries through consecutive decoder layers. However, existing end-to-end oriented object detectors still face two primary challenges: 1) misalignment between positional queries and keys, leading to inconsistency between classification and localization; and 2) the presence of a large number of similar queries, which complicates one-to-one label assignments and optimization. To address these limitations, we propose an end-to-end oriented detector called the Rotated Query Transformer, which integrates two key technologies: Rotated RoI Attention (RRoI Attention) and Selective Distinct Queries (SDQ). First, RRoI Attention aligns positional queries and keys from oriented regions of interest through cross-attention. Second, SDQ collects queries from intermediate decoder layers and filters out similar ones to generate distinct queries, thereby facilitating the optimization of one-to-one label assignments. Finally, extensive experiments conducted on four remote sensing datasets and one scene text dataset demonstrate the effectiveness of our method. To further validate its generalization capability, we also extend our approach to horizontal object detection. The code is available at https://github.com/wokaikaixinxin/RQFormer.

Related papers

Dual-Stream Attention with Multi-Modal Queries for Object Detection in Transportation Applications [6.603505460200282]
Transformer-based object detectors often struggle with occlusions, fine-grained localization, and computational inefficiency caused by fixed queries and dense attention. We propose DAMM, Dual-stream Attention with Multi-Modal queries, a novel framework introducing both query adaptation and structured cross-attention for improved accuracy and efficiency.
arXiv Detail & Related papers (2025-08-06T20:37:24Z)
DS-Det: Single-Query Paradigm and Attention Disentangled Learning for Flexible Object Detection [39.56089737473775]
We propose DS-Det, a more efficient transformer detector capable of detecting a flexible number of objects in images. Specifically, we reformulate and introduce a new unified Single-Query paradigm for decoder modeling. We also propose a simplified decoder framework through attention disentangled learning.
arXiv Detail & Related papers (2025-07-26T05:40:04Z)
Dense Object Detection Based on De-homogenized Queries [12.33849715319161]
Dense object detection is widely used in automatic driving, video surveillance, and other fields. Currently, detection methods based on greedy algorithms, such as non-maximum suppression (NMS), often produce many repetitive predictions or missed detections in dense scenarios. Using the end-to-end DETR (DEtection TRansformer), a type of detector that can incorporate the de-duplication capability of post-processing such as NMS into the network, we found that homogeneous queries in query-based detectors reduce the network's de-duplication capability and the encoder's learning efficiency.
arXiv Detail & Related papers (2025-02-11T02:36:10Z)
OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images [26.37802649901314]
Oriented object detection in remote sensing images is a challenging task due to objects being distributed in multi-orientation. We propose an end-to-end transformer-based oriented object detector consisting of three dedicated modules to address these issues. Compared with previous end-to-end detectors, the OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$.
arXiv Detail & Related papers (2024-09-29T10:36:33Z)
Renormalized Connection for Scale-preferred Object Detection in Satellite Imagery [51.83786195178233]
We design a Knowledge Discovery Network (KDN) to implement the renormalization group theory in terms of efficient feature extraction. Renormalized connection (RC) on the KDN enables "synergistic focusing" of multi-scale features. RCs extend the multi-level feature's "divide-and-conquer" mechanism of the FPN-based detectors to a wide range of scale-preferred tasks.
arXiv Detail & Related papers (2024-09-09T13:56:22Z)
Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer [12.042768320132694]
Table detection within document images is a crucial task in document processing, involving the identification and localization of tables. Recent strides in deep learning have substantially improved the accuracy of this task, but it still relies on large labeled datasets for effective training. We introduce a semi-supervised approach employing SAM-DETR, a novel approach for precise alignment between object queries and target features.
arXiv Detail & Related papers (2024-04-30T20:25:57Z)
Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement [19.277560848076984]
Two-stage selection strategies result in scale bias and redundancy due to mismatch between selected queries and objects. We propose hierarchical salience filtering refinement, which performs transformer encoding only on filtered discriminative queries. The proposed Salience DETR achieves significant improvements of +4.0% AP, +0.2% AP, +4.4% AP on three challenging task-specific detection datasets.
arXiv Detail & Related papers (2024-03-24T13:01:57Z)
PETDet: Proposal Enhancement for Two-Stage Fine-Grained Object Detection [26.843891792018447]
We present PETDet (Proposal Enhancement for Two-stage fine-grained object detection) to better handle the sub-tasks in two-stage FGOD methods. An anchor-free Quality Oriented Proposal Network (QOPN) is proposed with dynamic label assignment and attention-based decomposition. A novel Adaptive Recognition Loss (ARL) offers guidance for the R-CNN head to focus on high-quality proposals.
arXiv Detail & Related papers (2023-12-16T18:04:56Z)
Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection. First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network. Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z)
End-to-End Lane detection with One-to-Several Transformer [6.79236957488334]
O2SFormer converges 12.5x faster than DETR for the ResNet18 backbone. O2SFormer with ResNet50 backbone achieves 77.83% F1 score on CULane dataset, outperforming existing Transformer-based and CNN-based detectors.
arXiv Detail & Related papers (2023-05-01T06:07:11Z)
Adaptive Rotated Convolution for Rotated Object Detection [96.94590550217718]
We present the Adaptive Rotated Convolution (ARC) module to handle the rotated object detection problem. In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images. The proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP.
arXiv Detail & Related papers (2023-03-14T11:53:12Z)
Enhanced Training of Query-Based Object Detection via Selective Query Recollection [35.3219210570517]
This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage. We design and present Selective Query Recollection, a simple and effective training strategy for query-based object detectors.
arXiv Detail & Related papers (2022-12-15T02:45:57Z)
End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time. Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z)
Dynamic Focus-aware Positional Queries for Semantic Segmentation [94.6834904076914]
We propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries. Our framework achieves SOTA performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones.
arXiv Detail & Related papers (2022-04-04T05:16:41Z)
Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants. Standard attention heads learn a rigid mapping between search and retrieval. We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos. We present a novel approach, termed MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z)
CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection in Remote Sensing Images [0.9462808515258465]
In this paper, we discuss the role of discriminative features in object detection. We then propose a Critical Feature Capturing Network (CFC-Net) to improve detection accuracy. We show that our method achieves superior detection performance compared with many state-of-the-art approaches.
arXiv Detail & Related papers (2021-01-18T02:31:09Z)
MRDet: A Multi-Head Network for Accurate Oriented Object Detection in Aerial Images [51.227489316673484]
We propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors. To obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network. Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately.
arXiv Detail & Related papers (2020-12-24T06:36:48Z)
AFD-Net: Adaptive Fully-Dual Network for Few-Shot Object Detection [8.39479809973967]
Few-shot object detection (FSOD) aims at learning a detector that can quickly adapt to previously unseen objects with scarce examples. Existing methods solve this problem by performing subtasks of classification and localization utilizing a shared component. We argue that a general few-shot detector should consider the explicit decomposition of the two subtasks, as well as leverage information from both of them to enhance feature representations.
arXiv Detail & Related papers (2020-11-30T10:21:32Z)
Anchor-free Small-scale Multispectral Pedestrian Detection [88.7497134369344]
We propose a method for effective and efficient multispectral fusion of the two modalities in an adapted single-stage anchor-free base architecture. We aim at learning pedestrian representations based on object center and scale rather than direct bounding box predictions. Results show our method's effectiveness in detecting small-scaled pedestrians.
arXiv Detail & Related papers (2020-08-19T13:13:01Z)
FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking [92.48078680697311]
Multi-object tracking (MOT) is an important problem in computer vision. We present a simple yet effective approach termed FairMOT based on the anchor-free object detection architecture CenterNet. The approach achieves high accuracy for both detection and tracking.
arXiv Detail & Related papers (2020-04-04T08:18:00Z)

This list is automatically generated from the titles and abstracts of the papers on this site.