RQFormer: Rotated Query Transformer for End-to-End Oriented Object Detection
- URL: http://arxiv.org/abs/2311.17629v4
- Date: Mon, 16 Dec 2024 14:56:52 GMT
- Title: RQFormer: Rotated Query Transformer for End-to-End Oriented Object Detection
- Authors: Jiaqi Zhao, Zeyu Ding, Yong Zhou, Hancheng Zhu, Wenliang Du, Rui Yao, Abdulmotaleb El Saddik
- Abstract summary: Oriented object detection presents a challenging task due to the presence of object instances with multiple orientations, varying scales, and dense distributions.
We propose an end-to-end oriented detector called the Rotated Query Transformer, which integrates two key technologies.
Experiments conducted on four remote sensing datasets and one scene text dataset demonstrate the effectiveness of our method.
- Score: 26.37802649901314
- Abstract: Oriented object detection presents a challenging task due to the presence of object instances with multiple orientations, varying scales, and dense distributions. Recently, end-to-end detectors have made significant strides by employing attention mechanisms and refining a fixed number of queries through consecutive decoder layers. However, existing end-to-end oriented object detectors still face two primary challenges: 1) misalignment between positional queries and keys, leading to inconsistency between classification and localization; and 2) the presence of a large number of similar queries, which complicates one-to-one label assignments and optimization. To address these limitations, we propose an end-to-end oriented detector called the Rotated Query Transformer, which integrates two key technologies: Rotated RoI Attention (RRoI Attention) and Selective Distinct Queries (SDQ). First, RRoI Attention aligns positional queries and keys from oriented regions of interest through cross-attention. Second, SDQ collects queries from intermediate decoder layers and filters out similar ones to generate distinct queries, thereby facilitating the optimization of one-to-one label assignments. Finally, extensive experiments conducted on four remote sensing datasets and one scene text dataset demonstrate the effectiveness of our method. To further validate its generalization capability, we also extend our approach to horizontal object detection. The code is available at https://github.com/wokaikaixinxin/RQFormer.
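The SDQ idea described in the abstract (pool queries from intermediate decoder layers, then drop near-duplicates) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how similarity is measured, so the cosine-similarity test, the greedy highest-score-first ordering, the `threshold` value, and the names `cosine_similarity` and `select_distinct_queries` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_distinct_queries(
    layer_queries: List[List[List[float]]],
    layer_scores: List[List[float]],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> List[List[float]]:
    """Pool queries from several decoder layers and greedily keep distinct ones.

    Assumption: a query counts as "similar" to an already kept query when
    their cosine similarity is at or above `threshold`.
    """
    # Collect (score, query) pairs from every intermediate layer.
    pooled = [
        (score, query)
        for queries, scores in zip(layer_queries, layer_scores)
        for query, score in zip(queries, scores)
    ]
    # Visit the most confident queries first so they win over near-duplicates.
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    kept: List[List[float]] = []
    for _, query in pooled:
        if all(cosine_similarity(query, other) < threshold for other in kept):
            kept.append(query)
    return kept
```

Given two layers of 2-D queries, `select_distinct_queries([[[1.0, 0.0], [0.0, 1.0]], [[0.99, 0.01], [-1.0, 0.0]]], [[0.9, 0.8], [0.7, 0.6]])` would drop `[0.99, 0.01]` as a near-duplicate of `[1.0, 0.0]` and keep the other three.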
Related papers
- Dense Object Detection Based on De-homogenized Queries [12.33849715319161]
Dense object detection is widely used in automatic driving, video surveillance, and other fields.
Currently, detection methods based on greedy algorithms, such as non-maximum suppression (NMS), often produce many repetitive predictions or missed detections in dense scenarios.
Using the end-to-end DETR (DEtection TRansformer), a type of detector that can incorporate the de-duplication capability of post-processing steps such as NMS into the network, we found that homogeneous queries in query-based detectors reduce both the de-duplication capability of the network and the learning efficiency of the encoder.
arXiv Detail & Related papers (2025-02-11T02:36:10Z) - OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images [26.37802649901314]
Oriented object detection in remote sensing images is a challenging task due to objects being distributed in multi-orientation.
We propose an end-to-end transformer-based oriented object detector consisting of three dedicated modules to address these issues.
Compared with previous end-to-end detectors, the OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$.
arXiv Detail & Related papers (2024-09-29T10:36:33Z) - Towards End-to-End Semi-Supervised Table Detection with Semantic Aligned Matching Transformer [12.042768320132694]
Table detection within document images is a crucial task in document processing, involving the identification and localization of tables.
Recent strides in deep learning have substantially improved the accuracy of this task, but it still relies on large labeled datasets for effective training.
We introduce a semi-supervised approach employing SAM-DETR, a novel approach for precise alignment between object queries and target features.
arXiv Detail & Related papers (2024-04-30T20:25:57Z) - Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection [54.041049052843604]
We present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection.
First, to model the inter-object spatial interaction and complex temporal dependencies, we introduce the spatial-temporal graph attention network.
Finally, it poses a challenge for the network to distinguish between the positive query and other highly similar queries that are not the best match.
arXiv Detail & Related papers (2023-07-01T13:53:14Z) - End-to-End Lane detection with One-to-Several Transformer [6.79236957488334]
O2SFormer converges 12.5x faster than DETR for the ResNet18 backbone.
O2SFormer with ResNet50 backbone achieves 77.83% F1 score on CULane dataset, outperforming existing Transformer-based and CNN-based detectors.
arXiv Detail & Related papers (2023-05-01T06:07:11Z) - Enhanced Training of Query-Based Object Detection via Selective Query Recollection [35.3219210570517]
This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage.
We design and present Selective Query Recollection, a simple and effective training strategy for query-based object detectors.
arXiv Detail & Related papers (2022-12-15T02:45:57Z) - End-to-end Tracking with a Multi-query Transformer [96.13468602635082]
Multiple-object tracking (MOT) is a challenging task that requires simultaneous reasoning about location, appearance, and identity of the objects in the scene over time.
Our aim in this paper is to move beyond tracking-by-detection approaches, to class-agnostic tracking that performs well also for unknown object classes.
arXiv Detail & Related papers (2022-10-26T10:19:37Z) - Dynamic Focus-aware Positional Queries for Semantic Segmentation [94.6834904076914]
We propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries.
Our framework achieves SOTA performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones.
arXiv Detail & Related papers (2022-04-04T05:16:41Z) - Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z) - MD-CSDNetwork: Multi-Domain Cross Stitched Network for Deepfake Detection [80.83725644958633]
Current deepfake generation methods leave discriminative artifacts in the frequency spectrum of fake images and videos.
We present a novel approach, termed as MD-CSDNetwork, for combining the features in the spatial and frequency domains to mine a shared discriminative representation.
arXiv Detail & Related papers (2021-09-15T14:11:53Z) - MRDet: A Multi-Head Network for Accurate Oriented Object Detection in Aerial Images [51.227489316673484]
We propose an arbitrary-oriented region proposal network (AO-RPN) to generate oriented proposals transformed from horizontal anchors.
To obtain accurate bounding boxes, we decouple the detection task into multiple subtasks and propose a multi-head network.
Each head is specially designed to learn the features optimal for the corresponding task, which allows our network to detect objects accurately.
arXiv Detail & Related papers (2020-12-24T06:36:48Z)