Context-Aware Aerial Object Detection: Leveraging Inter-Object and Background Relationships
- URL: http://arxiv.org/abs/2404.04140v2
- Date: Thu, 28 Nov 2024 12:20:39 GMT
- Title: Context-Aware Aerial Object Detection: Leveraging Inter-Object and Background Relationships
- Authors: Botao Ren, Botian Xu, Xue Yang, Yifan Pu, Jingyi Wang, Zhidong Deng
- Abstract summary: In most modern object detection pipelines, the detection proposals are processed independently given the feature map.
We propose a framework that leverages the strengths of Transformer-based models and Contrastive Language-Image Pre-training features.
Our approach achieves consistent improvements, setting new state-of-the-art results with increases of 1.37 mAP$_{50}$ on DOTA-v1.0, 5.30 mAP$_{50}$ on DOTA-v1.5, 2.30 mAP$_{50}$ on DOTA-v2.0 and 3.23 mAP$_{50}$ on DIOR-R.
- Abstract: In most modern object detection pipelines, the detection proposals are processed independently given the feature map. Therefore, they overlook the underlying relationships between objects and the surrounding background, which could have provided additional context for accurate detection. Because aerial imagery is almost orthographic, the spatial relations in image space closely align with those in the physical world, and inter-object and object-background relationships become particularly significant. To address this oversight, we propose a framework that leverages the strengths of Transformer-based models and Contrastive Language-Image Pre-training (CLIP) features to capture such relationships. Specifically, building on two-stage detectors, we treat Region of Interest (RoI) proposals as tokens, accompanied by CLIP Tokens obtained from multi-level image segments. These tokens are then passed through a Transformer encoder, where specific spatial and geometric relations are incorporated into the attention weights, which are adaptively modulated and regularized. Additionally, we introduce self-supervised constraints on CLIP Tokens to ensure consistency. Extensive experiments on three benchmark datasets demonstrate that our approach achieves consistent improvements, setting new state-of-the-art results with increases of 1.37 mAP$_{50}$ on DOTA-v1.0, 5.30 mAP$_{50}$ on DOTA-v1.5, 2.30 mAP$_{50}$ on DOTA-v2.0 and 3.23 mAP$_{50}$ on DIOR-R.
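As a rough illustration of the token-fusion idea described in the abstract, here is a minimal PyTorch sketch. The module and tensor names (`ContextAwareEncoderLayer`, `spatial_bias`, `bias_gate`) are assumptions made for illustration, not the authors' released implementation:

```python
# Minimal sketch (assumed shapes/names, not the paper's released code):
# RoI proposals and CLIP segment features become tokens; a pairwise
# spatial-geometric bias is added to the attention logits, modulated
# by a learnable per-head gate, as the abstract describes.
import torch
import torch.nn as nn

class ContextAwareEncoderLayer(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.heads, self.scale = heads, (dim // heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        self.bias_gate = nn.Parameter(torch.ones(heads))  # adaptive modulation
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))

    def forward(self, tokens, spatial_bias):
        # tokens: (B, N, dim), concatenation of RoI tokens and CLIP Tokens
        # spatial_bias: (B, N, N), precomputed pairwise relation term
        B, N, D = tokens.shape
        qkv = self.qkv(self.norm1(tokens)).reshape(B, N, 3, self.heads, -1)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # (B, heads, N, d) each
        attn = (q @ k.transpose(-2, -1)) * self.scale   # (B, heads, N, N)
        attn = attn + self.bias_gate.view(1, -1, 1, 1) * spatial_bias.unsqueeze(1)
        out = (attn.softmax(-1) @ v).transpose(1, 2).reshape(B, N, D)
        x = tokens + self.proj(out)
        return x + self.ffn(self.norm2(x))

# Example: 80 RoI tokens plus 20 multi-level CLIP Tokens per image.
layer = ContextAwareEncoderLayer()
tokens = torch.randn(2, 100, 256)
bias = torch.randn(2, 100, 100)
out = layer(tokens, bias)  # (2, 100, 256)
```

The additive bias on the attention logits is one standard way to inject pairwise spatial relations into a Transformer; the paper's exact modulation and regularization scheme may differ.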
Related papers
- LP-DETR: Layer-wise Progressive Relations for Object Detection [4.632366780742503]
LP-DETR (Layer-wise Progressive DETR) is a novel approach that enhances DETR-based object detection through multi-scale relation modeling.
Our method introduces learnable spatial relationships between object queries through a relation-aware self-attention mechanism.
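A sketch of what such a learnable spatial-relation bias between object queries could look like; the parameterization and names (`box_relation_bias`, `rel_mlp`) are illustrative assumptions, not code from the LP-DETR paper:

```python
import torch
import torch.nn as nn

def box_relation_bias(boxes, rel_mlp):
    # boxes: (N, 4) as (cx, cy, w, h). Encodes pairwise geometry
    # (normalized center offsets, log size ratios) and maps it to a
    # learnable (N, N) bias that can be added to attention logits.
    cx, cy, w, h = boxes.unbind(-1)
    dx = (cx[:, None] - cx[None, :]) / w[:, None].clamp(min=1e-3)
    dy = (cy[:, None] - cy[None, :]) / h[:, None].clamp(min=1e-3)
    dw = torch.log(w[:, None].clamp(min=1e-3) / w[None, :].clamp(min=1e-3))
    dh = torch.log(h[:, None].clamp(min=1e-3) / h[None, :].clamp(min=1e-3))
    geom = torch.stack([dx, dy, dw, dh], dim=-1)  # (N, N, 4)
    return rel_mlp(geom).squeeze(-1)              # (N, N)

rel_mlp = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))
bias = box_relation_bias(torch.rand(10, 4), rel_mlp)  # (10, 10)
```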
arXiv Detail & Related papers (2025-02-07T18:25:28Z)
- OrientedFormer: An End-to-End Transformer-Based Oriented Object Detector in Remote Sensing Images [26.37802649901314]
Oriented object detection in remote sensing images is a challenging task because objects appear in arbitrary orientations.
We propose an end-to-end transformer-based oriented object detector consisting of three dedicated modules to address these issues.
Compared with previous end-to-end detectors, the OrientedFormer gains 1.16 and 1.21 AP$_{50}$ on DIOR-R and DOTA-v1.0 respectively, while reducing training epochs from 3$\times$ to 1$\times$.
arXiv Detail & Related papers (2024-09-29T10:36:33Z)
- Relation DETR: Exploring Explicit Position Relation Prior for Object Detection [26.03892270020559]
We present a scheme for enhancing the convergence and performance of DETR (DEtection TRansformer).
Our approach, termed Relation-DETR, introduces an encoder to construct position relation embeddings for progressive attention refinement.
Experiments on both generic and task-specific datasets demonstrate the effectiveness of our approach.
arXiv Detail & Related papers (2024-07-16T13:17:07Z)
- SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection [59.868772767818975]
We propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++.
Specifically, we observe that objects in aerial images usually have arbitrary orientations, small scales, and dense aggregation.
Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-07-01T07:03:51Z)
- MMScan: A Multi-Modal 3D Scene Dataset with Hierarchical Grounded Language Annotations [55.022519020409405]
This paper builds the largest-ever multi-modal 3D scene dataset and benchmark with hierarchical grounded language annotations, MMScan.
The resulting multi-modal 3D dataset encompasses 1.4M meta-annotated captions on 109k objects and 7.7k regions as well as over 3.04M diverse samples for 3D visual grounding and question-answering benchmarks.
arXiv Detail & Related papers (2024-06-13T17:59:30Z)
- Geometric-aware Pretraining for Vision-centric 3D Object Detection [77.7979088689944]
We propose a novel geometric-aware pretraining framework called GAPretrain.
GAPretrain serves as a plug-and-play solution that can be flexibly applied to multiple state-of-the-art detectors.
We achieve 46.2 mAP and 55.5 NDS on the nuScenes val set using the BEVFormer method, with a gain of 2.7 and 2.1 points, respectively.
arXiv Detail & Related papers (2023-04-06T14:33:05Z)
- Unified Visual Relationship Detection with Vision and Language Models [89.77838890788638]
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets.
We propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models.
Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model.
arXiv Detail & Related papers (2023-03-16T00:06:28Z)
- Ret3D: Rethinking Object Relations for Efficient 3D Object Detection in Driving Scenes [82.4186966781934]
We introduce a simple, efficient, and effective two-stage detector, termed Ret3D.
At the core of Ret3D is the utilization of novel intra-frame and inter-frame relation modules.
With negligible extra overhead, Ret3D achieves state-of-the-art performance.
arXiv Detail & Related papers (2022-08-18T03:48:58Z)
- Bridging the Gap between Object and Image-level Representations for Open-Vocabulary Detection [54.96069171726668]
Two popular forms of weak supervision used in open-vocabulary detection (OVD) are a pretrained CLIP model and image-level supervision.
We propose to address this problem by performing object-centric alignment of the language embeddings from the CLIP model.
We establish a bridge between the above two object-alignment strategies via a novel weight transfer function.
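For intuition, CLIP-based region-text alignment in OVD typically reduces to temperature-scaled cosine similarity between region embeddings and class-prompt text embeddings. A generic sketch of that alignment step (not this paper's specific weight transfer function):

```python
import torch
import torch.nn.functional as F

def region_text_logits(region_embeds, text_embeds, tau=0.01):
    # region_embeds: (R, D) detector region features projected into CLIP space
    # text_embeds:   (C, D) CLIP text embeddings of class prompts
    # Normalized dot products with temperature give per-region class logits.
    r = F.normalize(region_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    return r @ t.T / tau

logits = region_text_logits(torch.randn(300, 512), torch.randn(80, 512))
```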
arXiv Detail & Related papers (2022-07-07T17:59:56Z)