AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
- URL: http://arxiv.org/abs/2504.07836v2
- Date: Fri, 11 Apr 2025 01:47:14 GMT
- Title: AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
- Authors: Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, Bin Zhao
- Abstract summary: Visual grounding aims to localize target objects in an image based on natural language descriptions. AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects. We introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects.
- Score: 42.75895237875992
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Visual grounding (VG) aims to localize target objects in an image based on natural language descriptions. In this paper, we propose AerialVG, a new task focusing on visual grounding from aerial views. Compared to traditional VG, AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects, and positional relations should be emphasized. Besides, existing VG models struggle when applied to aerial imagery, where high-resolution images cause significant difficulties. To address these challenges, we introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects. In particular, each annotation in the AerialVG dataset contains multiple target objects annotated with relative spatial relations, requiring models to perform comprehensive spatial reasoning. Furthermore, we propose an innovative model designed specifically for the AerialVG task, where a Hierarchical Cross-Attention is devised to focus on target regions, and a Relation-Aware Grounding module is designed to infer positional relations. Experimental results validate the effectiveness of our dataset and method, highlighting the importance of spatial reasoning in aerial visual grounding. The code and dataset will be released.
Related papers
- A Deep Learning Framework with Geographic Information Adaptive Loss for Remote Sensing Images based UAV Self-Positioning [10.16507150219648]
Self-positioning of UAVs in GPS-denied environments has become a critical objective. We present a deep learning framework with geographic information adaptive loss to achieve precise localization. Results demonstrate the method's efficacy in enabling UAVs to achieve precise self-positioning.
arXiv Detail & Related papers (2025-02-22T09:36:34Z) - Style Alignment based Dynamic Observation Method for UAV-View Geo-localization [7.185123213523453]
We propose a style alignment based dynamic observation method for UAV-view geo-localization.
Specifically, we introduce a style alignment strategy to transform the diverse visual styles of drone-view images into a unified satellite-image visual style.
A dynamic observation module is designed to evaluate the spatial distribution of images by mimicking human observation habits.
arXiv Detail & Related papers (2024-07-03T06:19:42Z) - SOOD++: Leveraging Unlabeled Data to Boost Oriented Object Detection [59.868772767818975]
We propose a simple yet effective Semi-supervised Oriented Object Detection method termed SOOD++.
Specifically, we observe that objects in aerial images usually have arbitrary orientations and small scales, and tend to aggregate.
Extensive experiments conducted on various multi-oriented object datasets under various labeled settings demonstrate the effectiveness of our method.
arXiv Detail & Related papers (2024-07-01T07:03:51Z) - GOMAA-Geo: GOal Modality Agnostic Active Geo-localization [49.599465495973654]
We consider the task of active geo-localization (AGL) in which an agent uses a sequence of visual cues observed during aerial navigation to find a target specified through multiple possible modalities.
GOMAA-Geo is a goal modality agnostic active geo-localization agent for zero-shot generalization between different goal modalities.
arXiv Detail & Related papers (2024-06-04T02:59:36Z) - TK-Planes: Tiered K-Planes with High Dimensional Feature Vectors for Dynamic UAV-based Scenes [58.180556221044235]
We present a new approach to bridge the domain gap between synthetic and real-world data for unmanned aerial vehicle (UAV)-based perception.
Our formulation is designed for dynamic scenes, consisting of small moving objects or human actions.
We evaluate its performance on challenging datasets, including Okutama Action and UG2.
arXiv Detail & Related papers (2024-05-04T21:55:33Z) - EarthVQA: Towards Queryable Earth via Relational Reasoning-Based Remote Sensing Visual Question Answering [11.37120215795946]
We develop a multi-modal multi-task VQA dataset (EarthVQA) to advance relational reasoning-based judging, counting, and comprehensive analysis.
The EarthVQA dataset contains 6000 images, corresponding semantic masks, and 208,593 QA pairs with urban and rural governance requirements embedded.
We propose a Semantic OBject Awareness framework (SOBA) to advance VQA in an object-centric way.
arXiv Detail & Related papers (2023-12-19T15:11:32Z) - Multiview Aerial Visual Recognition (MAVREC): Can Multi-view Improve Aerial Visual Perception? [57.77643186237265]
We present Multiview Aerial Visual RECognition or MAVREC, a video dataset where we record synchronized scenes from different perspectives.
MAVREC consists of around 2.5 hours of industry-standard 2.7K resolution video sequences, more than 0.5 million frames, and 1.1 million annotated bounding boxes.
This makes MAVREC the largest ground and aerial-view dataset, and the fourth largest among all drone-based datasets.
arXiv Detail & Related papers (2023-12-07T18:59:14Z) - Innovative Horizons in Aerial Imagery: LSKNet Meets DiffusionDet for Advanced Object Detection [55.2480439325792]
We present an in-depth evaluation of an object detection model that integrates the LSKNet backbone with the DiffusionDet head.
The proposed model achieves a mean average precision (MAP) of approximately 45.7%, which is a significant improvement.
This advancement underscores the effectiveness of the proposed modifications and sets a new benchmark in aerial image analysis.
arXiv Detail & Related papers (2023-11-21T19:49:13Z) - Progressive Domain Adaptation with Contrastive Learning for Object Detection in the Satellite Imagery [0.0]
State-of-the-art object detection methods largely fail to identify small and dense objects.
We propose a small object detection pipeline that improves the feature extraction process.
We show we can alleviate the degradation of object identification in previously unseen datasets.
arXiv Detail & Related papers (2022-09-06T15:16:35Z) - Co-visual pattern augmented generative transformer learning for automobile geo-localization [12.449657263683337]
Cross-view geo-localization (CVGL) aims to estimate the geographical location of the ground-level camera by matching against enormous geo-tagged aerial images.
We present a novel approach using cross-view knowledge generative techniques in combination with transformers, namely mutual generative transformer learning (MGTL) for CVGL.
arXiv Detail & Related papers (2022-03-17T07:29:02Z) - Suspected Object Matters: Rethinking Model's Prediction for One-stage Visual Grounding [93.82542533426766]
We propose a Suspected Object Transformation mechanism (SOT) to encourage the target object selection among the suspected ones.
SOT can be seamlessly integrated into existing CNN and Transformer-based one-stage visual grounders.
Extensive experiments demonstrate the effectiveness of our proposed method.
arXiv Detail & Related papers (2022-03-10T06:41:07Z) - Vision-Based UAV Self-Positioning in Low-Altitude Urban Environments [20.69412701553767]
Unmanned Aerial Vehicles (UAVs) rely on satellite systems for stable positioning.
In such situations, vision-based techniques can serve as an alternative, ensuring the self-positioning capability of UAVs.
This paper presents a new dataset, DenseUAV, which is the first publicly available dataset designed for the UAV self-positioning task.
arXiv Detail & Related papers (2022-01-23T07:18:55Z) - Salient Objects in Clutter [130.63976772770368]
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.