Generalized Referring Expression Segmentation on Aerial Photos
- URL: http://arxiv.org/abs/2512.07338v1
- Date: Mon, 08 Dec 2025 09:25:59 GMT
- Title: Generalized Referring Expression Segmentation on Aerial Photos
- Authors: Luís Marnoto, Alexandre Bernardino, Bruno Martins
- Abstract summary: This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery. It comprises 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning individual object instances, groups of instances, and semantic regions. We adopted the RSRefSeg architecture and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images.
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Referring expression segmentation is a fundamental task in computer vision that integrates natural language understanding with precise visual localization of target regions. Aerial imagery (e.g., modern aerial photos collected through drones, historical photos from aerial archives, high-resolution satellite imagery, etc.) presents unique challenges: spatial resolution varies widely across datasets, the use of color is not consistent, targets often shrink to only a few pixels, and scenes contain very high object densities and partially occluded objects. This work presents Aerial-D, a new large-scale referring expression segmentation dataset for aerial imagery, comprising 37,288 images with 1,522,523 referring expressions that cover 259,709 annotated targets, spanning individual object instances, groups of instances, and semantic regions across 21 distinct classes that range from vehicles and infrastructure to land-coverage types. The dataset was constructed through a fully automatic pipeline that combines systematic rule-based expression generation with a Large Language Model (LLM) enhancement procedure that enriched both the linguistic variety and the focus on visual detail within the referring expressions. Filters were additionally applied to simulate historic imaging conditions for each scene. We adopted the RSRefSeg architecture and trained models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historical images. Results show that the combined training achieves competitive performance on contemporary benchmarks while maintaining strong accuracy under the monochrome, sepia, and grainy degradations that appear in archival aerial photography. The dataset, trained models, and complete software pipeline are publicly available at https://luispl77.github.io/aerial-d .
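The archival-style degradations the abstract mentions (monochrome, sepia, and film grain) can be approximated with elementary image operations. The following is a minimal, hypothetical sketch of such filters, not the authors' actual pipeline; it assumes an RGB image stored as a float array with values in [0, 1], and the function name `simulate_archival` is illustrative:

```python
import numpy as np

def simulate_archival(img, mode="sepia", noise_std=0.05, seed=0):
    """Apply a simple archival-style degradation to an RGB image in [0, 1].

    Hypothetical helper illustrating the kinds of filters described in the
    abstract (monochrome, sepia, grain); not the dataset's real code.
    """
    rng = np.random.default_rng(seed)
    # Luminance via ITU-R BT.601 weights gives the monochrome base.
    gray = img @ np.array([0.299, 0.587, 0.114])
    if mode == "monochrome":
        out = np.stack([gray] * 3, axis=-1)
    elif mode == "sepia":
        # Tint the luminance channel with a warm brown tone.
        tint = np.array([1.0, 0.85, 0.6])
        out = gray[..., None] * tint
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Film grain: additive Gaussian noise, then clip back into range.
    out = out + rng.normal(0.0, noise_std, out.shape)
    return np.clip(out, 0.0, 1.0)
```

Because the segmentation masks are unaffected by such photometric filters, degraded copies of each scene can reuse the original annotations, which is what makes this style of augmentation cheap to apply at dataset scale.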
Related papers
- Cross-View Open-Vocabulary Object Detection in Aerial Imagery
We propose a novel framework for adapting open-vocabulary representations from ground-view images to solve object detection in aerial imagery. The method introduces contrastive image-to-image alignment to enhance the similarity between aerial and ground-view embeddings. Our open-vocabulary model achieves improvements of +6.32 mAP on DOTAv2, +4.16 mAP on VisDrone (Images), and +3.46 mAP on HRRSD in the zero-shot setting.
arXiv Detail & Related papers (2025-10-04T16:12:03Z) - DescribeEarth: Describe Anything for Remote Sensing Images
We propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset with detailed descriptions of object attributes, relationships, and contexts. We also present DescribeEarth, a Multi-modal Large Language Model architecture explicitly designed for Geo-DLC.
arXiv Detail & Related papers (2025-09-30T01:53:34Z) - AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations
Visual grounding aims to localize target objects in an image based on natural language descriptions. AerialVG poses new challenges, e.g., appearance-based grounding is insufficient to distinguish among multiple visually similar objects. We introduce the first AerialVG dataset, consisting of 5K real-world aerial images, 50K manually annotated descriptions, and 103K objects.
arXiv Detail & Related papers (2025-04-10T15:13:00Z) - A Recipe for Improving Remote Sensing VLM Zero Shot Generalization
We present two novel image-caption datasets for training remote sensing foundation models. The first dataset pairs aerial and satellite imagery with captions generated by Gemini using landmarks extracted from Google Maps. The second dataset utilizes public web images and their corresponding alt-text, filtered for the remote sensing domain.
arXiv Detail & Related papers (2025-03-10T21:09:02Z) - AeroReformer: Aerial Referring Transformer for UAV-based Referring Image Segmentation
We propose a novel framework for UAV referring image segmentation (UAV-RIS). AeroReformer features a Vision-Language Cross-Attention Module (VLCAM) for effective cross-modal understanding and a Rotation-Aware Multi-Scale Fusion decoder. Experiments on two newly developed datasets demonstrate the superiority of AeroReformer over existing methods.
arXiv Detail & Related papers (2025-02-23T18:49:00Z) - GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
We introduce GAIA, a novel dataset for multi-scale, multi-sensor, and multi-modal Remote Sensing (RS) image analysis. GAIA comprises 205,150 meticulously curated RS image-text pairs, representing a diverse range of RS modalities associated with different spatial resolutions. GAIA significantly improves performance on RS image classification, cross-modal retrieval, and image captioning tasks.
arXiv Detail & Related papers (2025-02-13T18:52:14Z) - Salient Objects in Clutter
This paper identifies and addresses a serious design bias of existing salient object detection (SOD) datasets.
This design bias has led to a saturation in performance for state-of-the-art SOD models when evaluated on existing datasets.
We propose a new high-quality dataset and update the previous saliency benchmark.
arXiv Detail & Related papers (2021-05-07T03:49:26Z) - PhraseCut: Language-based Image Segmentation in the Wild
We consider the problem of segmenting image regions given a natural language phrase.
Our dataset is collected on top of the Visual Genome dataset.
Our experiments show that the scale and diversity of concepts in our dataset pose significant challenges to the existing state of the art.
arXiv Detail & Related papers (2020-08-03T20:58:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.