DescribeEarth: Describe Anything for Remote Sensing Images
- URL: http://arxiv.org/abs/2509.25654v1
- Date: Tue, 30 Sep 2025 01:53:34 GMT
- Title: DescribeEarth: Describe Anything for Remote Sensing Images
- Authors: Kaiyu Li, Zixuan Jiang, Xiangyong Cao, Jiayu Wang, Yuchen Xiao, Deyu Meng, Zhi Wang
- Abstract summary: We propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset with detailed descriptions of object attributes, relationships, and contexts. We also present DescribeEarth, a Multi-modal Large Language Model architecture explicitly designed for Geo-DLC.
- Score: 56.04533626223295
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Automated textual description of remote sensing images is crucial for unlocking their full potential in diverse applications, from environmental monitoring to urban planning and disaster management. However, existing studies in remote sensing image captioning primarily focus on the image level, lacking object-level fine-grained interpretation, which prevents the full utilization and transformation of the rich semantic and structural information contained in remote sensing images. To address this limitation, we propose Geo-DLC, a novel task of object-level fine-grained image captioning for remote sensing. To support this task, we construct DE-Dataset, a large-scale dataset containing 25 categories and 261,806 annotated instances with detailed descriptions of object attributes, relationships, and contexts. Furthermore, we introduce DE-Benchmark, an LLM-assisted, question-answering-based evaluation suite designed to systematically measure model capabilities on the Geo-DLC task. We also present DescribeEarth, a Multi-modal Large Language Model (MLLM) architecture explicitly designed for Geo-DLC, which integrates a scale-adaptive focal strategy and a domain-guided fusion module leveraging remote sensing vision-language model features to encode high-resolution details and remote sensing category priors while maintaining global context. Our DescribeEarth model consistently outperforms state-of-the-art general MLLMs on DE-Benchmark, demonstrating superior factual accuracy, descriptive richness, and grammatical soundness, particularly in capturing intrinsic object features and surrounding environmental attributes across simple, complex, and even out-of-distribution remote sensing scenarios. All data, code, and weights are released at https://github.com/earth-insights/DescribeEarth.
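The abstract names two architectural ideas: a scale-adaptive focal strategy and a domain-guided fusion module. Below is a minimal PyTorch sketch of how such components could look; the function and module names, dimensions, and the crop heuristic are illustrative assumptions, not the released DescribeEarth implementation.

```python
import torch
import torch.nn as nn

def scale_adaptive_crop(image: torch.Tensor, box: tuple, min_ratio: float = 0.1) -> torch.Tensor:
    """Crop a square focal window around `box`, enlarging the window for small
    objects so it never covers less than `min_ratio` of the image side.
    (Hypothetical heuristic; image is (C, H, W), box is (x0, y0, x1, y1).)"""
    _, H, W = image.shape
    x0, y0, x1, y1 = box
    side = max(x1 - x0, y1 - y0, int(min_ratio * min(H, W)))
    cx, cy = (x0 + x1) // 2, (y0 + y1) // 2
    xs = max(0, min(cx - side // 2, W - side))
    ys = max(0, min(cy - side // 2, H - side))
    return image[:, ys:ys + side, xs:xs + side]

class DomainGuidedFusion(nn.Module):
    """Cross-attend general vision tokens to remote sensing VLM tokens,
    injecting domain priors while a residual path keeps global context."""
    def __init__(self, dim: int = 1024, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis_tokens: torch.Tensor, rs_tokens: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(vis_tokens, rs_tokens, rs_tokens)  # (B, N, dim)
        return self.norm(vis_tokens + fused)
```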
Related papers
- ObjEmbed: Towards Universal Multimodal Object Embeddings [74.39703419628829]
We present ObjEmbed, a novel individual-object embedding model. It decomposes the input image into multiple regional embeddings, each corresponding to an individual object, along with global embeddings. It supports a wide range of visual understanding tasks such as visual retrieval, local image retrieval, and global image retrieval.
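As a hedged illustration of the regional-plus-global embedding layout the summary describes, the sketch below stores one global vector per image and one vector per object, and ranks a gallery by its best-matching region for local retrieval; the types and retrieval rule are assumptions, not the ObjEmbed API.

```python
from dataclasses import dataclass
import torch
import torch.nn.functional as F

@dataclass
class ImageEmbeddings:
    global_vec: torch.Tensor       # (D,) whole-image embedding
    regional_vecs: torch.Tensor    # (N_objects, D) one embedding per object

def local_retrieve(query: torch.Tensor, gallery: list[ImageEmbeddings], k: int = 5) -> torch.Tensor:
    """Rank gallery images by their best-matching object region."""
    scores = torch.stack([
        F.cosine_similarity(query[None, :], e.regional_vecs).max()
        for e in gallery
    ])
    return scores.topk(min(k, len(gallery))).indices
```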
arXiv Detail & Related papers (2026-02-02T07:38:45Z) - DenseWorld-1M: Towards Detailed Dense Grounded Caption in the Real World [68.39362698871503]
We present DenseWorld-1M, the first massive, detailed, dense grounded caption dataset in the real world. We design a three-stage labeling pipeline comprising open-world perception, detailed object caption generation, and dense caption merging. To accelerate the labeling process and improve caption quality, we present two VLM models: the Detailed Region Caption model and the Spatial Caption Merging model.
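The three labeling stages read naturally as a pipeline. A skeleton under stated assumptions (the stage functions are placeholders, not the released Detailed Region Caption or Spatial Caption Merging models):

```python
from typing import Any, Callable, Sequence

def dense_caption_pipeline(
    image: Any,
    perceive: Callable[[Any], Sequence[Any]],      # stage 1: open-world perception
    caption_region: Callable[[Any, Any], str],     # stage 2: detailed object captions
    merge: Callable[[Any, Sequence[tuple]], str],  # stage 3: dense caption merging
) -> str:
    regions = perceive(image)                                      # grounded region proposals
    region_caps = [(r, caption_region(image, r)) for r in regions]
    return merge(image, region_caps)                               # one dense grounded caption
```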
arXiv Detail & Related papers (2025-06-30T17:51:25Z) - SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model [61.97017867656831]
We introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates a mask of the target region. We construct and release the first large-scale benchmark dataset, called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
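For clarity, the task's input/output contract as the summary defines it could be typed roughly as follows; the names are assumptions, not SegEarth-R1's interface.

```python
from typing import Protocol
import numpy as np

class PixelReasoner(Protocol):
    def __call__(self, image: np.ndarray, query: str) -> np.ndarray:
        """Take an (H, W, 3) image and an implicit, reasoning-style query
        (e.g. 'where could floodwater pool first?') and return an (H, W)
        boolean mask of the target region."""
        ...
```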
arXiv Detail & Related papers (2025-04-13T16:36:47Z) - GeoRSMLLM: A Multimodal Large Language Model for Vision-Language Tasks in Geoscience and Remote Sensing [33.19843463374473]
Vision-Language Models (VLMs) in remote sensing have demonstrated significant potential in traditional tasks. Current models, which excel at Referring Expression Comprehension (REC), struggle with tasks involving complex instructions. We introduce the Remote Sensing Vision-Language Task Set (RSVLTS), which includes Open-Vocabulary Tasks (OVT), Referring Expression Tasks (RET), and Described Object Tasks (DOT). We propose a novel unified data representation using a set-of-points approach for RSVLTS, along with a condition and self-augmentation strategy based on cyclic referring.
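A hedged sketch of what a set-of-points representation could look like: boxes and masks both reduce to (x, y) coordinate arrays, so one output format serves OVT, RET, and DOT alike. The conversion rules below are assumptions, not the paper's exact scheme.

```python
import numpy as np

def box_to_points(box: tuple) -> np.ndarray:
    """Represent a bounding box by its four corner points."""
    x0, y0, x1, y1 = box
    return np.array([[x0, y0], [x1, y0], [x1, y1], [x0, y1]], dtype=float)

def mask_to_points(mask: np.ndarray, n: int = 16, seed: int = 0) -> np.ndarray:
    """Represent a binary mask by n randomly sampled foreground pixels."""
    ys, xs = np.nonzero(mask)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(xs), size=min(n, len(xs)), replace=False)
    return np.stack([xs[idx], ys[idx]], axis=1).astype(float)
```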
arXiv Detail & Related papers (2025-03-16T12:48:17Z) - Benchmarking Large Vision-Language Models via Directed Scene Graph for Comprehensive Image Captioning [77.2852342808769]
In this paper, we introduce a detailed caption benchmark, termed CompreCap, to evaluate visual context from a directed scene graph view. We first manually segment the image into semantically meaningful regions according to a common-object vocabulary, while also distinguishing the attributes of objects within all those regions. Then directional relation labels of these objects are annotated to compose a directed scene graph that can well encode the rich compositional information of the image.
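The directed scene graph the benchmark builds on can be pictured with a small data structure; the toy coverage metric at the end is an illustration only, not CompreCap's actual scoring.

```python
from dataclasses import dataclass, field

@dataclass
class RegionNode:
    name: str                            # entry from the common-object vocabulary
    attributes: list[str] = field(default_factory=list)

@dataclass
class DirectedSceneGraph:
    nodes: dict[str, RegionNode]         # region id -> node
    edges: list[tuple[str, str, str]]    # (subject id, relation, object id)

def caption_coverage(graph: DirectedSceneGraph, caption: str) -> float:
    """Toy metric: fraction of node names mentioned in the caption."""
    text = caption.lower()
    hits = sum(node.name.lower() in text for node in graph.nodes.values())
    return hits / max(len(graph.nodes), 1)
```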
arXiv Detail & Related papers (2024-12-11T18:37:42Z) - ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models [26.583783910846723]
ChatEarthNet is a large-scale image-text dataset characterized by global coverage, high quality, wide-ranging diversity, and detailed descriptions.
ChatEarthNet consists of 163,488 image-text pairs with captions generated by ChatGPT-3.5 and an additional 10,000 image-text pairs with captions generated by ChatGPT-4V(ision).
arXiv Detail & Related papers (2024-02-17T16:38:40Z) - MetaSegNet: Metadata-collaborative Vision-Language Representation Learning for Semantic Segmentation of Remote Sensing Images [7.0622873873577054]
We propose a novel metadata-collaborative segmentation network (MetaSegNet) for semantic segmentation of remote sensing images.
Unlike common model structures that use only unimodal visual data, we extract key characteristics from freely available remote sensing image metadata.
We construct an image encoder, a text encoder, and a cross-modal attention fusion subnetwork to extract image and text features.
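One way to picture the metadata-collaborative idea is converting freely available metadata into a text prompt for the text encoder; the template and fields below are assumptions, not MetaSegNet's actual formatting.

```python
def metadata_to_prompt(lat: float, lon: float, month: int, gsd_m: float) -> str:
    """Render (hypothetical) image metadata as a sentence a text encoder can embed."""
    hemisphere = "northern" if lat >= 0 else "southern"
    return (
        f"A remote sensing image from the {hemisphere} hemisphere "
        f"at ({lat:.2f}, {lon:.2f}), acquired in month {month}, "
        f"with a ground sampling distance of {gsd_m:.1f} m."
    )

# Example: metadata_to_prompt(48.14, 11.58, 7, 0.3)
```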
arXiv Detail & Related papers (2023-12-20T03:16:34Z) - CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding [38.53988682814626]
We propose a context-enhanced masked image modeling method (CtxMIM) for remote sensing image understanding.
CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches.
With this simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset.
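Below is a generic, single-branch sketch of the masked-image-modeling recipe the summary outlines (original patches as the reconstruction target, context encoded from the other set of patches); CtxMIM's Siamese pairing and design details are omitted, and all sizes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMIM(nn.Module):
    """Toy masked-patch reconstruction with learned mask queries."""
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decoder = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches: torch.Tensor, mask_ratio: float = 0.6) -> torch.Tensor:
        B, N, D = patches.shape
        perm = torch.randperm(N, device=patches.device)
        n_mask = int(N * mask_ratio)
        m_idx, v_idx = perm[:n_mask], perm[n_mask:]
        ctx = self.encoder(patches[:, v_idx])            # encode visible patches
        queries = self.mask_token.expand(B, n_mask, D)   # one query per masked slot
        pred, _ = self.decoder(queries, ctx, ctx)        # reconstruct from context
        return F.mse_loss(pred, patches[:, m_idx])       # originals as the template
```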
arXiv Detail & Related papers (2023-09-28T18:04:43Z) - SSMG: Spatial-Semantic Map Guided Diffusion Model for Free-form Layout-to-Image Generation [68.42476385214785]
We propose a novel Spatial-Semantic Map Guided (SSMG) diffusion model that adopts the feature map, derived from the layout, as guidance.
SSMG achieves superior generation quality with sufficient spatial and semantic controllability compared to previous works.
We also propose the Relation-Sensitive Attention (RSA) and Location-Sensitive Attention (LSA) mechanisms.
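A hedged sketch of the core guidance idea: rasterize the free-form layout into a spatial-semantic map by stamping each phrase embedding into the pixels its box covers, yielding a (C, H, W) tensor a diffusion U-Net could condition on. The averaging rule and shapes are assumptions; RSA and LSA are not reproduced here.

```python
import torch

def layout_to_ss_map(boxes, phrase_embs: torch.Tensor, H: int = 64, W: int = 64) -> torch.Tensor:
    """boxes: list of (x0, y0, x1, y1) in [0, 1]; phrase_embs: (N, C)."""
    C = phrase_embs.shape[1]
    ss_map = torch.zeros(C, H, W)
    count = torch.zeros(1, H, W)
    for (x0, y0, x1, y1), emb in zip(boxes, phrase_embs):
        xs, xe = int(x0 * W), max(int(x1 * W), int(x0 * W) + 1)
        ys, ye = int(y0 * H), max(int(y1 * H), int(y0 * H) + 1)
        ss_map[:, ys:ye, xs:xe] += emb[:, None, None]    # stamp the phrase embedding
        count[:, ys:ye, xs:xe] += 1
    return ss_map / count.clamp(min=1)                   # average overlapping boxes
```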
arXiv Detail & Related papers (2023-08-20T04:09:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.