SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
- URL: http://arxiv.org/abs/2504.09644v1
- Date: Sun, 13 Apr 2025 16:36:47 GMT
- Title: SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model
- Authors: Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, Xiangyong Cao
- Abstract summary: We introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. We construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods.
- Score: 61.97017867656831
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Remote sensing has become critical for understanding environmental dynamics, urban planning, and disaster management. However, traditional remote sensing workflows often rely on explicit segmentation or detection methods, which struggle to handle complex, implicit queries that require reasoning over spatial context, domain knowledge, and implicit user intent. Motivated by this, we introduce a new task, i.e., geospatial pixel reasoning, which allows implicit querying and reasoning and generates the mask of the target region. To advance this task, we construct and release the first large-scale benchmark dataset called EarthReason, which comprises 5,434 manually annotated image masks with over 30,000 implicit question-answer pairs. Moreover, we propose SegEarth-R1, a simple yet effective language-guided segmentation baseline that integrates a hierarchical visual encoder, a large language model (LLM) for instruction parsing, and a tailored mask generator for spatial correlation. The design of SegEarth-R1 incorporates domain-specific adaptations, including aggressive visual token compression to handle ultra-high-resolution remote sensing images, a description projection module to fuse language and multi-scale features, and a streamlined mask prediction pipeline that directly queries description embeddings. Extensive experiments demonstrate that SegEarth-R1 achieves state-of-the-art performance on both reasoning and referring segmentation tasks, significantly outperforming traditional and LLM-based segmentation methods. Our data and code will be released at https://github.com/earth-insights/SegEarth-R1.
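As a rough illustration of the architecture the abstract describes, the sketch below wires together a visual token compressor, a description projection module, and a mask generator that directly queries description embeddings. All class names, shapes, and the pooling/scoring choices are illustrative assumptions, not the released SegEarth-R1 implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TokenCompressor(nn.Module):
    """Aggressive visual token compression (assumed strided convolution)
    so ultra-high-resolution imagery fits the LLM context."""
    def __init__(self, dim, stride=4):
        super().__init__()
        self.pool = nn.Conv2d(dim, dim, kernel_size=stride, stride=stride)
    def forward(self, feat):                   # (B, C, H, W)
        return self.pool(feat)                 # (B, C, H/stride, W/stride)

class DescriptionProjector(nn.Module):
    """Projects the LLM's description embedding into the visual feature
    space so it can be fused with multi-scale features."""
    def __init__(self, llm_dim, vis_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(llm_dim, vis_dim), nn.GELU(),
                                  nn.Linear(vis_dim, vis_dim))
    def forward(self, desc_emb):               # (B, llm_dim)
        return self.proj(desc_emb)             # (B, vis_dim)

class MaskGenerator(nn.Module):
    """Predicts a mask by directly querying multi-scale features with the
    projected description embedding (per-pixel dot product)."""
    def forward(self, feats, query):           # feats: list of (B, C, H, W)
        out_size = feats[0].shape[-2:]
        logits = [F.interpolate(
                      torch.einsum('bchw,bc->bhw', f, query).unsqueeze(1),
                      size=out_size, mode='bilinear', align_corners=False)
                  for f in feats]
        return torch.stack(logits).mean(0)     # (B, 1, H, W)

# Dummy forward pass: one high-res and one low-res feature map.
feats = [torch.randn(1, 256, 64, 64), torch.randn(1, 256, 32, 32)]
query = DescriptionProjector(llm_dim=4096, vis_dim=256)(torch.randn(1, 4096))
mask_logits = MaskGenerator()(feats, query)    # -> (1, 1, 64, 64)
```

The design choice mirrored here is the streamlined mask head: rather than a heavy decoder, each multi-scale feature map is scored against a single projected description embedding and the per-scale logits are averaged.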
Related papers
- SegDesicNet: Lightweight Semantic Segmentation in Remote Sensing with Geo-Coordinate Embeddings for Domain Adaptation [0.5461938536945723]
We propose a novel unsupervised domain adaptation technique for remote sensing semantic segmentation.
Our proposed SegDesicNet module regresses the GRID positional encoding of the geo-coordinates projected over the unit sphere to obtain the domain loss.
Our algorithm seeks to reduce the modeling disparity between artificial neural networks and human comprehension of the physical world.
arXiv Detail & Related papers (2025-03-11T11:01:18Z)
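As a minimal sketch of the geo-coordinate idea above: project (lat, lon) onto the unit sphere, then expand the result into a sinusoidal encoding that a small head could regress as a domain-alignment signal. The function names and the frequency ladder are assumptions, not SegDesicNet's exact GRID formulation.

```python
import numpy as np

def unit_sphere(lat_deg, lon_deg):
    """Project (lat, lon) in degrees onto a point on the unit sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

def grid_encoding(xyz, num_freqs=4):
    """Sinusoidal positional encoding of the sphere point."""
    freqs = 2.0 ** np.arange(num_freqs)      # geometric frequency ladder
    angles = np.outer(freqs, xyz).ravel()    # (num_freqs * 3,)
    return np.concatenate([np.sin(angles), np.cos(angles)])

print(grid_encoding(unit_sphere(48.85, 2.35)).shape)  # (24,)
```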
- Think Before You Segment: High-Quality Reasoning Segmentation with GPT Chain of Thoughts [64.93416171745693]
ThinkFirst is a training-free reasoning segmentation framework.
Our approach allows GPT-4o or other powerful MLLMs to generate a detailed, chain-of-thought description of an image.
This summarized description is then passed to a language-instructed segmentation assistant to aid the segmentation process.
arXiv Detail & Related papers (2025-03-10T16:26:11Z)
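The two-stage, training-free pipeline summarized above is essentially orchestration; in the hypothetical sketch below, `describe_image` and `segment_with_text` stand in for the MLLM (e.g. GPT-4o) and the language-instructed segmentation assistant.

```python
def reasoning_segmentation(image, query, describe_image, segment_with_text):
    """Training-free two-stage pipeline: think first, then segment."""
    # Stage 1: ask the MLLM for a chain-of-thought description of the
    # scene, conditioned on the implicit query.
    cot_prompt = ("Think step by step about this image, then describe "
                  f"the region that answers: {query}")
    description = describe_image(image, cot_prompt)
    # Stage 2: hand the distilled description to a language-instructed
    # segmentation assistant, which grounds it into a pixel mask.
    return segment_with_text(image, description)
```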
- RSRefSeg: Referring Remote Sensing Image Segmentation with Foundation Models [24.67117013862316]
Referring remote sensing image segmentation is crucial for achieving fine-grained visual understanding.
We introduce a referring remote sensing image segmentation foundational model, RSRefSeg.
Experimental results on the RRSIS-D dataset demonstrate that RSRefSeg outperforms existing methods.
arXiv Detail & Related papers (2025-01-12T13:22:35Z)
- Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression.
We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges.
Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z)
- RG-SAN: Rule-Guided Spatial Awareness Network for End-to-End 3D Referring Expression Segmentation [72.95147072227998]
3D Referring Expression Segmentation aims to segment 3D objects by correlating referring expressions with point clouds.
Traditional approaches frequently encounter issues like over-segmentation or mis-segmentation due to insufficient emphasis on the spatial information of instances.
We introduce a Rule-Guided Spatial Awareness Network (RG-SAN), which uses solely the spatial information of the target instance for supervision.
arXiv Detail & Related papers (2024-12-03T11:50:16Z)
- GLaMM: Pixel Grounding Large Multimodal Model [57.91763410032292]
We present Grounding LMM (GLaMM), the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks.
GLaMM is flexible enough to accept both textual and optional visual prompts (region of interest) as input.
Our proposed Grounded Conversation Generation (GCG) task requires densely grounded concepts in natural scenes at a large scale.
arXiv Detail & Related papers (2023-11-06T18:59:57Z)
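Models in this family typically interleave text and masks by emitting special segmentation tokens and decoding the LLM hidden state at each such position into a mask query (the LISA-style mechanism). The sketch below assumes that mechanism and a hypothetical token id; it is not GLaMM's exact implementation.

```python
import torch
import torch.nn as nn

SEG_TOKEN_ID = 32003  # hypothetical id of an added <SEG> vocabulary token

class SegProjector(nn.Module):
    """Maps LLM hidden states at <SEG> positions to mask-decoder queries."""
    def __init__(self, llm_dim=4096, query_dim=256):
        super().__init__()
        self.proj = nn.Linear(llm_dim, query_dim)

    def forward(self, token_ids, hidden_states):
        # token_ids: (T,), hidden_states: (T, llm_dim)
        seg_pos = (token_ids == SEG_TOKEN_ID).nonzero(as_tuple=True)[0]
        return self.proj(hidden_states[seg_pos])  # (num_segs, query_dim)
```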
- RRSIS: Referring Remote Sensing Image Segmentation [25.538406069768662]
Localizing desired objects from remote sensing images is of great use in practical applications.
Referring image segmentation, which aims at segmenting out the objects to which a given expression refers, has been extensively studied in natural images.
We introduce referring remote sensing image segmentation (RRSIS) to fill this gap and present some insightful explorations.
arXiv Detail & Related papers (2023-06-14T16:40:19Z)
- Zero-shot Referring Image Segmentation with Global-Local Context Features [8.77461711080319]
Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image.
We propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP.
In our experiments, the proposed method outperforms several zero-shot baselines for the task, and even a weakly supervised referring expression segmentation method, by substantial margins.
arXiv Detail & Related papers (2023-03-31T06:00:50Z)
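In the same spirit (though not necessarily this paper's exact method), a zero-shot baseline can score class-agnostic mask proposals with CLIP by combining a local similarity on the masked crop with a global similarity on the full image with its background dimmed. The weighting and preprocessing choices below are assumptions.

```python
import numpy as np
import torch
import clip                      # OpenAI CLIP package
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cpu")

@torch.no_grad()
def pick_mask(image: Image.Image, masks, expression: str, alpha=0.5):
    """Return the index of the mask proposal best matching the expression."""
    txt = model.encode_text(clip.tokenize([expression]))
    txt = txt / txt.norm(dim=-1, keepdim=True)
    pixels = np.asarray(image).astype(np.float32)
    scores = []
    for m in masks:              # each m: (H, W) boolean array
        ys, xs = np.where(m)
        # Local view: tight crop around the masked region.
        crop = image.crop((xs.min(), ys.min(), xs.max() + 1, ys.max() + 1))
        # Global view: full image with the background dimmed.
        dimmed = Image.fromarray(
            np.uint8(np.where(m[..., None], pixels, 0.3 * pixels)))
        views = torch.stack([preprocess(crop), preprocess(dimmed)])
        emb = model.encode_image(views)
        emb = emb / emb.norm(dim=-1, keepdim=True)
        local, glob = (emb @ txt.T).squeeze(-1)
        scores.append(alpha * local + (1 - alpha) * glob)
    return int(torch.stack(scores).argmax())
```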
- Fully and Weakly Supervised Referring Expression Segmentation with End-to-End Learning [50.40482222266927]
Referring Expression Segmentation (RES) aims at localizing and segmenting the target according to the given language expression.
We propose a parallel position-kernel-segmentation pipeline that better isolates the localization and segmentation steps and then lets them interact.
Our method is simple but surprisingly effective, outperforming all previous state-of-the-art RES methods in both fully and weakly supervised settings.
arXiv Detail & Related papers (2022-12-17T08:29:33Z)
- AF$_2$: Adaptive Focus Framework for Aerial Imagery Segmentation [86.44683367028914]
Aerial imagery segmentation has some unique challenges, the most critical of which is foreground-background imbalance.
We propose the Adaptive Focus Framework (AF$_2$), which adopts a hierarchical segmentation procedure and focuses on adaptively utilizing multi-scale representations.
AF$_2$ significantly improves accuracy on three widely used aerial benchmarks while remaining as fast as mainstream methods.
arXiv Detail & Related papers (2022-02-18T10:14:45Z)