Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation
- URL: http://arxiv.org/abs/2602.08206v1
- Date: Mon, 09 Feb 2026 02:09:21 GMT
- Title: Geospatial-Reasoning-Driven Vocabulary-Agnostic Remote Sensing Semantic Segmentation
- Authors: Chufeng Zhou, Jian Wang, Xinyuan Liu, Xiaokang Zhang,
- Abstract summary: Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing.
We propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework to guide open-vocabulary segmentation models toward precise mapping.
- Score: 13.743073097114461
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Open-vocabulary semantic segmentation has emerged as a promising research direction in remote sensing, enabling the recognition of diverse land-cover types beyond pre-defined category sets. However, existing methods predominantly rely on the passive mapping of visual features and textual embeddings. This "appearance-based" paradigm lacks geospatial contextual awareness, leading to severe semantic ambiguity and misclassification when encountering land-cover classes with similar spectral features but distinct semantic attributes. To address this, we propose a Geospatial Reasoning Chain-of-Thought (GR-CoT) framework designed to enhance the scene understanding capabilities of Multimodal Large Language Models (MLLMs), thereby guiding open-vocabulary segmentation models toward precise mapping. The framework comprises two collaborative components: an offline knowledge distillation stream and an online instance reasoning stream. The offline stream establishes fine-grained category interpretation standards to resolve semantic conflicts between similar land-cover types. During online inference, the framework executes a sequential reasoning process involving macro-scenario anchoring, visual feature decoupling, and knowledge-driven decision synthesis. This process generates an image-adaptive vocabulary that guides downstream models to achieve pixel-level alignment with correct geographical semantics. Extensive experiments on the LoveDA and GID5 benchmarks demonstrate the superiority of our approach.
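The abstract's online reasoning stream (macro-scenario anchoring, then visual feature decoupling, then knowledge-driven decision synthesis, yielding an image-adaptive vocabulary) can be sketched as a toy pipeline. Everything below is an illustrative assumption: the function names, the rule table, and the tag vocabulary are hypothetical stand-ins, not the paper's actual MLLM-based implementation.

```python
# Hypothetical sketch of the GR-CoT online reasoning stream.
# Stand-in for distilled category-interpretation knowledge
# (the offline stream's output): (scenario, ambiguous cue) -> preferred class.
KNOWLEDGE = {
    ("urban", "low_vegetation"): "grass",
    ("rural", "low_vegetation"): "cropland",
}

def anchor_macro_scenario(scene_tags):
    """Step 1: anchor the macro scenario from coarse scene cues."""
    return "urban" if "building" in scene_tags else "rural"

def decouple_visual_features(scene_tags):
    """Step 2: isolate cues whose appearance is ambiguous across classes,
    e.g. low vegetation and cropland share similar spectral features."""
    ambiguous_set = {"low_vegetation", "cropland", "water"}
    return [t for t in scene_tags if t in ambiguous_set]

def synthesize_decisions(scenario, ambiguous_cues):
    """Step 3: resolve each ambiguous cue against the distilled knowledge."""
    return [KNOWLEDGE.get((scenario, cue), cue) for cue in ambiguous_cues]

def image_adaptive_vocabulary(scene_tags):
    """Run the full chain and emit the vocabulary for the downstream
    open-vocabulary segmentation model."""
    scenario = anchor_macro_scenario(scene_tags)
    ambiguous = decouple_visual_features(scene_tags)
    return sorted(set(synthesize_decisions(scenario, ambiguous)))

print(image_adaptive_vocabulary(["building", "low_vegetation", "water"]))
# -> ['grass', 'water']
```

The point of the sketch is only the control flow: the same visual cue ("low_vegetation") resolves to different class names depending on the anchored scenario, which is the semantic-ambiguity problem the framework targets.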
Related papers
- Characterizing Human Semantic Navigation in Concept Production as Trajectories in Embedding Space [0.0]
We introduce a framework that represents concept production as navigation through embedding space.
We construct participant-specific semantic trajectories based on cumulative embeddings and extract geometric and dynamical metrics.
We evaluate the framework on four datasets across different languages, spanning different property generation tasks.
arXiv Detail & Related papers (2026-02-05T18:23:04Z)
- Mind-the-Glitch: Visual Correspondence for Detecting Inconsistencies in Subject-Driven Generation [120.23172120151821]
We propose a novel approach for disentangling visual and semantic features from the backbones of pre-trained diffusion models.
We introduce an automated pipeline that constructs image pairs with annotated semantic and visual correspondences.
We propose a new metric, Visual Semantic Matching, that quantifies visual inconsistencies in subject-driven image generation.
arXiv Detail & Related papers (2025-09-26T07:11:55Z)
- GS: Generative Segmentation via Label Diffusion [59.380173266566715]
Language-driven image segmentation is a fundamental task in vision-language understanding, requiring models to segment regions of an image corresponding to natural language expressions.
Recent diffusion models have been introduced to this domain, but existing approaches remain image-centric.
We propose GS (Generative Label), a novel framework that formulates segmentation itself as a generative task.
Experimental results show that GS significantly outperforms existing discriminative and diffusion-based methods, setting a new state-of-the-art for language-driven segmentation.
arXiv Detail & Related papers (2025-08-27T16:28:15Z)
- Constrained Prompt Enhancement for Improving Zero-Shot Generalization of Vision-Language Models [57.357091028792325]
Vision-language models (VLMs) pre-trained on web-scale data exhibit promising zero-shot generalization but often suffer from semantic misalignment.
We propose a novel constrained prompt enhancement (CPE) method to improve visual-textual alignment.
Our approach consists of two key components: Topology-Guided Synonymous Semantic Generation (TGSSG) and Category-Agnostic Discriminative Region Selection (CADRS).
arXiv Detail & Related papers (2025-08-24T15:45:22Z)
- HyperPath: Knowledge-Guided Hyperbolic Semantic Hierarchy Modeling for WSI Analysis [21.380034877048644]
We propose HyperPath, a novel method that integrates knowledge from textual descriptions to guide the modeling of semantic hierarchies in hyperbolic space.
Our approach adapts both visual and textual features extracted by pathology vision-language foundation models to the hyperbolic space.
Our method achieves superior performance across tasks compared to existing methods, highlighting the potential of hyperbolic embeddings for WSI analysis.
arXiv Detail & Related papers (2025-06-19T15:30:33Z)
- Semantic-Space-Intervened Diffusive Alignment for Visual Classification [11.621655970763467]
Cross-modal alignment is an effective approach to improving visual classification.
This paper proposes a novel Semantic-Space-Intervened Diffusive Alignment method, termed SeDA.
Experimental results show that SeDA achieves stronger cross-modal feature alignment, leading to superior performance over existing methods.
arXiv Detail & Related papers (2025-05-09T01:41:23Z)
- DeepSPG: Exploring Deep Semantic Prior Guidance for Low-light Image Enhancement with Multimodal Learning [0.0]
We propose a new deep semantic prior-guided framework (DeepSPG) based on Retinex image decomposition for low-light image enhancement (LLIE).
Our proposed DeepSPG demonstrates superior performance compared to state-of-the-art methods across five benchmark datasets.
arXiv Detail & Related papers (2025-04-27T06:56:07Z)
- Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation [44.008094698200026]
FreeDA is a training-free diffusion-augmented method for open-vocabulary semantic segmentation.
FreeDA achieves state-of-the-art performance on five datasets.
arXiv Detail & Related papers (2024-04-09T18:00:25Z)
- Leveraging Open-Vocabulary Diffusion to Camouflaged Instance Segmentation [59.78520153338878]
Text-to-image diffusion techniques have shown exceptional capability of producing high-quality images from text descriptions.
We propose a method built upon a state-of-the-art diffusion model, empowered by open-vocabulary to learn multi-scale textual-visual features for camouflaged object representations.
arXiv Detail & Related papers (2023-12-29T07:59:07Z)
- Graph Adaptive Semantic Transfer for Cross-domain Sentiment Classification [68.06496970320595]
Cross-domain sentiment classification (CDSC) aims to use the transferable semantics learned from the source domain to predict the sentiment of reviews in the unlabeled target domain.
We present the Graph Adaptive Semantic Transfer (GAST) model, an adaptive syntactic graph embedding method that is able to learn domain-invariant semantics from both word sequences and syntactic graphs.
arXiv Detail & Related papers (2022-05-18T07:47:01Z)
- Integrating Language Guidance into Vision-based Deep Metric Learning [78.18860829585182]
We propose to learn metric spaces which encode semantic similarities as embedding space distances.
These spaces should be transferable to classes beyond those seen during training.
However, relying solely on class labels causes learned embedding spaces to encode incomplete semantic context and misrepresent the semantic relation between classes.
arXiv Detail & Related papers (2022-03-16T11:06:50Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.