Related papers: SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation

URL: http://arxiv.org/abs/2507.12857v2
Date: Tue, 29 Jul 2025 04:03:49 GMT
Title: SCORE: Scene Context Matters in Open-Vocabulary Remote Sensing Instance Segmentation
Authors: Shiqi Huang, Shuting He, Huaiyuan Qin, Bihan Wen
Abstract summary: We introduce open-vocabulary learning for remote sensing instance segmentation. We propose $\textbf{SCORE}$ ($\textbf{S}$cene $\textbf{C}$ontext matters in $\textbf{O}$pen-vocabulary $\textbf{RE}$mote sensing instance segmentation). Our proposed method achieves SOTA performance, which provides a robust solution for large-scale, real-world geospatial analysis.
Score: 21.428492178304392
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Most existing remote sensing instance segmentation approaches are designed for closed-vocabulary prediction, limiting their ability to recognize novel categories or generalize across datasets. This restricts their applicability in diverse Earth observation scenarios. To address this, we introduce open-vocabulary (OV) learning for remote sensing instance segmentation. While current OV segmentation models perform well on natural image datasets, their direct application to remote sensing faces challenges such as diverse landscapes, seasonal variations, and the presence of small or ambiguous objects in aerial imagery. To overcome these challenges, we propose $\textbf{SCORE}$ ($\textbf{S}$cene $\textbf{C}$ontext matters in $\textbf{O}$pen-vocabulary $\textbf{RE}$mote sensing instance segmentation), a framework that integrates multi-granularity scene context, i.e., regional context and global context, to enhance both visual and textual representations. Specifically, we introduce Region-Aware Integration, which refines class embeddings with regional context to improve object distinguishability. Additionally, we propose Global Context Adaptation, which enriches naive text embeddings with remote sensing global context, creating a more adaptable and expressive linguistic latent space for the classifier. We establish new benchmarks for OV remote sensing instance segmentation across diverse datasets. Experimental results demonstrate that our proposed method achieves SOTA performance, providing a robust solution for large-scale, real-world geospatial analysis. Our code is available at https://github.com/HuangShiqi128/SCORE.
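
As a rough illustration of the open-vocabulary classification idea mentioned in the abstract, the minimal sketch below labels instance embeddings by cosine similarity against class-name text embeddings. It is not the SCORE implementation: the function names, the toy 3-dimensional embeddings, and the temperature value are assumptions chosen for illustration, and neither Region-Aware Integration nor Global Context Adaptation is modeled here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def classify_open_vocab(instance_emb, text_emb, class_names, temperature=0.07):
    """Assign each instance the class whose text embedding is most similar.

    instance_emb: (num_instances, dim) array of per-instance visual embeddings.
    text_emb:     (num_classes, dim) array of class-name text embeddings.
    temperature:  hypothetical softmax temperature, not taken from the paper.
    """
    # Cosine similarity between every instance and every class name.
    sims = l2_normalize(instance_emb) @ l2_normalize(text_emb).T
    logits = sims / temperature
    # Numerically stable softmax over the class dimension.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    labels = [class_names[i] for i in probs.argmax(axis=1)]
    return labels, probs


# Toy data: in practice these would come from a vision-language encoder.
class_names = ["airplane", "ship", "storage tank"]
text_emb = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
instance_emb = np.array([[0.9, 0.1, 0.0],
                         [0.1, 0.2, 0.8]])

labels, probs = classify_open_vocab(instance_emb, text_emb, class_names)
print(labels)
```

Because the class list is only an input to the function, new category names can be added at inference time by appending their text embeddings, which is the property that distinguishes open-vocabulary from closed-vocabulary prediction.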

Related papers

From Open-Vocabulary to Vocabulary-Free Semantic Segmentation [78.62232202171919]
Open-vocabulary semantic segmentation enables models to identify novel object categories beyond their training data. Current approaches still rely on manually specified class names as input, creating an inherent bottleneck in real-world applications. This work proposes a Vocabulary-Free Semantic pipeline, eliminating the need for predefined class vocabularies.
arXiv Detail & Related papers (2025-02-17T15:17:08Z)
Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges. Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z)
Relation-aware Hierarchical Prompt for Open-vocabulary Scene Graph Generation [14.82606425343802]
Open-vocabulary Scene Graph Generation (OV-SGG) overcomes the limitations of the closed-set assumption by aligning visual relationship representations with open-vocabulary textual representations. Existing OV-SGG methods are constrained by fixed text representations, limiting diversity and accuracy in image-text alignment. We propose the Relation-Aware Hierarchical Prompting (RAHP) framework, which enhances text representation by integrating subject-object and region-specific relation information.
arXiv Detail & Related papers (2024-12-26T02:12:37Z)
ZoRI: Towards Discriminative Zero-Shot Remote Sensing Instance Segmentation [23.40908829241552]
We propose a novel task called zero-shot remote sensing instance segmentation, aimed at identifying aerial objects that are absent from training data. We introduce a knowledge-injected adaptation strategy that decouples semantic-related information to preserve the pretrained vision-language alignment. We establish new experimental protocols and benchmarks, and extensive experiments demonstrate that ZoRI achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-12-17T11:00:56Z)
GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection [5.530212768657544]
We introduce glocal contrastive learning to improve the complementary learning of global and local prompts. The generalization performance of GlocalCLIP in ZSAD was demonstrated on 15 real-world datasets.
arXiv Detail & Related papers (2024-11-09T05:22:13Z)
Text-Video Retrieval with Global-Local Semantic Consistent Learning [122.15339128463715]
We propose a simple yet effective method, Global-Local Semantic Consistent Learning (GLSCL). GLSCL capitalizes on latent shared semantics across modalities for text-video retrieval. Our method achieves performance comparable to SOTA while being nearly 220 times faster in terms of computational cost.
arXiv Detail & Related papers (2024-05-21T11:59:36Z)
Aligning and Prompting Everything All at Once for Universal Visual Perception [79.96124061108728]
APE is a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks. APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection. Experiments on over 160 datasets demonstrate that APE outperforms state-of-the-art models.
arXiv Detail & Related papers (2023-12-04T18:59:50Z)
Open-Vocabulary Camouflaged Object Segmentation [66.94945066779988]
We introduce a new task, open-vocabulary camouflaged object segmentation (OVCOS). We construct a large-scale complex scene dataset (OVCamo) containing 11,483 hand-selected images with fine annotations and corresponding object classes. By integrating the guidance of class semantic knowledge and the supplement of visual structure cues from the edge and depth information, the proposed method can efficiently capture camouflaged objects.
arXiv Detail & Related papers (2023-11-19T06:00:39Z)
TextFormer: A Query-based End-to-End Text Spotter with Mixed Supervision [61.186488081379]
We propose TextFormer, a query-based end-to-end text spotter with Transformer architecture. TextFormer builds upon an image encoder and a text decoder to learn a joint semantic understanding for multi-task modeling. It allows for mutual training and optimization of classification, segmentation, and recognition branches, resulting in deeper feature sharing.
arXiv Detail & Related papers (2023-06-06T03:37:41Z)
Text2Seg: Remote Sensing Image Semantic Segmentation via Text-Guided Visual Foundation Models [7.452422412106768]
We propose a novel method named Text2Seg for remote sensing semantic segmentation. It overcomes the dependency on extensive annotations by employing an automatic prompt generation process. We show that Text2Seg significantly improves zero-shot prediction performance compared to the vanilla SAM model.
arXiv Detail & Related papers (2023-04-20T18:39:41Z)
Vision Transformers: From Semantic Segmentation to Dense Prediction [139.15562023284187]
We explore the global context learning potentials of vision transformers (ViTs) for dense visual prediction. Our motivation is that through learning global context at full receptive field layer by layer, ViTs may capture stronger long-range dependency information. We formulate a family of Hierarchical Local-Global (HLG) Transformers, characterized by local attention within windows and global-attention across windows in a pyramidal architecture.
arXiv Detail & Related papers (2022-07-19T15:49:35Z)
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval [66.2075707179047]
We propose a novel mixture-of-expert transformer RoME that disentangles the text and the video into three levels. We utilize a transformer-based attention mechanism to fully exploit visual and text embeddings at both global and local levels. Our method outperforms the state-of-the-art methods on the YouCook2 and MSR-VTT datasets.
arXiv Detail & Related papers (2022-06-26T11:12:49Z)
Learning to Aggregate Multi-Scale Context for Instance Segmentation in Remote Sensing Images [28.560068780733342]
A novel context aggregation network (CATNet) is proposed to improve the feature extraction process. The proposed model exploits three lightweight plug-and-play modules, namely dense feature pyramid network (DenseFPN), spatial context pyramid (SCP), and hierarchical region of interest extractor (HRoIE).
arXiv Detail & Related papers (2021-11-22T08:55:25Z)
SceneEncoder: Scene-Aware Semantic Segmentation of Point Clouds with A Learnable Scene Descriptor [51.298760338410624]
We propose a SceneEncoder module to impose a scene-aware guidance to enhance the effect of global information. The module predicts a scene descriptor, which learns to represent the categories of objects existing in the scene. We also design a region similarity loss to propagate distinguishing features to their own neighboring points with the same label.
arXiv Detail & Related papers (2020-01-24T16:53:30Z)

This list is automatically generated from the titles and abstracts of the papers in this site.